Zbornik 27. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2024 Zvezek A Proceedings of the 27th International Multiconference INFORMATION SOCIETY – IS 2024 Volume A Slovenska konferenca o umetni inteligenci Slovenian Conference on Artificial Intelligence Uredniki / Editors Mitja Luštrek, Matjaž Gams, Rok Piltaver http://is.ijs.si 10.–11. oktober 2024 / 10–11 October 2024 Ljubljana, Slovenia Uredniki: Mitja Luštrek Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana Matjaž Gams Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana Rok Piltaver Outfit7, Ljubljana Založnik: Institut »Jožef Stefan«, Ljubljana Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak Oblikovanje naslovnice: Vesna Lasič Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety Ljubljana, oktober 2024 Informacijska družba ISSN 2630-371X Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani COBISS.SI-ID 214409987 ISBN 978-961-264-299-0 (PDF) PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2024 Leto 2024 je hkrati udarno in tradicionalno. Že sedaj, še bolj pa v prihodnosti bosta računalništvo, informatika (RI) in umetna inteligenca (UI) igrali ključno vlogo pri oblikovanju napredne in trajnostne družbe. Smo na pragu nove dobe, v kateri generativna umetna inteligenca, kot je ChatGPT, in drugi inovativni pristopi utirajo pot k superinteligenci in singularnosti, ključnim elementom, ki bodo definirali razcvet človeške civilizacije. Naša konferenca je zato hkrati tradicionalna znanstvena, pa tudi povsem akademsko odprta za nove pogumne ideje, inkubator novih pogledov in idej. Letošnja konferenca ne le da analizira področja RI, temveč prinaša tudi osrednje razprave o perečih temah današnjega časa – ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za skoraj vse izzive, s katerimi se soočamo, kar poudarja pomen sodelovanja med strokovnjaki, raziskovalci in odločevalci, da bi skupaj oblikovali strategije za prihodnost. Zavedamo se, da živimo v času velikih sprememb, kjer je ključno, da s poglobljenim znanjem in inovativnimi pristopi oblikujemo informacijsko družbo, ki bo varna, vključujoča in trajnostna. Letos smo ponosni, da smo v okviru multikonference združili dvanajst izjemnih konferenc, ki odražajo širino in globino informacijskih ved: CHATMED v zdravstvu, Demografske in družinske analize, Digitalna preobrazba zdravstvene nege, Digitalna vključenost v informacijski družbi – DIGIN 2024, Kognitivna znanost, Konferenca o zdravi dolgoživosti, Legende računalništva in informatike, Mednarodna konferenca o prenosu tehnologij, Miti in resnice o varovanju okolja, Odkrivanje znanja in podatkovna skladišča – SIKDD 2024, Slovenska konferenca o umetni inteligenci, Vzgoja in izobraževanje v RI. Poleg referatov bodo razprave na okroglih mizah in delavnicah omogočile poglobljeno izmenjavo mnenj, ki bo oblikovala prihodnjo informacijsko družbo. “Legende računalništva in informatike” predstavljajo slovenski “Hall of Fame” za odlične posameznike s tega področja, razširjeni referati, objavljeni v reviji Informatica z 48-letno tradicijo odličnosti, in sodelovanje s številnimi akademskimi institucijami in združenji, kot so ACM Slovenija, SLAIS in Inženirska akademija Slovenije, bodo še naprej spodbujali razvoj informacijske družbe. Skupaj bomo gradili temelje za prihodnost, ki bo oblikovana s tehnologijami, osredotočena na človeka in njegove potrebe. 
S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna RI stroka vsako leto opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Borut Žalik. Priznanje za dosežek leta pripada prof. dr. Sašu Džeroskemu za izjemne raziskovalne dosežke. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela nabava in razdeljevanje osebnih računalnikov ministrstva, »informacijsko jagodo« kot najboljšo potezo pa so prejeli organizatorji tekmovanja ACM Slovenija. Čestitke nagrajencem! Naša vizija je jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki bo koristila vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek k tej viziji in se veselimo prihodnjih dosežkov, ki jih bo oblikovala ta konferenca. Mojca Ciglarič, predsednica programskega odbora; Matjaž Gams, predsednik organizacijskega odbora

PREFACE TO THE MULTICONFERENCE INFORMATION SOCIETY 2024

The year 2024 is both ground-breaking and traditional. Now, and even more so in the future, computer science, informatics (CS/I), and artificial intelligence (AI) will play a crucial role in shaping an advanced and sustainable society. We are on the brink of a new era where generative artificial intelligence, such as ChatGPT, and other innovative approaches are paving the way for superintelligence and singularity—key elements that will define the flourishing of human civilization. Our conference is therefore both a traditional scientific gathering and an academically open incubator for bold new ideas and perspectives. This year's conference analyzes key CS/I areas and brings forward central discussions on pressing contemporary issues—environmental preservation, demographic challenges, healthcare, and the transformation of social structures. AI development offers solutions to nearly all challenges we face, emphasizing the importance of collaboration between experts, researchers, and policymakers to shape future strategies collectively. We recognize that we live in times of significant change, where it is crucial to build an information society that is safe, inclusive, and sustainable, through deep knowledge and innovative approaches. This year, we are proud to have brought together twelve exceptional conferences within the multiconference framework, reflecting the breadth and depth of information sciences: • CHATMED in Healthcare • Demographic and Family Analyses • Digital Transformation of Healthcare Nursing • Digital Inclusion in the Information Society – DIGIN 2024 • Cognitive Science • Conference on Healthy Longevity • Legends of Computer Science and Informatics • International Conference on Technology Transfer • Myths and Facts on Environmental Protection • Data Mining and Data Warehouses – SIKDD 2024 • Slovenian Conference on Artificial Intelligence • Education and Training in CS/IS. In addition to papers, roundtable discussions and workshops will facilitate in-depth exchanges that will help shape the future information society. The "Legends of Computer Science and Informatics" represents Slovenia's "Hall of Fame" for outstanding individuals in this field.
At the same time, extended papers published in the Informatica journal, with over 48 years of excellence, and collaboration with numerous academic institutions and associations, such as ACM Slovenia, SLAIS, and the Slovenian Academy of Engineering, will continue to foster the development of the information society. Together, we will build the foundation for a future shaped by technology, yet focused on human needs. The autonomous CS/IS community annually recognizes the most outstanding achievements through the awards ceremony. The Michie-Turing Award for an exceptional lifetime contribution to the development and promotion of the information society was awarded to Prof. Dr. Borut Žalik. The Achievement of the Year Award goes to Prof. Dr. Sašo Džeroski. The "Information Lemon" for the least appropriate information topic was given to the ministry's procurement and distribution of personal computers. At the same time, the "Information Strawberry" for the best initiative was awarded to the organizers of the ACM Slovenia competition. Congratulations to all the award winners! Our vision is clear: to recognize, seize, and shape the opportunities brought by digital transformation and create an information society that benefits all its members. We thank all participants for their contributions and look forward to this conference's future achievements. Mojca Ciglarič, Chair of the Program Committee; Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic (South Africa), Heiner Benking (Germany), Se Woo Cheon (South Korea), Howie Firth (UK), Olga Fomichova (Russia), Vladimir Fomichov (Russia), Vesna Hljuz Dobric (Croatia), Alfred Inselberg (Israel), Jay Liebowitz (USA), Huan Liu (Singapore), Henz Martin (Germany), Marcin Paprzycki (USA), Claude Sammut (Australia), Jiri Wiedermann (Czech Republic), Xindong Wu (USA), Yiming Ye (USA), Ning Zhong (USA), Wray Buntine (Australia), Bezalel Gavish (USA), Gal A. Kaminka (Israel), Mike Bain (Australia), Michela Milano (Italy), Derong Liu (Chicago, USA), Toby Walsh (Australia), Sergio Campos-Cordobes (Spain), Shabnam Farahmand (Finland), Sergio Crovella (Italy)

Organizing Committee: Matjaž Gams (chair), Mitja Luštrek, Lana Zemljak, Vesna Koricki, Mitja Lasič, Blaž Mahnič

Programme Committee: Mojca Ciglarič (chair), Marjan Heričko, Baldomir Zajc, Bojan Orel, Borka Jerman Blažič Džonova, Blaž Zupan, Franc Solina, Gorazd Kandus, Boris Žemva, Viljan Mahnič, Urban Kordeš, Leon Žlajpah, Cene Bavec, Marjan Krisper, Niko Zimic, Tomaž Kalin, Andrej Kuščer, Rok Piltaver, Jozsef Györkös, Jadran Lenarčič, Toma Strle, Tadej Bajd, Borut Likar, Tine Kolenik, Jaroslav Berce, Janez Malačič, Franci Pivec, Mojca Bernik, Olga Markič, Uroš Rajkovič, Marko Bohanec, Dunja Mladenič, Borut Batagelj, Ivan Bratko, Franc Novak, Tomaž Ogrin, Andrej Brodnik, Vladislav Rajkovič, Aleš Ude, Dušan Caf, Grega Repovš, Bojan Blažica, Saša Divjak, Ivan Rozman, Matjaž Kljun, Tomaž Erjavec, Niko Schlamberger, Robert Blatnik, Bogdan Filipič, Stanko Strmčnik, Erik Dovgan, Andrej Gams, Jurij Šilc, Špela Stres, Matjaž Gams, Jurij Tasič, Anton Gradišek, Mitja Luštrek, Denis Trček, Marko Grobelnik, Andrej Ule, Nikola Guid, Boštjan Vilfan

KAZALO / TABLE OF CONTENTS Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence ................ 1 PREDGOVOR / FOREWORD ............................................................................................................................... 3 PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ...............................................................................
5 PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications / Kuzman Taja, Pavleska Tanja, Rupnik Urban, Cigoj Primož .............................................................................................................................. 7 Choosing Features for Stress Prediction with Machine Learning / Bengeri Katja, Lukan Junoš, Luštrek Mitja . 11 Predictive Modeling of Football Results in the WWIN League of Bosnia and Herzegovina / Vladić Ervin, Mehanović Dželila, Avdić Elma ...................................................................................................................... 15 Sarcasm Detection in a Less-Resourced Language / Đoković Lazar, Robnik-Šikonja Marko ............................ 19 Speech-to-Service: Using LLMs to Facilitate Recording of Services in Healthcare / Smerkol Maj, Susič Rok, Ratajec Mariša, Halbwachs Helena, Gradišek Anton ...................................................................................... 23 Performance Comparison of Axle Weight Prediction Algorithms on Time-Series Data / Kolar Žiga, Susič David, Konečnik Martin, Prestor Domen, Pejanovič Nosaka Tomo, Kulauzović Bajko, Kalin Jan, Skobir Matjaž, Gams Matjaž ....................................................................................................................................... 27 Comparison of Feature- and Embedding-based Approaches for Audio and Visual Emotion Classification / Trojer Sebastijan, Anžur Zoja, Luštrek Mitja, Slapničar Gašper ..................................................................... 31 Multi-modal Data Collection and Preliminary Statistical Analysis for Cognitive Load Assessment / Krstevska Ana, Kramar Sebastjan, Gjoreski Hristijan, Gjoreski Martin, Lukan Junoš, Trojer Sebastijan, Luštrek Mitja, Slapničar Gašper .............................................................................................................................................. 35 Predicting Health-Related Absenteeism with Machine Learning: A Case Study / Piciga Aleksander, Kukar Matjaž ............................................................................................................................................................... 39 Puzzle Generation for Ultimate-Tic-Tac-Toe / Zirkelbach Maj, Sadikov Aleksander ......................................... 43 Ethical Consideration and Sociological Challenges in the Integration of Artificial Intelligence in Mental Health Services / Poljak Lukek Saša........................................................................................................................... 47 Optimization Problem Inspector: A Tool for Analysis of Industrial Optimization Problems and Their Solutions / Tušar Tea, Cork Jordan, Andova Andrejaana, Filipič Bogdan ........................................................................ 51 Multi-Agent System for Autonomous Table Football: A Winning Strategy / Založnik Marcel, Šoln Kristjan ... 55 Towards a Decision Support System for Project Planning: Multi-Criteria Evaluation of Past Projects Success / Hafner Miha, Bohanec Marko .......................................................................................................................... 59 Minimizing Costs and Risks in Demand Response Optimization: Insights from Initial Experiments / Nedić Mila, Tušar Tea ................................................................................................................................................ 
63 Predicting Hydrogen Adsorption Energies on Platinum Nanoparticles and Surfaces With Machine Learning / Gašparič Lea, Kokalj Anton, Džeroski Sašo .................................................................................................... 67 SmartCHANGE Risk Prediction Tool: Demonstrating Risk Assessment for Children and Youth / Jordan Marko, Reščič Nina, Kramar Sebastjan, Založnik Marcel, Luštrek Mitja ....................................................... 71 Predicting Mental States During VR Sessions Using Sensor Data and Machine Learning / Kizhevska Emilija, Luštrek Mitja .................................................................................................................................................... 75 Biomarker Prediction in Colorectal Cancer Using Multiple Instance Learning / Shulajkovska Miljana, Jelenc Matej, Jonnagaddala Jitenndra, Gradišek Anton .............................................................................................. 79 Feature-Based Emotion Classification Using Eye-Tracking Data / Božak Tomi, Luštrek Mitja, Slapničar Gašper .......................................................................................................................................................................... 83 Indeks avtorjev / Author index ................................................................................................................... 87 v vi Zbornik 27. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2024 Zvezek A Proceedings of the 27th International Multiconference INFORMATION SOCIETY – IS 2024 Volume A Slovenska konferenca o umetni inteligenci Slovenian Conference on Artificial Intelligence Uredniki / Editors Mitja Luštrek, Matjaž Gams, Rok Piltaver http://is.ijs.si 10.–11. oktober 2024 / 10–11 October 2024 Ljubljana, Slovenia 1 2 PREDGOVOR Umetna inteligenca doživlja neverjeten in pospešen razvoj, ko se po tričetrt stoletja, ko je Alan Mathison Turing postavil temelje računalništva in umetne inteligence, končno približuje ne le človeški inteligenci, temveč tudi drugim ključnim človeškim lastnostim, kot sta ustvarjalnost, čustvena inteligenca in zavest. Na številnih področjih umetna inteligenca že presega zmogljivosti večine ljudi in celo strokovnjakov. Veliki jezikovni modeli dosegajo tovrstne rezultate tudi pri dosti manj strukturiranih problemih, kot je bilo predstavljivo pred nekaj leti, npr. pri strokovnih izpitih ter besedilnih nalogah iz matematike in programiranja. Generativna umetna inteligenca že zdaj spreminja svet. Postala je nepogrešljivo orodje v poslovnem svetu, raziskavah in vsakdanjem življenju, saj omogoča pisanje besedil, ustvarjanje kode, generiranje slik in reševanje kompleksnih problemov. Možno je celo, da smo priča začetkom singularnosti – prelomnega trenutka, ko bo umetna inteligenca presegla človeško inteligenco in omogočila revolucijo na področju produktivnosti in inovacij, čeprav bo treba na sodbo o tem še počakati. Optimizem glede prihodnosti je utemeljen: če se bo razvoj nadaljeval s trenutnim tempom, si lahko predstavljamo svet, kjer bo umetna inteligenca povsem preoblikovala gospodarstvo, znanost in način življenja, pri čemer bo omogočila višjo kakovost življenja za vse. Čeprav nekateri umetno inteligenco vidijo kot grožnjo, njen trenutni razmah resnejših težav še ni prinesel. Nadejamo se, da bo zadosten del raziskav usmerjen v varnost umetne inteligence, da bo tako ostalo. 
Z morebitnimi škodljivimi učinki umetne inteligence se spopadajo tudi regulatorji, za katere upamo, da bodo uspešno krmarili med tem ciljem in pretiranim zaviranjem razvoja. Dostopnost velikih jezikovnih modelov, kot so GPT-ji, pomeni, da so naloge, ki zahtevajo razumevanje in generiranje naravnega jezika, lažje kot kadar koli prej. Mnogi raziskovalci verjamejo, da bo prihodnost programiranja prešla iz tradicionalnih jezikov, kot je Python, na velike jezikovne modele, kjer bo umetna inteligenca generirala kodo in rešitve po meri. Čeprav je razvoj teh modelov zahtevna naloga, ki presega zmožnosti večine organizacij, se ljudje navajamo na uporabo tega fenomenalnega orodja. Pričakujemo, da bo umetna inteligenca postala učinkovit in zanesljiv partner človeštva. Že letos vidimo, da so konference v sklopu Informacijske družbe posvečene prav velikim jezikovnim modelom. V okviru Slovenske konference o umetni inteligenci organiziramo formalno debato dijakov – izkušenih debaterjev, ki se udeležujejo mednarodnih tekmovanj – o tem, kako bo umetna inteligenca oblikovala prihodnost in zakaj bi to lahko bila najboljša prihodnost doslej. Matjaž Gams Mitja Luštrek Rok Piltaver predsedniki Slovenske konference o umetni inteligenci 3 FOREWORD Artificial intelligence is experiencing incredible and accelerated development. After three-quarters of a century since Alan Mathison Turing laid the foundations of computing and artificial intelligence, it is finally approaching not only human intelligence but also other key human traits such as creativity, emotional intelligence and consciousness. In many areas, artificial intelligence already surpasses the capabilities of most people and even experts. Large language models are achieving such results even in much less structured problems than was imaginable a few years ago, such as professional exams, and mathematics and programming tasks described in free text. Generative artificial intelligence is already transforming the world. It has become an indispensable tool in the business world, research, and everyday life, enabling text writing, code generation, image creation, and solving complex problems. It is even possible that we are witnessing the beginnings of the singularity—the pivotal moment when artificial intelligence will surpass human intelligence and enable a revolution in productivity and innovation, although time will show whether this is actually the case. Optimism about the future is well-founded: if development continues at its current pace, we can imagine a world where artificial intelligence completely transforms the economy, science, and way of life, leading to a higher quality of life for all. Although some see artificial intelligence as a threat, its current rapid progress has not yet led to serious problems. We hope that a sufficient part of the research will be directed towards AI safety so that this remains the case. Regulators are also addressing the potential harmful effects of artificial intelligence, and we hope they will successfully navigate between this goal and excessive hindering of development. The accessibility of large language models, such as GPTs, means that tasks requiring the understanding and generation of natural language are easier than ever before. Many researchers believe that the future of programming will shift from traditional languages, like Python, to large language models, where artificial intelligence will generate custom code and solutions. 
Although developing these models is a challenging task beyond the capabilities of most organizations, people are getting accustomed to using this phenomenal tool. We expect artificial intelligence to become an effective and reliable partner for humanity. Already this year, we are seeing conferences within the framework of the Information Society dedicated to large language models. As part of the Slovenian Conference on Artificial Intelligence, we are organizing a formal debate for high school students—experienced debaters who participate in international competitions—on how artificial intelligence will shape the future and why this might be the best future yet. Matjaž Gams, Mitja Luštrek, Rok Piltaver, Slovenian Conference on Artificial Intelligence chairs

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE: Mitja Luštrek, Matjaž Gams, Rok Piltaver, Cene Bavec, Marko Bohanec, Marko Bonač, Ivan Bratko, Bojan Cestnik, Aleš Dobnikar, Erik Dovgan, Bogdan Filipič, Borka Jerman Blažič, Marjan Krisper, Marjan Mernik, Biljana Mileva Boshkoska, Vladislav Rajkovič, Niko Schlamberger, Tomaž Seljak, Peter Stanovnik, Damjan Strnad, Miha Štajdohar, Vasja Vehovar

PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications

Taja Kuzman, Urban Rupnik, Tanja Pavleska, Primož Cigoj
PC7, d.o.o., Ljubljana, Slovenia; Jožef Stefan Institute, Ljubljana, Slovenia
{taja,tanja}@pc7.io, {urban,primoz}@pc7.io
DOI: https://doi.org/10.70314/is.2024.scai.538

Abstract: Retrieval-augmented generation (RAG) is a recent method for enriching the large language models' text generation abilities with external knowledge through document retrieval. Due to its high usefulness for various applications, it already powers multiple products. However, despite the widespread adoption, there is a notable lack of evaluation benchmarks for RAG systems, particularly for less-resourced languages. This paper introduces PandaChat-RAG – the first Slovenian RAG benchmark, established on a newly developed test dataset. The test dataset is based on the semi-automatic extraction of authentic questions and answers from a genre-annotated web corpus. The methodology for the test dataset construction can be efficiently applied to any of the comparable corpora in numerous European languages. The test dataset is used to assess the RAG system's performance in retrieving relevant sources essential for providing accurate answers to the given questions. The evaluation involves comparing the performance of eight open- and closed-source embedding models, and investigating how the retrieval performance is influenced by factors such as the document chunk size and the number of retrieved sources. These findings contribute to establishing guidelines for optimal RAG system configurations, not only for Slovenian but also for other languages.

Keywords: retrieval-augmented generation, RAG, embedding models, large language models, LLMs, benchmark, Slovenian
1 Introduction

The advent of large language models (LLMs) has introduced significant advancements in the field of natural language processing (NLP). Although LLMs have shown impressive capabilities in generating coherent text, they are prone to hallucinations [7, 16], i.e., providing false information. Furthermore, they are dependent on static and potentially outdated corpora [9]. Retrieval-augmented generation (RAG) is a method devised to address these challenges by augmenting LLMs with external information retrieved from a provided document collection. Connecting LLMs with a relevant database improves the factual accuracy and temporal relevance of the generated responses. Moreover, RAG contributes to the explainability of the generated answers by providing verifiable sources, which facilitates the evaluation of the system's accuracy [2]. These advantages have spurred quick adoption of RAG systems across various applications. For instance, PandaChat (https://pandachat.ai/) leverages RAG to provide explainable responses with high accuracy in Slovenian and other languages, integrated in customer service bots and platforms that allow LLM-based retrieval of information from texts.

Although RAG benchmarking is a relatively recent endeavor, some initial frameworks have already emerged [3, 5, 7]. However, these benchmarks are limited to English and Chinese, leaving a gap in the evaluation of RAG systems for other languages. To address this gap, we make the following contributions:
• We present the first benchmark for RAG systems for the Slovenian language. The benchmark is based on the newly developed PandaChat-RAG-sl test dataset, which comprises authentic questions, answers and source texts. The benchmark and its test dataset are openly available at https://github.com/TajaKuzman/pandachat-rag-benchmark.
• We introduce a methodology for an efficient semi-automated development of RAG test datasets that is easily replicable for the languages included in the MaCoCu [1] and CLASSLA-web [10] corpora collections, which include all South Slavic languages, Albanian, Catalan, Greek, Icelandic, Maltese, Ukrainian and Turkish.
• As the first step of RAG evaluation, we evaluate the retriever's performance in terms of its ability to provide relevant sources crucial for retrieving accurate answers to the posed questions. The evaluation encompasses a comparison of the performance of several open- and closed-source embedding models. Furthermore, we provide insights on the impact of the document chunk size and the number of retrieved sources, to identify optimal configurations of the indexing and retrieval components for robust and accurate retrieval.

The paper is organized as follows: in Section 2, we review previous research concerning the evaluation of RAG systems; Section 3 introduces the PandaChat-RAG-sl dataset (Section 3.1) and the RAG system architecture (Section 3.2), which is evaluated in Section 4. Finally, in Section 5, we conclude the paper with a discussion of the main findings and suggestions for future work.

2 Related Work

Despite the recent introduction of the RAG architecture, several benchmarking initiatives have already emerged [3, 5, 7, 15]. However, since RAG systems can be applied to various end tasks, the benchmarks focus on different aspects of these systems. Inter alia, current benchmarks assess their performance in text citation [7], text continuation, question-answering with support of external knowledge, hallucination modification, and multi-document summarization [12]. The closest to our work is the evaluation of RAG systems on the task of Attributable Question Answering [2]. This task involves providing a question as input to the system, which then generates both an answer and an attribution, indicating the source text on which the answer is based. The advantage of this task over the closed-book question-answering task is that it also measures the system's capability to provide the correct source.

The majority of RAG benchmarks assess RAG systems in English [3, 5, 7, 15] or Chinese [5, 12]. Consequently, the generalizability of their findings to other languages remains uncertain. Furthermore, a limitation of many benchmarks is their reliance on synthetic data generated by LLMs [5, 12, 15]. To avoid potential biases introduced by LLMs and to better represent the complexity and diversity of real-world language use, a more reliable evaluation would be based on non-synthetic test datasets. Despite focusing on a different task, recent research [6] has shown that resource-efficient development of non-synthetic and non-machine-translated question-answering datasets is feasible by leveraging the availability of general web corpora and genre classifiers.
3 Methodology

3.1 PandaChat-RAG-sl Dataset

The PandaChat-RAG-sl dataset comprises questions, answers, and the corresponding source texts that encompass the answers. It was created through a semi-automated process involving the extraction of texts from the Slovenian web corpus CLASSLA-web.sl 1.0 [11], followed by a manual extraction of high-quality instances. Since the texts were automatically extracted from a general text collection, the dataset encompasses a diverse range of topics that were not predefined or decided upon.

The CLASSLA-web.sl 1.0 corpus is a collection of texts collected from the web in 2021 and 2022 [10]. It was chosen due to its numerous advantages: 1) it has high-quality content, with the majority of texts meeting the criteria for publishable quality [17]; 2) it is one of the largest and most up-to-date collections of Slovenian texts, comprising approximately 4 million texts; 3) the texts are enriched with genre labels, facilitating genre-based text selection; and 4) it is developed in the same manner as 6 other CLASSLA-web corpora [10] and 7 additional MaCoCu web corpora in various European languages [1]. This enables easy expansion of the benchmark to other languages, including all South Slavic languages and various European languages, such as Albanian, Catalan, Greek, Icelandic, Ukrainian and Turkish.

The development of the PandaChat-RAG-sl dataset involves the following steps: 1) the genre-based selection of texts from the CLASSLA-web.sl corpus; 2) the extraction of texts that comprise paragraphs ending with a question (80,215 texts); 3) the extraction of questions and answers (paragraphs following the question); and 4) a manual review process to identify high-quality instances. In the genre-based selection phase, we extract texts labeled with genres that are most likely to contain objective questions and answers, that is, Information/Explanation, Instruction and Legal. In its present iteration, the dataset consists of 206 instances derived from the first 1,800 extracted texts. It is important to note that this effort can easily be continued with further manual inspection of the extracted texts, should there be a need to prepare a larger dataset.

Table 1 provides a statistical overview of the PandaChat-RAG-sl dataset. The dataset consists of 206 instances, that is, triplets of a question, an answer and a source text, derived from 160 texts. The total size of the dataset is 84,651 words, encompassing both the questions and the texts containing the answers.

Table 1: Statistics for the PandaChat-RAG-sl dataset.
    Instances: 206
    Unique texts: 160
    Words (questions): 1,184
    Words (texts w/o questions): 83,467
    Total words (questions + texts): 84,651
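To make the extraction steps concrete, the following is a minimal sketch of the question-answer mining heuristic described above. It assumes the corpus has already been loaded into an iterable of documents carrying a genre label and a list of paragraph strings; the actual CLASSLA-web.sl distribution format and loading code are not shown, and the helper names are hypothetical.

    # Minimal sketch of steps 1-3 of the dataset construction (step 4,
    # the manual review, is done by hand). `corpus` is assumed to be an
    # iterable of dicts with "genre" and "paragraphs" keys.
    TARGET_GENRES = {"Information/Explanation", "Instruction", "Legal"}

    def extract_candidates(corpus):
        """Yield (question, answer, source_text) triplets for manual review."""
        for doc in corpus:
            if doc["genre"] not in TARGET_GENRES:
                continue  # step 1: genre-based selection
            pars = doc["paragraphs"]
            for i, par in enumerate(pars[:-1]):
                # steps 2-3: a paragraph ending with a question mark is the
                # question; the paragraph that follows it is the answer.
                if par.rstrip().endswith("?"):
                    yield par.strip(), pars[i + 1].strip(), "\n".join(pars)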
3.2 RAG System

The RAG pipeline encompasses three main components: indexing, retrieval, and text generation. During the indexing phase, the user-provided text collection is transformed into a database of numerical vectors (embeddings) to facilitate document retrieval by the retriever. This process involves segmenting the documents into fixed-length chunks, which are then converted into embeddings using large language models. The choice of the embedding model and the chunk size are critical factors that can significantly impact the retrieval performance of the model. Selecting an appropriate embedding model is essential to ensure that the textual information is converted into a meaningful numerical representation for effective retrieval. Moreover, the chunk size, in terms of the number of tokens, plays a crucial role in determining the informativeness of the embeddings. Incorrect chunk sizes may lead to numerical vectors that lack important information necessary for connecting the question to the corresponding text chunk, thereby compromising retrieval accuracy [12].

When presented with a question, the retrieval component uses semantic search (also known as dense retrieval) to retrieve the most relevant text chunks. The search is based on determining the smallest cosine distance between the chunk vectors and the question vector. Lastly, during the text generation phase, the retriever provides the large language model (LLM) with a selection of top retrieved sources. The LLM is prompted to provide a human-like answer to the provided question based on the retrieved text sources. The selection of an appropriate number of top retrieved sources is crucial in this phase: including more than just one retrieved source may enhance retrieval accuracy and address situations where the first retrieved source fails to encompass all relevant information, especially when more texts cover the same subject matter. However, increasing the number of sources also leads to a longer prompt provided to the LLM, potentially increasing the costs of using the RAG system.

In this study, we assess the indexing and retrieval components, focusing on the impact of different embedding models, chunk sizes, and the number of retrieved sources on retrieval performance.
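The pipeline just described can be illustrated with a short, self-contained sketch. This is not the PandaChat-RAG implementation: chunking is simplified to fixed-size word windows rather than token windows, and one of the evaluated open-source models (mE5-small) is loaded through the sentence-transformers library as an assumed stand-in.

    # Illustrative indexing + dense retrieval, assuming sentence-transformers
    # is installed. E5 models expect "passage: "/"query: " prefixes; with
    # normalized embeddings, cosine similarity reduces to a dot product.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-small")

    def chunk(text, size=128, overlap=20):
        words = text.split()
        step = size - overlap
        return [" ".join(words[i:i + size])
                for i in range(0, max(len(words) - overlap, 1), step)]

    def build_index(docs):
        chunks = [c for d in docs for c in chunk(d)]
        emb = model.encode([f"passage: {c}" for c in chunks],
                           normalize_embeddings=True)
        return chunks, emb

    def retrieve(question, chunks, emb, top_k=2):
        q = model.encode([f"query: {question}"], normalize_embeddings=True)[0]
        best = np.argsort(emb @ q)[::-1][:top_k]  # highest cosine similarity
        return [chunks[i] for i in best]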
Embedding Models. The evaluation includes a range of multilingual open-source and closed-source models. The selection of open-source models is based on the Massive Text Embedding Benchmark (MTEB) Leaderboard (https://huggingface.co/spaces/mteb/leaderboard) [13]. Specifically, we choose medium-sized multilingual models with up to 600 million parameters that have demonstrated strong performance on Polish and Russian – Slavic languages that are linguistically related to Slovenian. The models used in the evaluation are:
• Closed-source embedding models provided by OpenAI: an older model text-embedding-ada-002 (OpenAI-Ada) [8], and two recently published models: text-embedding-3-small (OpenAI-3-small) and text-embedding-3-large (OpenAI-3-large) [14].
• Open-source embedding models, available on the Hugging Face repository: the BGE-M3 model [4], the base-sized mGTE model (mGTE-base) [19], and the small (mE5-small), base (mE5-base) and large (mE5-large) sizes of the Multilingual E5 model [18].

Chunk size. The impact of the chunk size on retrieval performance is assessed by varying chunk sizes of 128, 256, 512, and 1024 tokens, with a default chunk overlap of 20 tokens. In these experiments, the performance is evaluated based on the first retrieved source.

Number of retrieved sources. Previous work indicates that increasing the number of retrieved sources improves the retrieval accuracy [12]. In this study, we examine the retrieval accuracy of embedding models, with the chunk size set to 128 tokens, when the models retrieve 1 to 5 sources. In this scenario, if any of the multiple retrieved sources matches the correct source, the output is evaluated as correct.

The retrieval capabilities of the RAG system are evaluated on the task of Attributed Question-Answering. The evaluation is based on accuracy, measured as the percentage of questions correctly matched with the relevant source. The experiments are performed using the LlamaIndex library (https://www.llamaindex.ai/). The chunk size is defined using the SentenceSplitter method in the indexing phase. The number of retrieved sources (similarity top k), the embedding model and the prompt for the LLM are specified as parameters of the chat engine. The closed-source embedding models are used via the OpenAI API, while the experiments with the open-source models are conducted on a GPU machine.
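As a rough guide to how such a setup can be wired together, the sketch below configures LlamaIndex with one of the evaluated open-source embedding models and scores retrieval accuracy at k: a question counts as correct if any of the top-k retrieved chunks comes from its gold source text. The import paths follow the llama-index 0.10+ package layout and may differ between versions; the dataset field names are assumptions mirroring the benchmark triplets, not the paper's actual code.

    from llama_index.core import Document, Settings, VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    # Indexing configuration: embedding model and 128-token chunks.
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
    Settings.node_parser = SentenceSplitter(chunk_size=128, chunk_overlap=20)

    def retrieval_accuracy(dataset, top_k=2):
        """dataset: list of dicts with 'question', 'source_id', 'source_text'."""
        docs = [Document(text=ex["source_text"], metadata={"id": ex["source_id"]})
                for ex in dataset]
        retriever = VectorStoreIndex.from_documents(docs).as_retriever(
            similarity_top_k=top_k)
        hits = sum(
            any(n.node.metadata["id"] == ex["source_id"]
                for n in retriever.retrieve(ex["question"]))
            for ex in dataset)
        return 100 * hits / len(dataset)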
4 Experiments and Results

In this section, we present the results of the experiments examining the impact of the chunk size, the number of retrieved sources, and the selection of the embedding model on the retrieval performance of the RAG system.

4.1 Chunk Size

Figure 1 shows the impact of the chunk size on the retrieval performance of RAG systems based on different embedding models. The findings suggest that, with the exception of the OpenAI-Ada model, all systems demonstrate the best performance when the text chunk size is set to 128 tokens. Increasing the chunk size hinders the retrieval performance, which is consistent with previous research [12]. These results confirm that smaller chunk sizes enable the embedding models to capture finer details that are essential for retrieving the most relevant text for the given question.

[Figure 1: The impact of the chunk size on the retrieval performance.]

4.2 Number of Retrieved Sources

Figure 2 shows the performance of the RAG systems when increasing the number of retrieved sources. The results demonstrate that increasing the number of retrieved sources initially improves the performance; however, after a certain threshold, the performance levels off. Increasing the number of retrieved sources results in larger inputs to the LLM in the text generation component, incurring higher costs. Using more than two retrieved sources does not significantly improve results in most systems. What is more, with the top two retrieved sources, certain embedding models, namely BGE-M3 and mE5-large, already reach perfect accuracy. Thus, our findings indicate that using more than the top two retrieved sources is unnecessary.

[Figure 2: Impact of the number of retrieved sources on the retrieval performance.]

4.3 Embedding Models

We provide a final comparison of the performance of systems that use different embedding models. We use the parameters that were shown to provide the best results in the previous experiments: a chunk size of 128 tokens and the top two retrieved sources. As shown in Table 2, the retrieval systems that use the open-source BGE-M3 and mE5-large embedding models achieve a perfect retrieval score. They are closely followed by the closed-source OpenAI-3-small and the mE5-base models, which achieve an accuracy of 99.5%. While having slightly lower scores, all other retrieval systems still achieve high performance, ranging between 98.5% and 99% in accuracy.
Table 2: Performance comparison between the open-source and closed-source embedding models.
    embedding model | speed (s) | retrieval accuracy
    BGE-M3          | 0.58      | 100
    mE5-large       | 0.58      | 100
    OpenAI-3-small  | 0.69      | 99.51
    mE5-base        | 0.29      | 99.51
    OpenAI-3-large  | 1.19      | 99.03
    mGTE-base       | 0.31      | 99.03
    OpenAI-Ada      | 0.63      | 98.54
    mE5-small       | 0.15      | 98.54

Additionally, Table 2 provides the inference speed of the models, measured in seconds per instance. If inference speed is a priority, the mE5-base model emerges as the optimal selection, as it yields a high retrieval accuracy of 99.51% and is two times faster than the two best-performing models. In cases where users are restricted to closed-source models due to the unavailability of GPU resources, the OpenAI-3-small model stands out as the most suitable option. Its inference speed is comparable to the OpenAI-Ada model, while it achieves a superior retrieval accuracy.

5 Conclusion and Future Work

In this paper, a novel test dataset was introduced to assess the performance of RAG systems on the Slovenian language. A general methodology for the efficient creation of non-synthetic RAG test datasets was established that can be extended to other languages. We evaluated the retrieval accuracy of the RAG system, examining the impact of the embedding models, the document chunk size, and the number of retrieved sources. The assessment of embedding models encompassed eight open-source and closed-source models. It revealed that open-source models, specifically BGE-M3 and mE5-large, reached perfect retrieval accuracy, demonstrating their suitability for RAG applications on Slovenian texts. Furthermore, the evaluation of the optimal chunk size and the number of retrieved sources showed that smaller chunk sizes yielded superior results. In contrast, increasing the number of retrieved sources enhanced results up to a certain threshold, beyond which the model performance plateaued. Certain models already achieved perfect accuracy when evaluated based on the top two retrieved sources.

While the novel test dataset can be used to evaluate all the components of the RAG system, in this paper we focused on the evaluation of the indexing and retrieval components. In our future work, we will extend the evaluations to the text generation component with regard to fluency, correctness, and usefulness of the generated answers. Furthermore, we plan to expand the benchmark to encompass a wider range of languages. The plans include extending the dataset and evaluation to South Slavic languages and other European languages that are covered by the comparable MaCoCu [1] and CLASSLA-web [10] corpora.
References
[1] Marta Bañón et al. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: Focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation. European Association for Machine Translation, Ghent, Belgium, (June 2022), 303–304. https://aclanthology.org/2022.eamt-1.41.
[2] Bernd Bohnet et al. 2022. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.
[3] Shuyang Cao and Lu Wang. 2024. Verifiable Generation with Subsentence-Level Fine-Grained Citations. arXiv preprint arXiv:2406.06125.
[4] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv: 2402.03216 [cs.CL].
[5] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, No. 16, 17754–17762.
[6] Anni Eskelinen, Amanda Myntti, Erik Henriksson, Sampo Pyysalo, and Veronika Laippala. 2024. Building Question-Answer Data Using Web Register Identification. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, Torino, Italia, (May 2024), 2595–2611. https://aclanthology.org/2024.lrec-main.234.
[7] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling Large Language Models to Generate Text with Citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 6465–6488.
[8] Ryan Greene, Ted Sanders, Lilian Weng, and Arvind Neelakantan. 2022. New and improved embedding model. https://openai.com/index/new-and-improved-embedding-model/. [Accessed 26-08-2024].
[9] Angeliki Lazaridou et al. 2021. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34, 29348–29363.
[10] Nikola Ljubešić and Taja Kuzman. 2024. CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 3271–3282.
[11] Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024. Slovenian web corpus CLASSLA-web.sl 1.0. In Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1882.
[12] Yuanjie Lyu et al. 2024. CRUD-RAG: A comprehensive Chinese benchmark for retrieval-augmented generation of large language models. arXiv preprint arXiv:2401.17043.
[13] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2014–2037.
[14] OpenAI. 2024. New embedding models and API updates. https://openai.com/index/new-embedding-models-and-api-updates/. [Accessed 26-08-2024].
[15] Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 338–354.
[16] Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval Augmentation Reduces Hallucination in Conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 3784–3803.
[17] Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, and Antonio Toral. 2024. Do Language Models Care about Text Quality? Evaluating Web-Crawled Corpora across 11 Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 5221–5234.
[18] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual E5 Text Embeddings: A Technical Report. arXiv preprint arXiv:2402.05672.
[19] Xin Zhang et al. 2024. mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. arXiv: 2407.19669 [cs.CL]. https://arxiv.org/abs/2407.19669.

Choosing Features for Stress Prediction with Machine Learning

Katja Bengeri (University of Ljubljana, Ljubljana, Slovenia; kb96968@student.uni-lj.si), Junoš Lukan* (Jožef Stefan Institute, Department of Intelligent Systems, Ljubljana, Slovenia; junos.lukan@ijs.si), Mitja Luštrek* (Jožef Stefan Institute, Department of Intelligent Systems, Ljubljana, Slovenia; mitja.lustrek@ijs.si)
*Also with Jožef Stefan International Postgraduate School.
DOI: https://doi.org/10.70314/is.2024.scai.991

Abstract: Feature selection is a crucial step in building effective machine learning models, as it directly impacts model accuracy and interpretability. Driven by the aim of improving stress prediction models, this article evaluates multiple approaches for identifying the most relevant features. The study explores filter-based methods that assess feature importance through correlation analysis, alongside wrapper methods that iteratively optimize feature subsets. Additionally, techniques such as Boruta are analysed for their effectiveness in identifying all important features, while strategies for handling highly correlated variables are also considered. By conducting a comprehensive analysis of these approaches, we assess the role of feature selection in developing stress prediction models.

Keywords: Feature selection, Correlation matrix, Balanced accuracy score
1 Introduction

Machine learning models are increasingly being applied to predict stress, which is critical in various domains such as healthcare, workplace management, and wearable technology. However, one of the major challenges in developing reliable predictive models is identifying the most relevant features from extensive datasets comprising physiological and behavioural information.

Feature selection plays a key role in addressing this challenge. By selecting only the most informative features, we can reduce noise, prevent overfitting, and enhance model accuracy. As we showed in previous work [8], even simple feature selection techniques can increase the F1 score of predictive models. This paper builds upon this finding and explores several feature selection techniques, ranging from simple correlation-based methods to more sophisticated wrapper approaches.

The aim of this work is to assess how feature selection can enhance stress prediction models. By comparing different methods, we aim to identify the optimal strategies for feature selection in stress prediction, which would lead to more reliable and more easily interpretable machine learning models.

2 Data collection

The data used in this work comes from the STRAW project [1], the results of which have been previously presented at Information Society [6, 8]. The dataset includes the data of 56 participants, recruited from academic institutions in Belgium (29 participants) and Slovenia (26 participants). They answered questionnaires named Ecological Momentary Assessments (EMAs) roughly every 90 minutes, with smartphone sensor and usage data continuously collected by an Android application [7], while also wearing an Empatica E4 wristband recording physiological data. In the 15 days of their participation, each participant responded to more than 96 EMA sessions on average, which resulted in around 2200 labels.

3 Target and feature extraction

To fully leverage the potential of the data, we computed a comprehensive set of features. While some sensors only reported relatively rare events, such as phone calls, others had a high sampling frequency, such as blood volume pulse, which was sampled at 32 Hz. On the other hand, labels were only available every 90 min. Therefore, we preprocessed the data in several steps.

3.1 Target variable

While participants responded to various questionnaires, for this study we selected their responses to the Stress Appraisal Measurement [9] as the target variable. It was used to report stress levels on a scale from 0 to 4, so using it as is, the prediction task can be approached as a regression problem. However, many stress detection studies tend towards a discrete approach, treating stress predominantly as a classification task, often only working with a binary target variable. To convert this into a classification problem, we discretized the target variable into two distinct categories: "no stress", which included all responses with a value of 0, while all others were coded as "stress". With that, we ensured a balanced distribution of the target variable values.
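For illustration, the discretization described above amounts to a one-line thresholding step. A minimal sketch, assuming the responses are held in a pandas Series (the variable names are hypothetical):

    import pandas as pd

    stress = pd.Series([0, 2, 0, 1, 4, 0])     # toy stress responses on the 0-4 scale
    y = (stress > 0).astype(int)               # 0 = "no stress", 1 = "stress"
    print(y.value_counts(normalize=True))      # check the class balance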
3.2 Features

3.2.1 Data preprocessing. In our work, features were calculated on 30-minute intervals preceding each questionnaire session. From the wide variety of smartphone data and physiological measures, a total of 352 features were extracted and grouped into 22 categories, listed in Table 1. Using physiological data from the Empatica wristband, we first calculated specialized physiological features on smaller windows (from 4 s to 120 s, depending on the sensor; see [4] for more details), which were then aggregated over 30 min windows by calculating simple statistical features: mean, median, standard deviation, minimum, and maximum. All of the categorical features were converted into a set of binary features using the one-hot encoding technique, and the missing values were replaced with the mode.

Table 1: Feature categories with the number of features included in each category in parentheses.
    1. Empatica electrodermal activity (99)    12. Phone screen events (7)
    2. Empatica inter-beat interval (50)       13. Phone light (6)
    3. Empatica temperature (33)               14. Phone battery (5)
    4. Empatica accelerometer (23)             15. Phone speech (4)
    5. Empatica data yield (1)                 16. Phone interactions (2)
    6. Phone applications foreground (47)      17. Phone messages (2)
    7. Phone location (18)                     18. Phone data yield (1)
    8. Phone Bluetooth connections (18)        19. Baseline psychological features (7)
    9. Phone calls (10)                        20. Language (2)
    10. Phone activity recognition (7)         21. Gender (2)
    11. Phone Wi-Fi connections (7)            22. Age (1)

First, some preliminary data cleaning was performed by excluding one of the features in pairs exhibiting a correlation coefficient of |r| ≥ 0.95. Despite this, some of the remaining features still exhibited quite strong correlations, as shown in Fig. 1. An interesting observation, used in the later stages of feature selection, was that high correlation, |r| ≥ 0.8, was mostly observed between features of the same category. As an example, correlations between features related to phone application use are shown in Fig. 2.

[Figure 1: Correlation matrix of the initial feature set. Only feature categories with more than two features are labelled.]

[Figure 2: Correlation matrix of the feature set in the Phone applications foreground category.]
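As a rough illustration of this preprocessing, the sketch below aggregates time-indexed sensor features over the 30-minute window preceding each EMA session; the variable names are hypothetical, and the actual STRAW pipeline is considerably more involved.

    import pandas as pd

    def window_features(sensor_df, ema_times, window="30min"):
        """sensor_df: time-indexed numeric feature frame; ema_times: EMA timestamps."""
        rows = []
        for t in ema_times:
            win = sensor_df.loc[t - pd.Timedelta(window):t]
            # mean, median, std, min and max of each column over the window
            rows.append(win.agg(["mean", "median", "std", "min", "max"]).unstack())
        out = pd.DataFrame(rows, index=ema_times)
        out.columns = ["_".join(col) for col in out.columns]
        return out

    # Categorical features would then be one-hot encoded, with missing
    # values replaced by the mode, e.g.:
    # X_cat = pd.get_dummies(cat_df.fillna(cat_df.mode().iloc[0]))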
Phone applications foreground 0.50 We evaluated various predictive models, as shown in Table 2, Phone battery 0.75 Phone bluetooth all as implemented in scikit-learn [10]. Among these, the Phone calls Phone light 1.00 Random Forest model yielded the best performance. Phone location Phone screen In this work, we aimed to find the best model for predicting Phone speech Phone wifi stress and improve it using the optimal feature subset. Conse- e ound een quently, we used the Random Forest as the benchmark for com- ometer egr mal activity Limesurveyecognition Phone callsPhone light Phone wifi paring feature selection algorithms. Phone battery Phone speech oder Phone bluetooth Phone location Phone scr Empatica acceler Empatica temperatur Table 2: Performance of different models for the classifica- Phone activity r Empatica inter beat interval tion problem. The mean over five folds, its standard error, Empatica electr Phone applications for and the maximum are shown. Figure 1: Correlation matrix of the initial feature set. Only feature categories with more than two features are labelled. Model Mean Max SEM Logistic Regression 0.077 0.151 0.025 Support Vector Machines 0.090 0.158 0.022 Gaussian Naive Bayes 1.00 0.061 0.122 0.020 Stochastic Gradient Descent 0.027 0.054 0.007 0.75 Random Forest 0.475 0.558 0.026 0.50 XGBoost 0.441 0.473 0.013 0.25 0.00 In Table 2, SEM represents the Standard Error of the Mean. 0.25 It measures how far the sample mean of the data is likely to be 0.50 from the true population mean. 0.75 4.3 Correlation-Based Feature Reduction Figure 2: Correlation matrix of the feature set in the Phone We began the feature selection process by eliminating highly applications foreground category. correlated features. For each highly correlated pair, we removed the feature with the lower rank when sorted by mutual informa- tion, setting the correlation threshold at |𝑟 | ≥ 0.8 to maintain 4 Prediction models a manageable number of features. This reduction left us with approximately 180 features out of the original 352 for model 4.1 Model performance and validation training and evaluation. To evaluate the performance of the models we used balanced While selecting the optimal set of features for stress prediction, accuracy score which is defined as the average of recall obtained we aimed to retain all 22 different categories from Table 1, as 12 Choosing Features for Stress Prediction with Machine Learning Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia Empatica accelerometer of features selected varied across folds, ranging from 50 to 93 features. Empatica electrodermal activity 1.0 0.8 4.6 Sequential Forward Selection Empatica inter beat interval 0.6 Another feature selection method we employed was Sequential 0.4 Feature Selector (SFS), a wrapper-based technique [2]. SFS and Empatica temperature 0.2 RFECV differ in their approaches. SFS constructs models for each Limesurvey Phone activity recognition 0.0 feature subset at every step, while RFECV builds a single model Phone applications foreground and evaluates feature importance scores. Consequently, SFS is 0.2 Phone battery more computationally expensive, as it must evaluate numerous 0.4 Phone bluetooth feature combinations before identifying the optimal subset. 
Figure 3: Correlation matrix of the feature set after correlation-based feature reduction. Only feature categories with more than two features are labelled.

4.4 Feature Selection using the mutual information scoring function

Before applying more complex feature selection algorithms, it was necessary to reduce computational complexity by further shrinking the set of roughly 180 features obtained through correlation-based reduction. We therefore used the SelectKBest method with the mutual information scoring function to retain the top 100 features. This resulted in features drawn from 19 to 20 categories, as the categories language, gender, and, in some cases, Empatica accelerometer were not deemed important for predicting stress.

Going forward, we will refer to the elimination of features within highly correlated pairs and the selection of the top 100 features using the mutual information scoring function as the preprocessing step.

4.5 Recursive Feature Elimination with Cross-Validation (RFECV)

One of the previously mentioned complex feature selection methods we employed was Recursive Feature Elimination with Cross-Validation (RFECV) [3]. The feature set obtained after the preprocessing step was passed to the RFECV algorithm for thorough evaluation.

RFECV operates by initially fitting a model to the dataset and evaluating its performance through cross-validation. After the initial fit, RFECV ranks feature importance and iteratively removes the least important features based on the model's feature importance attributes, which in the case of Random Forest are impurity-based feature importances. This process continues until there is no significant improvement in the model's performance. To ensure a reasonable duration for the feature selection process, we set the cross-validation in RFECV to 3 folds. The number of features selected varied across folds, ranging from 50 to 93.
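The two preprocessing stages and the RFECV run could look roughly as follows, reusing the placeholder names from the earlier sketches (X_corr is the ~180-feature set and adjusted_balanced_accuracy the scorer defined above).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif

# Preprocessing, part 2: keep the 100 features with the highest
# mutual information with the stress labels.
kbest = SelectKBest(score_func=mutual_info_classif, k=100).fit(X_corr, y)
X_pre = X_corr.loc[:, kbest.get_support()]

# RFECV with a Random Forest and 3-fold CV, as described in the text.
rfecv = RFECV(estimator=RandomForestClassifier(random_state=0),
              cv=3, scoring=adjusted_balanced_accuracy)
rfecv.fit(X_pre, y)
print(rfecv.n_features_)  # number of features kept, e.g. between 50 and 93
```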
4.6 Sequential Forward Selection

Another feature selection method we employed was the Sequential Feature Selector (SFS), a wrapper-based technique [2]. SFS and RFECV differ in their approaches: SFS constructs models for each candidate feature subset at every step, while RFECV builds a single model and evaluates feature importance scores. Consequently, SFS is more computationally expensive, as it must evaluate numerous feature combinations before identifying the optimal subset.

In the absence of specified values for the number of features to select (n_features_to_select) and the tolerance (tol), the method defaults to selecting half of the available features. The default configuration was used in our analysis, leading the SFS to select the top 50 features.

4.7 Boruta method

The final feature selection technique we employed was the Boruta method [5]. With the assistance of "shadow features", which are copies of the original features that have been randomly shuffled, the method identifies a subset of features that are relevant to the classification task at hand.

In our case, shadow features were introduced into the feature subset obtained after the preprocessing step. The updated dataset was used to train the Random Forest model for 100 iterations. In each iteration, all original features ranked higher in importance than the highest-ranked shadow feature were marked as relevant. Ultimately, a binomial test is used to decide which features can be kept in the final selection with sufficient confidence. The number of features selected varied across folds, ranging from 47 to 55.
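Both of these selectors have off-the-shelf implementations; a sketch under the same placeholder names is given below. Note that the boruta package's BorutaPy is one implementation of the method and may differ in details from the one used by the authors.

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

rf = RandomForestClassifier(random_state=0)

# SFS: with the default n_features_to_select and tol, scikit-learn keeps
# half of the available features (here 50 out of the 100 preprocessed ones).
sfs = SequentialFeatureSelector(rf, direction="forward", cv=3)
sfs.fit(X_pre, y)
sfs_selected = X_pre.columns[sfs.get_support()]

# Boruta: repeatedly compares each feature against randomly shuffled
# "shadow" copies and keeps those that beat the best shadow feature.
boruta = BorutaPy(rf, n_estimators="auto", max_iter=100, random_state=0)
boruta.fit(X_pre.values, y.values)   # BorutaPy expects numpy arrays
boruta_selected = X_pre.columns[boruta.support_]
```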
5 Results

Table 3 presents the final scores of a Random Forest model built on the various feature subsets derived from the methods described above. The data was split using shuffled 5-fold cross-validation to ensure that the results were not overly dependent on a single data split.

Table 3: Adjusted balanced accuracy scores of a Random Forest model trained on the different feature sets. The last column gives the number of features selected.

Feature set                   Mean    Max     SEM     N
All available features        0.464   0.498   0.011   352
Correlation-based reduction   0.483   0.507   0.007   ~180
Correlation-based, 100 best   0.486   0.498   0.006   100
Preprocessing, RFECV          0.471   0.511   0.012   50 to 93
Preprocessing, SFS            0.483   0.520   0.017   50
Preprocessing, Boruta         0.481   0.545   0.020   47 to 55
RFECV only                    0.465   0.494   0.020   16 to 89
SFS only                      0.426   0.468   0.017   30
Boruta only                   0.456   0.509   0.015   ~75

From Table 3, we can see that the most significant improvement in accuracy came from removing the highly correlated features, with the average adjusted balanced accuracy rising from 0.46 to 0.48. The best mean accuracy was achieved after the full preprocessing step, with a further, minor improvement from 0.483 to 0.486.

After eliminating highly correlated features, the wrapper methods did not significantly improve the accuracy on average (rows 3 to 6 in Table 3). The Boruta method performed best among the three, with the highest overall maximum accuracy in a single fold. These results led us to investigate whether the wrapper feature selection methods alone could manage correlated features without their prior removal, and to evaluate the impact of the correlation threshold.

We employed the RFECV, SFS, and Boruta methods on the entire feature set of 352 features, without applying the preprocessing step. For SFS, only 30 features were selected due to its computational complexity. As shown in the last three rows of Table 3, none of the methods alone were able to improve on the result achieved with correlation removal. Highly correlated features were left in the final feature set: for example, we identified three pairs of features with a correlation coefficient exceeding |r| ≥ 0.8 when using SFS alone. The poor results could be attributed either to the importance of the correlation removal step or, in the case of SFS, to the feature subset being too small.

5.1 Selecting the best correlation threshold

As previously mentioned, the biggest improvement in score came from removing one feature inside each highly correlated pair. We therefore also experimented with different correlation cut-off values to determine the best threshold.

The highest score was achieved with a correlation threshold of |r| ≥ 0.75 (Table 4). Considering the impact of the cross-validation splits and the relatively minor variance in scores, it appears that our initial threshold of |r| ≥ 0.8 was also quite effective.

Table 4: Adjusted balanced accuracy scores of a Random Forest model trained on a feature subset excluding features above the correlation threshold. The number of features left after correlation-based feature selection differed over validation folds, and its range is shown in the final column.

Threshold   Mean    Max     SEM     N
0.55        0.462   0.506   0.018   28 to 33
0.60        0.467   0.493   0.009   39 to 41
0.65        0.474   0.498   0.008   47 to 50
0.70        0.460   0.501   0.017   61 to 65
0.75        0.498   0.526   0.012   74 to 80
0.80        0.470   0.543   0.022   101 to 107
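The threshold sweep can be expressed as a small loop around the reduction function from the Section 4.3 sketch. For brevity, this version reduces the features once on the full data, whereas in the paper the reduction was performed per fold, which is why the feature counts in Table 4 vary across folds.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# drop_correlated(), adjusted_balanced_accuracy and cv are the helpers
# defined in the earlier sketches; X and y are the full 352-feature data.
for threshold in (0.55, 0.60, 0.65, 0.70, 0.75, 0.80):
    X_reduced = drop_correlated(X, y, threshold=threshold)
    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             X_reduced, y,
                             scoring=adjusted_balanced_accuracy, cv=cv)
    print(f"|r| >= {threshold:.2f}: mean={scores.mean():.3f}, "
          f"max={scores.max():.3f}, n={X_reduced.shape[1]}")
```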
6 Conclusions

This paper examined different feature selection algorithms to find the most effective feature subset for stress prediction. The model using the feature subset after correlation removal achieved the highest adjusted balanced accuracy score of 0.483.

Alternative feature selection approaches, including the wrapper methods SFS and RFECV as well as the Boruta method, applied to the preprocessed feature subset, did not lead to further optimization of the feature subset in terms of model performance. Additionally, applying these methods to the entire set of features did not achieve accuracy levels as high as those obtained after the correlation-based reduction. In the case of SFS, this may be attributed to its selection of only 30 features.

Our results therefore underscore the critical role of the correlation-based reduction step. When this step was omitted, the wrapper methods alone were unable to effectively perform correlation-based feature reduction. We can therefore conclude that simply relying on feature selection methods, however sophisticated, is not as effective as also considering the relationships between features.

It should be noted that the improvements in balanced accuracy are small in all cases. This indicates that the results cannot be easily generalized, and correlation-based feature selection should not be seen as sufficient in general. Instead, we can speculate that no single feature selection method is the best one and that several should be considered. We should also note that the Pearson correlation coefficient used in this work only captures linear relationships between features; other methods can select features even if they are related in a different way.

References
[1] Larissa Bolliger, Junoš Lukan, Mitja Luštrek, Dirk De Bacquer, and Els Clays. 2020. Protocol of the STRess at Work (STRAW) project: how to disentangle day-to-day occupational stress among academics based on EMA, physiological data, and smartphone sensor and usage data. International Journal of Environmental Research and Public Health, 17, 23, (Nov. 2020), 8835. doi: 10.3390/ijerph17238835.
[2] Francesc J. Ferri, Pavel Pudil, Mohamad Hatef, and Josef V. Kittler. 1994. Comparative study of techniques for large-scale feature selection. Machine Intelligence and Pattern Recognition, 16, 403–413. doi: 10.1016/b978-0-444-81892-8.50040-7.
[3] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. 2002. Gene selection for cancer classification using Support Vector Machines. Machine Learning, 46, 1/3, 389–422. doi: 10.1023/a:1012487302797.
[4] Vito Janko, Matjaž Boštic, Junoš Lukan, and Gašper Slapničar. 2021. Library for feature calculation in the context-recognition domain. In Proceedings of the 24th International Multiconference Information Society – IS 2021. Slovenian Conference on Artificial Intelligence (Ljubljana, Slovenia, Oct. 4–8, 2021). Vol. A, 23–26.
[5] Miron B. Kursa and Witold R. Rudnicki. 2010. Feature selection with the Boruta package. Journal of Statistical Software, 36, 11, 1–13. doi: 10.18637/jss.v036.i11.
[6] Junoš Lukan, Larissa Bolliger, Els Clays, Primož Šiško, and Mitja Luštrek. 2022. Assessing sources of variability of hierarchical data in a repeated-measures diary study of stress. In Proceedings of the 25th International Multiconference Information Society – IS 2022. Pervasive Health and Smart Sensing (Ljubljana, Slovenia, Oct. 10–14, 2022). Vol. A, 31–34.
[7] Junoš Lukan, Marko Katrašnik, Larissa Bolliger, Els Clays, and Mitja Luštrek. 2020. STRAW application for collecting context data and ecological momentary assessment. In Proceedings of the 23rd International Multiconference Information Society – IS 2020. Slovenian Conference on Artificial Intelligence (Ljubljana, Slovenia, Oct. 5–9, 2020). Vol. A, 63–67.
[8] Marcel Franse Martinšek, Junoš Lukan, Larissa Bolliger, Els Clays, Primož Šiško, and Mitja Luštrek. 2023. Social interaction prediction from smartphone sensor data. In Proceedings of the 26th International Multiconference Information Society – IS 2023. Slovenian Conference on Artificial Intelligence (Ljubljana, Slovenia, Oct. 9–13, 2023). Vol. A, 11–14.
[9] Edward J. Peacock and Paul T. P. Wong. 1990. The stress appraisal measure (SAM): a multidimensional approach to cognitive appraisal. Stress Medicine, 6, 3, (July 1990), 227–236. doi: 10.1002/smi.2460060308.
[10] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[11] Charles Sanders Peirce. 1884. The numerical measure of the success of predictions. Science, ns-4, 93, 453–454. doi: 10.1126/science.ns-4.93.453.b.

Predictive Modeling of Football Results in the WWIN League of Bosnia and Herzegovina

Ervin Vladić, Dželila Mehanović, Elma Avdić
International Burch University, Sarajevo, Bosnia and Herzegovina
ervin.vladic@stu.ibu.edu.ba, dzelila.mehanovic@ibu.edu.ba, elma.avdic@ibu.edu.ba

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.1642

Abstract

Predictive modeling in football has emerged as a valuable tool for enhancing decision-making in sports management. This study applies machine learning techniques to predict football match outcomes in the WWIN League of Bosnia and Herzegovina. The aim is to evaluate the effectiveness of various models, including Support Vector Machines (SVM), Logistic Regression, Random Forest, Gradient Boosting, and k-Nearest Neighbors (kNN), in accurately predicting match results based on key features such as shots on target, possession percentage, and home/away status. By (1) gathering and analyzing match data from three seasons, (2) comparing the performance of machine learning models, and (3) drawing conclusions on key performance factors, we demonstrate that SVM achieves the highest accuracy at 83%, outperforming the other models. These insights contribute to football management, allowing for data-driven strategic planning and performance optimization.
Future research will integrate additional factors, such as player injuries and weather conditions, to further improve the predictive models.

Keywords

Football match prediction, machine learning, WWIN league, Support Vector Machines, Random Forest

1 Introduction

Accurate predictions of match outcomes can inform a wide range of decisions, from tactical adjustments to player acquisitions, and can improve engagement for fans and stakeholders. While predictive modeling has been extensively applied to top-tier football leagues such as the English Premier League, there is limited research on regional leagues such as the WWIN League of Bosnia and Herzegovina. The specific context of Bosnia and Herzegovina and the WWIN League, which has not previously been studied in sports research, provides the motivation for this work.

The WWIN League of Bosnia and Herzegovina was established in the year 2000 through the merging of three leagues, becoming a league covering the entire territory of Bosnia and Herzegovina. Originally, the league consisted of 16 clubs; since the 2016–2017 season, it contains 12 clubs, which raises the level of the league [25]. The winner is the team with the most points after thirty-three rounds; this position grants the team a place in the UEFA Champions League qualifications [10], while the remaining two top teams and the cup winner compete for a place in the UEFA Conference League. Since the founding of the WWIN League, the team with the highest number of titles has been HŠK Zrinjski from Mostar, which won eight times, followed by Sarajevo with four titles, Željezničar and Borac with three each, Široki Brijeg with two, and Leotar and Modriča with one each [12]. Depending on the entity association they belong to, the teams occupying the last two places in the league at the end of the season are relegated to the league below, with two teams from the First League of the Federation of BiH and the First League of the RS promoted in their stead.

To elevate football in the country to the highest level, in-depth analyses of matches and the factors influencing their outcomes are needed. Such analyses can enable coaches to fine-tune strategies for future games, help commentators provide more insightful commentary, and allow fans to develop a deeper understanding of, and take more pleasure in, the match.

The study aims to evaluate the performance of various ML models, including Support Vector Machines (SVM), Logistic Regression, Random Forest, Gradient Boosting, and k-Nearest Neighbors (kNN), in predicting match results. Examining key features such as shots on target, possession percentage, and home/away status, we conduct an analysis based on match data from three seasons of the WWIN League, encompassing 594 matches and key performance metrics.

The remainder of the paper is structured as follows: Section 2 provides an overview of related work on ML-based football prediction. Section 3 describes the methodology, including the dataset and models used. Section 4 presents the results and an analysis of the models' performance, with a discussion of the practical implications of the findings for football management. Finally, Section 5 concludes the paper and outlines directions for future research.

2 Literature Review

The prediction of football match results has recently been studied extensively because of its relation to betting and decision-making in sports. Studies examining the use of ML methods have primarily focused on large European leagues, where extensive and highly detailed data is available. The application of these techniques to regional football leagues, such as the WWIN League of B&H, remains underexplored.

Rodrigues and Pinto [15] used a variety of ML methods, including Naive Bayes, k-nearest neighbors, Random Forest, and SVM, to predict match outcomes based on previous match data and player attributes. Their study reported excellent results in terms of soccer betting profit margins, with the Random Forest approach obtaining a success rate of 65.26% and a profit margin of 26.74%.
Rahman [13] dedicated his work to employing deep learning frameworks, especially Deep Neural Networks (DNNs), for football match outcome prediction, particularly during the FIFA World Cup 2018. The study classified match outcomes with 63.3% accuracy using DNN architectures with LSTM or GRU cells. Baboota and Kaur [3] used machine learning approaches to predict English Premier League match results; the compared models included Support Vector Machines, Random Forest, and Gradient Boosting, and they found that Gradient Boosting outperformed the other models in accuracy and overall predictiveness. The authors of [16] used machine learning techniques, notably SVM and a Random Forest classifier, to predict English Premier League (EPL) match results, obtaining 54.3% accuracy with SVM and 49.8% with Random Forest on data from the 2013/2014 to 2018/2019 seasons. Another study [8] employed several machine learning algorithms to predict matches of the EPL 2017–2018 season; among Linear Regression, SVM, Logistic Regression, Random Forest, and Multinomial Naive Bayes, the k-nearest neighbors classifier gave the most accurate predictions.

In summary, while existing studies have demonstrated the effectiveness of machine learning in football match prediction, there remains a gap in the application of these techniques to regional leagues like the WWIN League, due to the availability and quality of data. The characteristics of these leagues, such as smaller datasets and potentially different factors influencing match outcomes, require a tailored approach; in lesser-known football leagues, models might also perform differently due to variations in competitive structures and gameplay strategies. The study of Munđar and Šimić [11], who developed a simulation model using the Poisson distribution to predict the seasonal rankings of teams in the Croatian First Football League, highlighted the predictive power of statistical models and demonstrated the significance of home advantage in determining match outcomes, an important factor in the WWIN League as well.

3 Materials and Methods

In this section, we describe the study conducted, detailing the data collection and feature selection processes, the machine learning models applied, the evaluation metrics used to assess model performance, and the approach taken to analyze the key features influencing match outcomes. The methodology is illustrated graphically in Fig. 1: the steps involved in predicting the outcomes of the WWIN League of Bosnia and Herzegovina include data collection, preprocessing, model development, and algorithm evaluation.

Figure 1: Workflow diagram
3.1 Dataset

The authors created the dataset for this study by consolidating information from rezultati.com [14], 1XBET [1], and Sofascore [24]. The resulting dataset covers the 2021/2022, 2022/2023, and 2023/2024 seasons of the WWIN League of Bosnia and Herzegovina. These platforms provide a wide range of football match data, making it easy to find the information needed for analysis. The dataset includes key match facts such as the date, day of the week, and time; the home and away teams; final and half-time goals; referee details; shots at goal, shots on target, and the corner kicks resulting from these attempts; bookings made during play by both teams; and other relevant performance indicators.

Table 1: Class Distribution

Match Type   Count
Home Win     301
Away Win     142
Draw         151

The table summarizes each type of match result by its frequency in the dataset. Of the 594 recorded matches, 301 ended in home-team victories, 142 in away-team victories, and 151 were drawn. Home wins are in the majority, comprising 50.7% of all matches, while away victories account for approximately 23.9% of all recorded results and draws for 25.4%. Fig. 2 depicts the distribution of the match outcomes.

Figure 2: Class distribution of the dataset

3.2 Machine Learning Prediction

In football, machine learning prediction entails developing models that forecast match outcomes based on the teams' and players' histories and other attributes [5]. These models employ methods such as regression analysis, classification, and neural networks to determine the results from the data fed as input.

3.2.1 Models initialisation, preprocessing, training and testing. For Logistic Regression, we set max_iter = 1000 and random_state = 42. For the SVM classifier, the kernel argument was set to linear, and random_state was again fixed at 42 to keep the results reproducible. Gaussian Naive Bayes was employed without modification of its settings because of the model's simplicity, and for Random Forest, Gradient Boosting, and k-Nearest Neighbors (kNN) we likewise used the default parameters.

Following that, we divided the gathered data into two sets, allocating 70% to training and 30% to testing.

Subsequently, a preprocessing stage was set up to transform the data consistently for model training. For feature transformation, we used scikit-learn's ColumnTransformer [17] to normalize the numeric features via the StandardScaler [23], while transforming the categorical variables into binary format with the OneHotEncoder [18]. This ensures that the different feature types are handled uniformly. The preprocessing steps and the model are then joined in a single pipeline, which guarantees that the same transformations are applied during both training and testing and keeps variability manageable. With the pipeline defined, we proceed to model training, as sketched below.
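A condensed sketch of this setup follows. The names X, y, numeric_cols, and categorical_cols are placeholders for the match data, and handle_unknown="ignore" is our addition to keep the encoder robust to unseen categories, not a setting reported in the paper.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                      # normalize numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # binarize categoricals
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),  # or SVC(kernel="linear"), etc.
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # 70/30 split as in the text
model.fit(X_train, y_train)
```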
3.2.2 Models in Detail. In this study, several supervised learning classifiers that have proven valuable for predictive purposes in sports are employed. Logistic Regression is a statistical technique that predicts the probability of a binary classification, using a sigmoid function to map outputs to a [0, 1] probability space. Its coefficients indicate the strength and direction of relationships between variables, with positive values increasing the likelihood of an event and negative values decreasing it [9].

Random Forest extends the bagging method by generating multiple decision trees from randomly selected data samples. Each tree operates independently, and the final prediction is aggregated across all trees, reducing overfitting and improving accuracy in classification tasks [4].

SVM aims to find the best hyperplane to separate data points by class, maximizing the margin between them. It handles non-linear boundaries by transforming the input data into a higher-dimensional space [2].

Naive Bayes, based on Bayes' theorem, assumes feature independence, making it fast and easy to implement, especially in applications like spam detection and text classification. Despite the simplicity of this assumption, it performs well in practice [26].

Gradient Boosting combines multiple weak learners (typically decision trees) into a stronger predictive model, improving accuracy by focusing on correcting the errors of the previous models [6].

k-Nearest Neighbors (kNN) is an instance-based learning method that classifies data by identifying the majority label among the k closest points. Though simple, it can be computationally expensive, as it requires storing all training data and performing real-time comparisons [7].

3.2.3 Evaluation Metrics. Finally, the trained models are assessed using metrics such as accuracy [19], precision [21], recall [22], and F1-score [20]. This evaluation shows how well each of the models is likely to perform in predicting match outcomes; a sketch of the computation follows.
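On the held-out 30% of the data, the metrics could be computed as follows; macro averaging over the three outcome classes is our assumption, as the paper does not state which averaging was used.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
# Macro averaging treats the H/D/A classes equally, one plausible choice
# for the three-class match-outcome problem.
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))
print("f1       :", f1_score(y_test, y_pred, average="macro"))
```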
4 Results and Discussion

In this study, we employed six different classifiers to predict football match outcomes and conducted a comparative analysis of their performance. The effectiveness of each classifier was evaluated based on its accuracy, providing a clear comparison of their predictive capabilities across the dataset.

4.1 Model Performance

Among the classifiers employed, SVM produced the most accurate predictions, at 83%. The model performed well across the board, with balanced precision and recall over all three classes (A, D, and H), showing that it can reliably predict match outcomes. Logistic Regression performed somewhat worse than SVM, with an accuracy of 77%. Despite its simplicity and computational efficiency, Gaussian Naive Bayes had the lowest accuracy of any classifier tested, at 39%; this model struggled in particular with class 'D', with low precision and recall scores. Random Forest, an established ensemble learning approach, did not perform well either, with an accuracy of 54%, although its precision and recall were roughly balanced across all classes, making it an acceptable alternative for predicting match results. Gradient Boosting, another ensemble learning technique, achieved a somewhat higher accuracy than Random Forest, at 64%; while it is recognized for its ability to capture complicated relationships, it produced poorer recall scores, especially for class 'D'. Lastly, k-Nearest Neighbors (kNN) reached 51% accuracy; although relatively weak overall, it had fairly even precision and recall across the classes.

In summary, for the match predictions we employed the following classification models: Logistic Regression, Support Vector Machine, Gaussian Naive Bayes, Random Forest, Gradient Boosting, and k-Nearest Neighbors. The resulting accuracies varied from 39% to 83%, with Support Vector Machines being the most effective. Our findings are partially consistent with prior research, as classifiers like Support Vector Machines, Logistic Regression, and Random Forest have shown robustness in predicting match outcomes across datasets. Nevertheless, the results do not conform with some recent work, particularly concerning the efficacy of Gaussian Naive Bayes, which performed poorly in our study in contrast to other research. It should be noted that results may vary significantly between studies depending on the quality, quantity, and nature of the data used to build the models.

Table 2: Model Performances

Model                 Accuracy   Precision   Recall   F1 score
Logistic Regression   77%        75%         74%      74%
SVM                   83%        86%         83%      84%
Gaussian NB           39%        47%         42%      36%
Random Forest         54%        43%         46%      43%
Gradient Boosting     64%        64%         59%      60%
kNN                   51%        49%         49%      49%

Table 2 summarizes how accurately the various machine learning models predict WWIN League of Bosnia and Herzegovina match outcomes.
4.1.1 Key factors influencing match outcomes. While this study does not perform a formal feature analysis, the observed performance trends allow us to draw conclusions about the key factors influencing match outcomes. In line with prior research, home advantage emerged as a critical factor, with teams winning at home in over 50% of cases (Table 1), which reinforces the psychological and tactical advantages of playing on familiar ground.

Offensive metrics, particularly shots on target, also emerged as strong predictors of success. Teams that generated more attempts on goal were significantly more likely to win, reinforcing the widely accepted view that aggressive, forward-driven play translates directly into better results. This trend mirrors observations from other football leagues, where offensive intensity is often directly correlated with victory.

4.1.2 Limitations and future work. Despite the promising results, this study has several limitations. First, the dataset used does not account for external factors such as player injuries, weather conditions, or team morale, all of which can influence match outcomes; future research should incorporate these factors to improve the accuracy of predictions. Second, while SVM performed well in this context, more advanced models such as deep learning could potentially offer even better predictive performance, particularly on larger datasets.

Future work will explore the integration of additional domain-specific features, such as player statistics, team form, and environmental conditions, to further refine the predictive models. We will also experiment with more complex algorithms, such as neural networks, to capture intricate relationships between features that may be missed by traditional machine learning models.
5 Conclusion

This study demonstrates that machine learning, particularly SVM, can effectively predict football match outcomes in the WWIN League of Bosnia and Herzegovina. The Support Vector Machine was found to be the most accurate classifier, with an 83% accuracy rate on match result prediction, and showed balanced precision and recall across all three outcome classes (Home Win, Away Win, and Draw), indicating its applicability to football prediction. However, the performance of the other classifiers varied, with Logistic Regression achieving 77% accuracy and Gaussian Naive Bayes a poor 39%. Random Forest and Gradient Boosting, both ensemble learning algorithms, reached similar accuracy levels of 54% and 64%, respectively. While further refinement of the models is needed, the current findings establish a strong foundation for data-driven decision-making in football management. Future work should incorporate additional factors, such as player injuries and weather conditions, to enhance predictive accuracy.

References
[1] 1XBET. 2007–2024. 1xbet. Retrieved May 26, 2024 from https://1xlite-579542.top/en?tag=s_245231m_5435c_.
[2] Mariette Awad and Rahul Khanna. 2015. Support vector machines for classification. In Efficient Learning Machines. Apress, 39–66. doi: 10.1007/978-1-4302-5990-9_3.
[3] Rahul Baboota and Harleen Kaur. 2019. Predictive analysis and modelling football results using a machine learning approach for the English Premier League. International Journal of Forecasting, 35, 2, 741–755. doi: 10.1016/j.ijforecast.2018.01.003.
[4] Leo Breiman. 2001. Random forests. Machine Learning, 45, 5–32. doi: 10.1023/A:1010933404324.
[5] Rory P. Bunker and Fadi Thabtah. 2019. A machine learning framework for sport result prediction. Applied Computing and Informatics, 15, 1, 27–33.
[6] Stefanos Fafalios, Pavlos Charonyktakis, and Ioannis Tsamardinos. 2020. Gradient Boosting Trees. Gnosis Data Analysis PC, (Apr. 2020).
[7] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. 2003. KNN model-based approach in classification. In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE. Springer Berlin Heidelberg, Catania, Sicily, Italy, (Nov. 2003), 986–996.
[8] Ishan Jawade, Rushikesh Jadhav, Mark Joseph Vaz, and Vaishnavi Yamgekar. 2021. Predicting football match results using machine learning. International Research Journal of Engineering and Technology (IRJET), 8, 7, 177. https://www.irjet.net.
[9] Daniel Jurafsky and James H. Martin. 2023. Logistic Regression. Stanford University, chapter 5. https://web.stanford.edu/~jurafsky/slp3/5.pdf.
[10] Haris Kruskic. 2019. UEFA Champions League explained: how the tournament works. Bleacher Report. Retrieved from https://bleacherreport.com/articles/2819840-uefa-champions-league-explained-how-the-tournament-works.
[11] Dušan Munđar and Diana Šimić. 2016. Croatian first football league: teams' performance in the championship. Croatian Review of Economic, Business and Social Statistics, 2, 1, 15–23. https://hrcak.srce.hr/file/245359.
[12] Prva Liga BiH. 2022. Osvajači trofeja. Retrieved from https://plbih.ba/osvajaci-trofeja/.
[13] Ashiqur Rahman. 2020. A deep learning framework for football match prediction. SN Applied Sciences, 2, 2, 165. doi: 10.1007/s42452-019-1821-5.
[14] 2006–2024. Rezultati. Retrieved May 26, 2024 from https://www.rezultati.com/.
[15] Fátima Rodrigues and Ângelo Pinto. 2022. Prediction of football match results with machine learning. Procedia Computer Science, 204, 463–470. doi: 10.1016/j.procs.2022.08.057.
[16] Sayed Muhammad Yonus Saiedy, Muhammad Aslam HemmatQachmas, and Amanullah Faqiri. 2020. Predicting EPL football match results using machine learning algorithms. International Journal of Engineering Applied Sciences and Technology, 5, 3, 83–91. http://www.ijeast.com.
[17] scikit-learn. 2024. ColumnTransformer. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html.
[18] scikit-learn. 2024. OneHotEncoder. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
[19] scikit-learn. 2024. sklearn.metrics.accuracy_score. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html.
[20] scikit-learn. 2024. sklearn.metrics.f1_score. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
[21] scikit-learn. 2024. sklearn.metrics.precision_score. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html.
[22] scikit-learn. 2024. sklearn.metrics.recall_score. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html.
[23] scikit-learn developers. 2024. sklearn.preprocessing.StandardScaler. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.
[24] Sofascore. 2024. Sofascore. Retrieved May 26, 2024 from https://www.sofascore.com/.
[25] SportMonks. 2022. Premier league API Bosnia. Retrieved from https://www.sportmonks.com/football-api/premier-league-api-bosnia/.
[26] Geoffrey I. Webb. 2016. Naïve Bayes. In Encyclopedia of Machine Learning and Data Mining. (Jan. 2016), 1–2. doi: 10.1007/978-1-4899-7502-7_581-1.

Sarcasm Detection in a Less-Resourced Language

Lazar Ðoković, Marko Robnik-Šikonja
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
lazardjokoviclaki02@gmail.com, marko.robnik@fri.uni-lj.si

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.4212

Abstract

The sarcasm detection task in natural language processing tries to classify whether an utterance is sarcastic or not. It is related to sentiment analysis, since sarcasm often inverts the surface sentiment. Because sarcastic sentences are highly dependent on context and are often accompanied by various non-verbal cues, the task is challenging. Most related work focuses on high-resourced languages like English.
To build a sarcasm detection dataset for a less-resourced language, such as Slovenian, we leverage two modern techniques: a medium-size transformer model specialized for machine translation, and a very large generative language model. We explore the viability of the translated datasets and how the size of a pretrained transformer affects its ability to detect sarcasm. We train ensembles of detection models and evaluate their performance. The results show that larger models generally outperform smaller ones and that ensembling can slightly improve sarcasm detection performance. Our best ensemble approach achieves an F1-score of 0.765, which is close to the annotators' agreement in the source language.

Keywords

natural language processing, large language models, sarcasm detection, neural machine translation, BERT model, GPT model, LLaMa model, ensembles

1 Introduction

Sentiment analysis is a popular task in natural language processing (NLP), concerned with the extraction of underlying attitudes and opinions, usually categorized as "positive", "negative", and "neutral". Detecting sentiment is challenging when utterances are ironic or sarcastic. Sarcasm is a form of verbal irony that transforms the surface polarity of an apparently positive or negative utterance into its opposite [6]. Sarcasm is frequent in day-to-day communication, especially on social media [5]. This poses a significant problem for sentiment analysis tools, since sarcasm polarity switches create ambiguity in meaning. Sarcasm is also highly dependent on its context: for example, the sentence "I just love hot weather" could be interpreted as sarcastic depending on the situation, e.g., during a summer heat wave.

Historical developments of sarcasm detection are surveyed by Joshi et al. [3], while recent developments are covered by Moores and Mago [5]. The problem of automatic sarcasm detection in text is most commonly formulated as a classification task. Unfortunately, sarcasm detection is affected by the lack of large-scale, noise-free datasets. Existing datasets are mostly harvested from microblogging platforms such as Twitter and Reddit, relying on user annotation via distant supervision through hashtags such as #sarcasm, #sarcastic, #not, etc. This method is popular since (1) only the author of a post can determine whether it is sarcastic or not, and (2) it allows large-scale dataset creation. However, it introduces large amounts of noise due to lack of context, user errors, and common misuse on social media platforms. Sarcasm detection datasets created through manual annotation tend to be of higher quality but are typically much smaller. These problems are further compounded for non-English datasets, both manually labeled and automatically collected. Further, as sarcasm strongly relies on its context, using classical machine translation (MT) from English often produces inadequate results. This makes sarcasm detection in less-resourced languages, like Slovenian, an even bigger challenge. Developing reliable sarcasm detection models is therefore of crucial importance for robust sentiment analysis in these languages.

We develop a methodology for sarcasm detection in less-resourced languages and test it on Slovenian. We address the problem of missing datasets by comparing state-of-the-art machine translation with large generative models. We explore the viability of such datasets and how the number of parameters affects a model's ability to detect sarcasm. We construct various ensembles of large pretrained language models and evaluate their performance.

The rest of this work is organized as follows. In Section 2, we discuss the proposed approach for detecting sarcasm in a less-resourced language such as Slovenian, presenting the creation of the dataset, the details of the training methodology, and the deployed ensemble techniques. We lay out our experimental results and their interpretation in Sections 2.3 and 4. In Section 5, we provide conclusions and directions for future work.

2 Sarcasm Detection Dataset

Existing attempts at automatic sarcasm detection have resulted in datasets in a small number of languages, of differing sizes and quality. It is unclear whether models trained on these datasets would generalize well to unseen languages [1]. Since no dataset exists for Slovenian, we leverage recent advances in machine translation and large language models (LLMs) to create a dataset for supervised sarcasm detection.
We thus apply a translate-train approach when fine-tuning our models. The prevalence of research on sarcasm in English means that English datasets are usually larger and of higher quality than their counterparts in other languages. Further, as translation from (and to) English is usually of better quality, we consider only English datasets.

Preliminary tests showed poor quality and poor translatability of the Sarcasm on Reddit dataset (www.kaggle.com/datasets/danofer/sarcasm) and the News Headlines Dataset For Sarcasm Detection (www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). Hence, we chose the recent iSarcasmEval dataset (github.com/iabufarha/iSarcasmEval) from the SemEval-2022 shared task. We believe that the relatively low performance scores obtained in this shared task could be improved with the use of larger LLMs.

2.1 iSarcasmEval Dataset

iSarcasmEval is a dataset of English and Arabic sarcastic and non-sarcastic short-form tweets obtained from Twitter. We use only the English part, which is pre-split into a train and a test set. Both sets are unbalanced: the former has 867 sarcastic and 2601 non-sarcastic examples, while the latter has 200 sarcastic and 1200 non-sarcastic examples. The authors of the shared task note that both distant supervision and manual annotation produce noisy labels, in terms of both false positives and false negatives [1]. They therefore asked Twitter users to directly provide one sarcastic and three non-sarcastic tweets they had posted in the past; these responses were then filtered to ensure their quality. The produced dataset is still not entirely clean, since it contains links, emojis, and capitalized text. We chose to leave all of these potential features in the text, as they commonly occur in online conversations and could be indicative of sarcasm. For reference, an ensemble approach with 15 transformer models and transfer from three external sarcasm datasets proved to be the most accurate modeling technique for English [9], achieving an F1-score of 0.605.

2.2 Translating iSarcasmEval

Our preliminary testing using smaller BERT-like classifiers showed that the models learned the distribution of the data and defaulted to the majority classifier (1200/1400 = 0.857 test accuracy). To discourage this, we merged the train and test sets, kept all the sarcastic instances, and randomly sampled an equal number of non-sarcastic examples. This left us with a balanced dataset of 2134 samples.

To enable task-specific instructions that would preserve sarcasm, we skipped classical machine translation tools and tried two alternative translation approaches:

• using a medium-sized T5 model trained specifically for neural machine translation,
• leveraging a significantly larger model via OpenAI's API.

The T5 model uses both the encoder and decoder stacks of the Transformer architecture and is trained within a text-to-text framework. We chose Google's 32-layer T5 model called MADLAD400-10B-MT (huggingface.co/google/madlad400-10b-mt), which has 10.7 billion parameters and is pretrained on the MADLAD-400 [4] dataset of 250 billion tokens covering 450 languages. Fine-tuning for machine translation was done on a combination of parallel data sources in 157 languages, including Slovenian.

As a generative model, we chose OpenAI's decoder-based GPT-4o-2024-05-13 (platform.openai.com/docs/models/gpt-4o). Its true size is not known to the public, but it is speculated to be significantly smaller than GPT-4, since it is much faster and more efficient. OpenAI claims that it has the best performance across non-English languages of any of their models, making it suitable for our task.

When prompting generative decoder-based models, it is necessary to craft clear and specific instructions to achieve the best results. We used few-shot learning [2]: we randomly sampled three training instances, manually translated them, and included them in the following prompt, where the double forward slash delimits the query and the expected response.

You will be provided with a sarcastic/non-sarcastic sentence in English, and your task is to translate it into the Slovenian language. It should keep the original meaning. Examples:
• love getting assignments at 6:25pm on a Friday!! // obožujem, ko mi v petek ob 18:25 pošljejo naloge!!
• I still can't believe England won the World Cup // Še vedno ne morem verjeti, da je Anglija zmagala na svetovnem prvenstvu
• taking spanish at ut was not my best decision // Učenje španščine na UT ni bila moja najboljša odločitev

We manually assessed the outputs of both transformers in order to determine the best translations for fine-tuning the detection models.
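A sketch of how such a few-shot prompt could be sent through OpenAI's Python client is given below; the message layout and temperature setting are our assumptions, and only the first few-shot pair from the prompt above is shown.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM = ("You will be provided with a sarcastic/non-sarcastic sentence in "
          "English, and your task is to translate it into the Slovenian "
          "language. It should keep the original meaning.")
FEW_SHOT = [  # manually translated training instances, as in the prompt above
    ("love getting assignments at 6:25pm on a Friday!!",
     "obožujem, ko mi v petek ob 18:25 pošljejo naloge!!"),
]

def translate(tweet: str) -> str:
    messages = [{"role": "system", "content": SYSTEM}]
    for en, sl in FEW_SHOT:
        messages += [{"role": "user", "content": en},
                     {"role": "assistant", "content": sl}]
    messages.append({"role": "user", "content": tweet})
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13", messages=messages, temperature=0)
    return response.choices[0].message.content
```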
2.3 Translation Results

During translation, the T5 model sometimes had trouble with examples containing multiple consecutive newline characters, and it occasionally dropped parts of texts it did not understand (mostly slang and various types of informal text). This shows that a 10B-parameter model is not large enough to robustly translate all registers of a language such as English into a less-resourced language such as Slovenian.

The GPT model, on the other hand, performed surprisingly well in most instances and seemed to have a more nuanced understanding of phrases used in online speech. It consistently translated entire texts, keeping the original structure and meaning. Consequently, we used GPT's translations when training the sarcasm detection models. The translations can be found in our repository (github.com/GalaxyGHz/Diploma).

3 Model Training

We tested the performance of a wide range of LLMs of different sizes; an overview is given in Table 1.

Table 1: Summary of used sarcasm detection models.

Model                           Parameters
SloBERTa                        110M
BERT-BASE-MULTILINGUAL-CASED    179M
XLM-RoBERTa-BASE                279M
XLM-RoBERTa-LARGE               561M
META-Llama-3.1-8B-INSTRUCT      8.03B
META-Llama-3.1-70B-INSTRUCT     70.6B
META-Llama-3.1-405B-INSTRUCT    406B
GPT-3.5-TURBO-0125              ?
GPT-4o-2024-05-13               ?

3.1 Encoder Models Under 1B Parameters

The four smallest models are encoder-based models that embed the input text and use a classification head to assign it a class. They required additional fine-tuning to perform sarcasm detection, and for these models we conducted hyperparameter optimization.

We chose the SloBERTa [7, 8] model in order to check whether using a monolingual Slovenian model impacts sarcasm detection performance. We also wanted to compare BERT- and RoBERTa-like models, so we used their multilingual variants and fine-tuned them on the Slovenian data.

The models were trained for a maximum of five epochs with early stopping: training was halted if the validation loss did not improve for two epochs.
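With the Hugging Face transformers library, this fine-tuning regime could be expressed roughly as follows; the model identifier EMBEDDIA/sloberta and the dataset variables train_ds and val_ds are assumptions for illustration, not settings reported in the paper.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
model = AutoModelForSequenceClassification.from_pretrained(
    "EMBEDDIA/sloberta", num_labels=2)  # sarcastic vs. non-sarcastic

args = TrainingArguments(
    output_dir="sarcasm-sloberta",
    num_train_epochs=5,            # at most five epochs, as in the text
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the checkpoint with the lowest loss
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # pre-tokenised splits (placeholders)
    eval_dataset=val_ds,
    # stop when validation loss fails to improve for two epochs
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```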
3.2 Llama 3.1 Models

Since the teams that competed in the 2022 shared task on sarcasm mostly used BERT and RoBERTa models, we extend the testing to include significantly larger models. We chose Meta's open-source Llama family, specifically the newest Llama 3.1 variants. These come in three different sizes, which is convenient for studying the effect of parameter count on sarcasm detection. We use the "instruct" version of all three models, since these were fine-tuned to follow instructions.

When prompting the Llama and GPT generative models, the following few-shot classification prompt was given, with two positive and two negative examples randomly sampled from our dataset:

You will be provided with text in the Slovenian language, and your task is to classify whether it is sarcastic or not. Use ONLY token 0 (not sarcastic) or 1 (sarcastic) as in the examples:
• Spanje? Kaj je to... Še nikoli nisem slišal za to? 1
• Lepo je biti primerjan z zidom 1
• To sploh nima smisla. Nehaj kopati. 0
• Dne 12. 10. 21 ob 10:30 je bil nivo reke 0,37 m. 0

We used full versions of the 8B and 70B parameter models, while the 405B parameter model was loaded in 16-bit precision mode. To minimize resource use and costs, we employed LoRA parameter-efficient fine-tuning.
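With the peft library, a LoRA setup along these lines could be used; the rank, scaling factor, and target modules below are illustrative defaults, as the paper does not report its exact LoRA configuration.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct")

# Illustrative LoRA hyperparameters (assumed, not taken from the paper).
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small fraction of weights train
```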
We provided the models with the training and validation sets and trained them for a maximum of 10 epochs. No hyperparameter optimization was conducted in this case due to time constraints. We used the validation loss to choose the best model, with the same early stopping technique as for the smaller models.

3.3 GPT 3 and 4 Models

We also tested two models offered on the OpenAI platform, GPT-4o-2024-05-13 and GPT-3.5-TURBO-0125. We first used them in few-shot mode and classified all our examples without any fine-tuning. For fine-tuning, the platform's tier system limited us to the smaller GPT-3.5-TURBO-0125 model, which we fine-tuned for a maximum of three epochs. In the end, we used the model with the lowest validation loss to classify the examples in the test set.

3.4 Sarcasm Detection Ensembles

When constructing ensemble models, we tried two techniques: stacking and voting. In both cases, we used the predicted probability of the sarcastic class from each model as the input features.

3.4.1 Stacking With Regularized Logistic Regression. Our first ensemble used a stacking approach, with logistic regression with ridge (L2) regularization as the meta-level classifier. This choice was motivated primarily by the need for feature selection: we wanted to identify the most important model predictions and determine which models would be assigned a lower weight. The best models were then used for voting.

3.4.2 Standard and Mixed Voting. The second ensembling method was voting. We tried cut-off-based mixed voting inspired by [9]. Formally, we used hard voting when the absolute difference between the number of sarcastic and non-sarcastic predictions was greater than n, and soft voting otherwise. When n is set to zero, this approach is equivalent to hard voting, and when n equals the predictor count, it is equivalent to soft voting; we report both results. We optimized the value of n based on the ensemble's performance on our validation set. Additionally, we compare the results of voting with all trained models against voting with only the models that received large weights in our regularized logistic regression ensemble. A sketch of both ensembling techniques follows.
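In this minimal sketch, probs and P_train are placeholders for the base models' predicted probabilities of the sarcastic class; it illustrates the described techniques rather than the authors' exact code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mixed_vote(probs: np.ndarray, n: int, threshold: float = 0.5) -> int:
    """Cut-off-based mixed voting for one example.

    probs holds each base model's predicted probability of the sarcastic
    class. n = 0 reduces to hard voting, n = len(probs) to soft voting."""
    sarcastic = int((probs >= threshold).sum())
    non_sarcastic = len(probs) - sarcastic
    if abs(sarcastic - non_sarcastic) > n:   # clear majority: hard voting
        return int(sarcastic > non_sarcastic)
    return int(probs.mean() >= threshold)    # otherwise: soft voting

# Stacking: an L2-regularised logistic regression over the base models'
# probabilities; P_train has shape (n_examples, n_models).
meta = LogisticRegression(penalty="l2").fit(P_train, y_train)
```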
4 Sarcasm Detection Results

Table 2 summarizes all our results. It is roughly sorted by model size, with smaller models on top and larger ones at the bottom. The (NFT) tag indicates that a model was not fine-tuned, while the (LoRA) tag means that a model was trained with LoRA. Results are rounded to three decimal places.

Table 2: Summary of performance results for all tested models.

Model                                   Accuracy   F1-score
SloBERTa                                0.621      0.632
BERT-BASE-MULTILINGUAL-CASED            0.499      0.666
XLM-RoBERTa-BASE                        0.578      0.579
XLM-RoBERTa-LARGE                       0.550      0.597
Llama-3.1-8B-INSTRUCT (NFT)             0.560      0.676
Llama-3.1-8B-INSTRUCT (LoRA)            0.569      0.682
Llama-3.1-70B-INSTRUCT (NFT)            0.660      0.724
Llama-3.1-70B-INSTRUCT (4-bit-LoRA)     0.637      0.717
Llama-3.1-405B-INSTRUCT (16-bit-NFT)    0.686      0.751
GPT-3.5-TURBO-0125 (NFT)                0.564      0.679
GPT-3.5-TURBO-0125                      0.749      0.760
GPT-4o-2024-05-13 (NFT)                 0.686      0.746
L2-LOGISTIC-REGRESSION                  0.759      0.765
L2-LOGISTIC-REGRESSION-NON-COMMERCIAL   0.707      0.749
HARD-VOTING-ALL                         0.670      0.738
SOFT-VOTING-ALL                         0.658      0.732
HARD-VOTING-BEST-5                      0.686      0.749
SOFT-VOTING-BEST-5                      0.686      0.749

Individual Model Performance

Of all the models used, only BERT-BASE-MULTILINGUAL-CASED failed to learn any pattern in our data and defaulted to the dummy classifier.

GPT-3.5-TURBO-0125 sometimes predicted the correct token but then continued to generate additional text, such as 11 and 1\n1. This happened for a small number of examples in our test set; we truncated these responses and kept only the first token as the answer. The Llama models sometimes refused to generate token zero or one; we dropped these examples altogether, report the test accuracy without them, and trained the ensemble models without them.

The smaller encoder models performed poorly compared to some of the larger models: only SloBERTa achieved an accuracy above 0.6. Despite being the smallest of the four small models tested, SloBERTa performed best among them. This suggests that the three larger multilingual encoder models may lack sufficient understanding of Slovenian, and it highlights that model size alone does not necessarily correlate with better sarcasm detection performance.

The Llama models fared better, achieving accuracies of up to 0.686, with the 405B model comparable to GPT-4o in performance. They still fell short of the fine-tuned GPT-3.5-TURBO-0125 model, which successfully classified about three-quarters of our examples, with an F1-score of 0.76.

Some models had significantly higher F1-scores but lower accuracies. Table 3 shows the confusion matrix of one of the models that exhibited the largest such difference. These models tend to incorrectly classify non-sarcastic text as sarcastic, leading to a high rate of false positives.

Table 3: Confusion matrix for the non-fine-tuned Llama-3.1-405B-INSTRUCT model.

Predicted \ Actual   Positive   Negative
Positive             202        123
Negative             11         91

Our testing also showed that loading the Llama-3.1-70B-INSTRUCT model in 4-bit mode and fine-tuning it with LoRA does not produce satisfactory results; it is thus better to conduct full fine-tuning with the smaller Llama model or to use one of OpenAI's models via their fine-tuning API.

GPT-3.5-TURBO-0125 performed best among the individual models, so if the costs associated with OpenAI's API are acceptable, we recommend it for sarcasm detection in Slovenian. This shows that very large models can effectively identify sarcasm. We believe that, with better parameter tuning, Llama 8B could be one of the best (and most economical) options for sarcasm detection in Slovenian, provided that the user has sufficient hardware resources.
constructing our voting ensembles since its dummy classification [3] Aditya Joshi, Pushpak Bhattacharyya, and Mark J. Carman. 2017. Automatic sarcasm detection: a survey. ACM Comput. Surv., 50, 5, Article 73, 22 pages. didn’t contribute to overall model performance. Both of these doi: 10.1145/3124420. two voting classifiers had an odd number of predictors, so there [4] Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, was no need for a tie-breaker mechanism. Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. [n. d.] Madlad- 400: a multilingual and document-level large audited dataset. In Proceedings Voting proved to be ineffective in our setups, even scoring of the 37th International Conference on Neural Information Processing Systems lower than some of its base models. hard voting generally out- Article 2940, 13 pages. [5] Bleau Moores and Vijay Mago. 2022. A survey on automated sarcasm detec- performed soft voting. We also note that there was no benefit tion on twitter. arXiv preprint. doi: 10.48550/arXiv.2202.02516. in using mixed voting, at least for the sets of predictors that we [6] Smaranda Muresan, Roberto Gonzalez-Ibanez, Debanjan Ghosh, and Nina obtained as hard voting always had a higher accuracy. This was Wacholder. 2016. Identification of nonliteral language in social media: a case study on sarcasm. Journal of the Association for Information Science and true for both the classifiers that used all and only five of the base Technology, 67, 11, 2725–2737. doi: 10.1002/asi.23624. models. [7] Matej Ulčar and Marko Robnik Šikonja. 2021. Sloberta: slovene monolingual Regularized logistic regression managed to improve on the large pretrained masked language model. In Proceedings of Data Mining and Data Warehousing, SiKDD, 17–20. http://library.ijs.si/Stacks/Proceedings/Inf scores of individual models, raising accuracy by one percent, thus ormationSociety/2021/IS2021_Volume_C.pdf . achieving the best performance out of all of the tested approaches. [8] Matej Ulčar and Marko Robnik Šikonja. 2021. Slovenian RoBERTa contextual This shows that there is still performance to be gained from embeddings model: SloBERTa 2.0. CLARIN.SI data & tools. Nasl. z nasl. zaslona. Fakulteta za računalništvo in informatiko. http://hdl.handle.net/11356/1397. ensembles; however, it is still necessary to use commercial models [9] Mengfei Yuan, Zhou Mengyuan, Lianxin Jiang, Yang Mo, and Xiaofeng Shi. for top performance. 2022. Stce at SemEval-2022 task 6: sarcasm detection in English tweets. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval- 2022), 820–826. doi: 10.18653/v1/2022.semeval-1.113. 5 Conclusion In this work, we presented the task of sarcasm detection in the less-resourced Slovenian language. Our code and results are freely 7 available . We tackled the missing dataset problem by using two LLMs to perform neural translation of an English dataset into Slove- nian. The translations generated by GPT-4o-2024-05-13 out- paced those generated by a large T5 model specifically trained for neural machine translation in terms of quality. We used this data to train a plethora of Transformer-based models in various settings. 
We found that fine-tuning GPT-3.5-TURBO-0125 via OpenAI's API results in the highest individual Slovenian sarcasm-detection performance, but we also note that a possible alternative could be local fine-tuning of the Llama-3.1-8B-INSTRUCT model. Our testing shows that aggressive quantization combined with LoRA results in significant performance degradation.

We also constructed ensemble models based on voting and stacking methods. Voting did not yield any performance improvements; on the other hand, stacking with a regularized logistic regression managed to improve on the performance of its base models.

Additional work needs to be done on dataset construction. Sarcastic examples could be extended with context or with labels of the types of sarcasm they represent, which might help guide models towards a better understanding of sarcasm. Future work could also explore incorporating heterogeneous models into ensembles, or creating Mixture-of-Experts (MoE) ensembles whose base models focus on different aspects of sarcasm.

Acknowledgements

This research was supported by the Slovenian Research and Innovation Agency (ARIS) core research programme P6-0411 and projects J7-3159, CRP V5-2297, L2-50070, and PoVeJMo.

References

[1] Ibrahim Abu Farha, Silviu Vlad Oprea, Steven Wilson, and Walid Magdy. 2022. SemEval-2022 task 6: iSarcasmEval, intended sarcasm detection in English and Arabic. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 802–814. doi: 10.18653/v1/2022.semeval-1.111.
[2] Tom Brown et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[3] Aditya Joshi, Pushpak Bhattacharyya, and Mark J. Carman. 2017. Automatic sarcasm detection: a survey. ACM Comput. Surv., 50, 5, Article 73, 22 pages. doi: 10.1145/3124420.
[4] Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. [n. d.] MADLAD-400: a multilingual and document-level large audited dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Article 2940, 13 pages.
[5] Bleau Moores and Vijay Mago. 2022. A survey on automated sarcasm detection on Twitter. arXiv preprint. doi: 10.48550/arXiv.2202.02516.
[6] Smaranda Muresan, Roberto Gonzalez-Ibanez, Debanjan Ghosh, and Nina Wacholder. 2016. Identification of nonliteral language in social media: a case study on sarcasm. Journal of the Association for Information Science and Technology, 67, 11, 2725–2737. doi: 10.1002/asi.23624.
[7] Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model. In Proceedings of Data Mining and Data Warehousing, SiKDD, 17–20. http://library.ijs.si/Stacks/Proceedings/InformationSociety/2021/IS2021_Volume_C.pdf
[8] Matej Ulčar and Marko Robnik-Šikonja. 2021. Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0. CLARIN.SI data & tools. http://hdl.handle.net/11356/1397.
[9] Mengfei Yuan, Zhou Mengyuan, Lianxin Jiang, Yang Mo, and Xiaofeng Shi. 2022. stce at SemEval-2022 task 6: sarcasm detection in English tweets. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 820–826. doi: 10.18653/v1/2022.semeval-1.113.


Speech-to-Service: Using LLMs to Facilitate Recording of Services in Healthcare

Maj Smerkol (maj.smerkol@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Rok Susič (rs36117@student.uni-lj.si), University of Ljubljana, Faculty of Mathematics and Physics, Ljubljana, Slovenia
Mariša Ratajec (mr97744@student.uni-lj.si), University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia
Helena Halbwachs (h.halbwachs@senecura.si), SeneCura Kliniken- und Heimebetriedsgesellschaft m.b.H., Vienna, Austria
Anton Gradišek (anton.gradisek@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.4550

Abstract

Digital tracking of services is one of the main administrative burdens of healthcare staff. Here, we present a proof-of-concept study of a so-called speech-to-service (S2S) system aimed at facilitating the recording of activities by extracting information from the conversation between a healthcare provider and a care recipient. The system comprises a speech recorder, a diarization component, an LLM that interprets the conversation, and a recommendation system integrated in a smart tablet, which records completed activities and suggests other activities that may still be required. We tested the system on 350 conversations and obtained 95% accuracy, 97% precision, and 94% recall.

Keywords

healthcare, LLM, speech recognition, recommendation system
1 Introduction

Healthcare workers, including nurses, technicians, and care personnel, form the backbone of the health system, as they care for patients and tend to their needs. However, the standardization and systematization of healthcare professions and services often brings a large bureaucratic burden, as healthcare workers have to record all the activities and services they provide to the patients. This process is needed, as it provides traceability and ensures that all the required activities were taken care of; the problem, however, is that the interfaces designed for activity logging are often not user-friendly and require the users to choose activities from extensive lists of drop-down menus. In total, this amounts to substantial time spent solely on tedious administrative tasks, time that could be spent more beneficially otherwise.

With the aim of alleviating the administrative burden of activity logging, we explored the possibilities of novel technologies to assist the healthcare staff in their logging tasks. We developed and tested a proof-of-concept system that records the conversation between the healthcare worker and a patient, identifies the activities, and allows the healthcare worker to batch-confirm them on a dedicated smart tablet. Batch-confirmation saves a lot of time by significantly lowering the number of clicks required in the UI. The system is built using open-source or publicly accessible components, particularly a speech-to-text system that transcribes the recorded conversation and a large language model (LLM) that leverages its natural-language-processing capabilities. The recommender system shows possibly required tasks, serving as a reminder and suggesting tasks that are expected soon, which may lower the number of visits per patient. These recommended tasks are then suggested to the healthcare worker, who can review and confirm them using the LLM-assisted interface. LLMs, such as ChatGPT and Llama, have seen a surge in popularity across a wide variety of topics, in particular since the unveiling of ChatGPT in the autumn of 2022.

Several LLM-based systems have been proposed recently, covering administrative task automation [6], decision-making support [10], improving existing automatic speech recognition (ASR) systems [1], and providing patients with needed information [9]. A recent study [11] concludes that utilising ASR to ease some administrative tasks leads to faster, more efficient work and even improves workers' moods.

2 System Architecture

This paper describes two early prototype systems, both aiming to alleviate the workload of healthcare workers by easing the task of documenting the care actions performed. These are an ASR system that logs care actions based on the captured dialogue between the healthcare worker and the patient, and a recommender system that predicts the services required at a specific time. The recommender system relies on historical data and is appropriate for long-term patient care facilities.

Both systems are limited in scope and only target the most common healthcare services in the dataset for detection or prediction, respectively. This can still greatly ease the workload of medical workers, since the top 10 most common tasks out of around 200 care-action types represent around 80% of all services performed.

The recommender system allows the care workers to anticipate tasks in advance and serves as a reminder. This aims to lower the number of patient visits, which also alleviates the workload.

2.1 Speech-to-Service ASR

The ASR system consists of a speech diarization model, capable of segmenting the recorded speech based on who is currently speaking; a speech transcription model that transcribes the audio to text; and an LLM fine-tuned to extract specific information from the text. Figure 1 shows the architecture of the prototype system. We employ the pretrained speaker-diarization model [4] (pyannote/speaker-diarization-3.1) for diarization, the pretrained Whisper model [7] for transcription, and a fine-tuned Llama3 model (meta/meta-llama-3-8b) [2] for information extraction, generating JSON-formatted output.

Figure 1: Overview of the ASR system.
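As an illustration of how such a pipeline can be wired together, the following is a minimal sketch using the two named pretrained models; the file name, the Whisper model size, and the placeholder HuggingFace token are assumptions, and the authors' actual integration code may differ.

```python
# Sketch: pyannote diarization -> per-segment Whisper transcription ->
# play-like transcript with (possibly ambiguous) speaker labels.
import whisper
from pyannote.audio import Pipeline

SAMPLE_RATE = 16_000  # whisper.load_audio resamples to 16 kHz

diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
asr = whisper.load_model("small")

audio = whisper.load_audio("conversation.wav")   # float32 waveform
diarization = diarizer("conversation.wav")

lines = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Cut out the diarized segment and transcribe it separately.
    segment = audio[int(turn.start * SAMPLE_RATE):int(turn.end * SAMPLE_RATE)]
    text = asr.transcribe(segment)["text"].strip()
    # Even an unreliable speaker label helps the downstream extractor.
    lines.append(f"{speaker}: {text}")

transcript = "\n".join(lines)  # fed to the fine-tuned Llama3 extractor
print(transcript)
```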
2.2 Recommender System

The recommender system prototype is based on machine-learning prediction of the events expected to occur in a certain time window for a specific patient, with the addition of tasks that commonly follow the predicted tasks. Due to the sensitive nature of the data, we base our predictions only on the time window, the patient ID, and the care type. We therefore consider multi-output binary classifiers that do not require large amounts of data for training. Additional tasks are added to the list based on a Markov chain model of tasks that commonly follow one another; e.g., the task 'clean table' follows the task 'lunch'.

The feature vector includes the time of day, day of week, week of month, and month of year as numbers, allowing the capture of periodic events with different periods. Due to the lack of patient data, we opted for personalized models, trained for each patient separately. We believe that the results can be further improved by adding more patient-related attributes. Model training used five months of collected data, with cross-validation, and accuracy was evaluated on the data collected during the sixth month. Because patients' medical states change over time, some data drift is expected, which is reflected in our results.
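A minimal sketch of the follow-up-task rule is given below, assuming task logs are available as per-patient chronological lists of task names; the example data, the probability threshold, and the function names are illustrative, not the authors' exact implementation.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Estimate first-order Markov transition counts between tasks."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current_task, next_task in zip(seq, seq[1:]):
            counts[current_task][next_task] += 1
    return counts

def follow_up_tasks(predicted, counts, min_prob=0.5):
    """Extend predicted tasks with tasks that commonly follow them."""
    extra = set()
    for task in predicted:
        total = sum(counts[task].values())
        for nxt, c in counts[task].items():
            if total and c / total >= min_prob:
                extra.add(nxt)
    return list(predicted) + sorted(extra - set(predicted))

history = [["lunch", "clean table", "medication"], ["lunch", "clean table"]]
counts = fit_transitions(history)
print(follow_up_tasks(["lunch"], counts))  # ['lunch', 'clean table']
```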
3 Dataset

To fine-tune the information-extraction model based on Llama3, we created a dataset of conversations in text form with appropriate outputs for each of them, as the task at hand is very specific and we did not find any existing appropriate dataset. We automated the process and manually removed any bad examples. A real dataset, ideally recorded in the target environment, is needed for the final implementation; LLM-generated datasets used for training LLMs are only appropriate in preliminary studies.

To generate the dataset, we prompted a BERT LLM (google-bert/bert-base-multilingual-cased) [5]. A training sample was generated by first randomly selecting 2 of the 10 target actions and programmatically generating the target output JSON. The BERT model was then tasked with generating a conversation in which these two tasks are mentioned as having been done. We generated several hundred conversations that way and manually checked for mistakes in the model output. Many conversations were removed because the selected actions were not mentioned, or for other reasons. Finally, the resulting dataset contains 350 conversations with JSON-formatted lists of tasks.

For the prediction of the services required during a visit, we acquired a log of all services performed in one long-term patient care facility over a period of 6 months, with the next version expanding to data from six facilities. The tasks in the dataset include measurements (body temperature, heart rate, blood pressure, ...), medical tasks (monitoring medicine intake, performing examinations, turning the patient in bed), and care tasks (breakfast, lunch, cleaning). Over 200 different tasks are mentioned.

The dataset includes limited patient information: the patient ID, the care type, and a detailed chronological history of services received. The care types (CareType I, CareType II, CareType III/A, CareType III/B, CareType IIII) represent an estimate of how much assistance a person requires. Legal restrictions on accessing sensitive health data prevented us from obtaining more detailed patient records, so we developed prediction models based on these limited data points, balancing accuracy with regulatory constraints.

The data preprocessing involved determining each patient's presence in the facility by identifying the timestamps of their first and last recorded service. Patients with a stay of less than four months were excluded from the analysis to ensure sufficient data for reliable predictions.

4 Methods

This section describes the methodology used to develop the ASR system and the recommender system.

4.1 Clustering

The primary goal of the clustering process was to group patients with similar patterns in terms of the type and frequency of the services they received, allowing us to predict relevant services more effectively for each cluster (since it was not clear, even among experts, whether the care type and the actual care provided were correlated).

The clusters, as shown in Figure 2, demonstrate that patients within the same care type tend to receive similar services. Some deviations, where multiple classifications appear within a cluster, are likely due to temporary conditions we could not fully exclude: for instance, an individual categorized under CareType II may temporarily receive services typical of CareType III/A (e.g., due to a broken arm), while their care type classification remains unchanged. Despite this, the care types differentiate well enough across clusters, leading us to use the care type as one of the key attributes for further service predictions.

In the clustering process, we excluded CareType IIII, because this group is characterized by highly diverse healthcare needs due to specific diseases, and experts advised us to omit it from this part of the analysis.

Figure 2: Clustering of patients closely aligns with pre-existing care type assignments, ranging from minimal personal assistance (CareType I) to moderate assistance (CareType II), and full or intensive personal assistance (CareTypes III/A and III/B) for those with more severe care needs.

4.2 Recommender System

To recommend the required services, we constructed the training dataset using a detailed log of care actions performed over a 6-month period. For each patient, the data was divided into consecutive 4-hour time windows. In each window, we examined whether specific care actions were performed, marking them as "positive" if they occurred within that time frame. This granular approach allowed us to capture the temporal dynamics of service delivery, ensuring that for each time window we had a clear record of the services provided. As a result, we generated over 1,000 labeled examples per patient, with each example representing a specific time window and its associated care actions. This enabled the model to learn patterns in service requirements throughout the day and week.

To identify the best predictive model, we evaluated various classification algorithms, including Random Forest, Decision Tree, K-Neighbors, Support Vector Classifier (SVC), Gradient Boosting, and Naive Bayes. Each model was trained using a multi-output classification approach, with features including the frequency of the top services provided and the relevant time attributes. To ensure robust model evaluation, we implemented 5-fold cross-validation and subsequently tested the models on the sixth month's data to assess their predictive performance.
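The following is a minimal sketch of this per-patient setup, assuming one patient's log with 'timestamp' and 'service' columns; the column names, the service list, and the choice of k are illustrative assumptions, not the authors' exact configuration.

```python
import pandas as pd
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

TOP_SERVICES = ["breakfast", "lunch", "blood pressure"]  # top-k task types

def windowed_dataset(log: pd.DataFrame) -> pd.DataFrame:
    """Turn one patient's service log into 4-hour-window training rows."""
    log = log.set_index("timestamp").sort_index()
    windows = log.groupby(pd.Grouper(freq="4h"))["service"].agg(set)
    df = pd.DataFrame(index=windows.index)
    df["hour"] = df.index.hour                       # calendar features
    df["day_of_week"] = df.index.dayofweek
    df["week_of_month"] = (df.index.day - 1) // 7 + 1
    df["month"] = df.index.month
    for s in TOP_SERVICES:                           # one binary target each
        df[s] = windows.apply(
            lambda done: int(isinstance(done, set) and s in done)).values
    return df

def train(df: pd.DataFrame):
    X = df[["hour", "day_of_week", "week_of_month", "month"]]
    y = df[TOP_SERVICES]
    model = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=5))
    return model.fit(X, y)
```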
4.3 Speech Recognition and Information Extraction

Due to the limited availability of training data, only the information-extraction model based on Llama3 was fine-tuned, using few-shot LoRA (low-rank adapter) supervised training. The diarization and transcription models are used unchanged.

The diarization model used is speaker-diarization [4]. Initial experiments with few-shot LoRA fine-tuning [3] did not improve the performance, hinting at the need for a larger training dataset. The model's performance is satisfactory at the task of segmentation, but less so at identifying which segments belong to which speaker, especially for longer conversations. For a two-speaker situation, the model seems to assume the speakers take turns speaking, which causes mistakes when a single speaker pauses before continuing to speak.

The transcription model used is Whisper [8]. The model transcribes each segment separately. As mentioned above, the speakers are not robustly recognised, and we cannot reliably assign a speaker to each line of text. Still, labeling each line of text even with an ambiguous label improves the downstream task of information extraction. The transcribed lines of text are concatenated, and at the start of each utterance a label marking it as such is added. Thus, the transcribed text resembles a play with unknown characters speaking.

The information-extraction model is Llama3 [2], fine-tuned using LoRA few-shot fine-tuning. Our approach was to fine-tune the model for the task of extracting information about specific care actions and generating the output in JSON format, providing structured data directly. A small training dataset was prepared as described in Section 3.

5 Results and Discussion

5.1 LLM-Based Information Retrieval Model

The Llama3-based information-extraction model was evaluated using 5-fold cross-validation, achieving 95% accuracy, 97% precision, and 94% recall. For evaluation, the model's JSON-formatted strings were deserialized into objects and tested against the known correct objects, so that the results could be interpreted as multi-label binary classification.

The LLM information-extraction model sometimes generates invalid JSON after fine-tuning, most commonly due to duplicated keys or to getting stuck in a loop, generating the same elements until the maximum output size is reached. The generated strings are therefore post-processed to fix these mistakes via simple string manipulation; however, this indicates that experiments with different output formats, or with avoiding free-form generation of the answers, should be performed.

The whole ASR pipeline, including diarization and transcription, has not yet been evaluated and falls within the scope of future work.
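The exact post-processing is not specified in the paper, so the following is only an illustrative sketch of this kind of string-level repair, assuming the desired output is a single JSON object: it takes the first complete JSON value from the generation (discarding a looping tail) and collapses duplicated keys.

```python
import json

def repair_llm_json(text: str):
    """Best-effort recovery of one JSON object from an LLM generation."""
    # dict(pairs) silently collapses duplicated keys (later ones win).
    decoder = json.JSONDecoder(object_pairs_hook=lambda pairs: dict(pairs))
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode stops after the first valid value, ignoring any
        # repeated elements the model appended while stuck in a loop.
        obj, _ = decoder.raw_decode(text[start:])
        return obj
    except json.JSONDecodeError:
        return None

print(repair_llm_json('{"tasks": ["lunch"], "tasks": ["lunch"]} {"tasks"'))
# -> {'tasks': ['lunch']}
```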
5.2 Recommender System

Tables 1 and 3 present the classification results. Table 1 reports the average performance across all patients, including standard deviations, for the different models, while Table 3 shows classification accuracy by care type, with averages and standard deviations across all patients within each care type, based on the model with the best results, which in this case is K-Neighbors (KNN).

Results are reported in two ways: Tables 1 and 3 show accuracy over all target attributes, counting a prediction as correct only when all targets are predicted correctly, while Table 2 shows the average of the accuracies for each target attribute.

Table 1: Cross-validation and test accuracy (mean ± standard deviation) across all patients for various classification models.

Model             CV Accuracy   Test Accuracy
RandomForest      0.71 ± 0.14   0.66 ± 0.16
DecisionTree      0.65 ± 0.16   0.66 ± 0.16
KNeighbors        0.73 ± 0.13   0.71 ± 0.16
SVC               0.63 ± 0.12   0.63 ± 0.14
GradientBoosting  0.68 ± 0.12   0.66 ± 0.15
NaiveBayes        0.57 ± 0.17   0.55 ± 0.20

The K-Neighbors (KNN) classifier outperformed the other models, achieving an average CV accuracy of 73%, a test accuracy of 71%, and an R2 score of 0.44. This made it the most effective model for predicting service plans. Random Forest also performed reasonably well, achieving a test accuracy of 66%, though it did not surpass KNN in overall performance.

Table 2: Majority class percentage and task-wise test accuracy (mean ± standard deviation) across all patients for various classification models.

Model             Majority Class Percentage   Task-wise Accuracy
RandomForest      0.72 ± 0.19                 0.89 ± 0.10
DecisionTree      0.72 ± 0.19                 0.89 ± 0.11
KNeighbors        0.72 ± 0.19                 0.91 ± 0.10
SVC               0.65 ± 0.16                 0.88 ± 0.10
GradientBoosting  0.65 ± 0.16                 0.89 ± 0.09
NaiveBayes        0.72 ± 0.19                 0.85 ± 0.15

Table 3: Classification performance of K-Neighbors (KNN) by care type, showing cross-validation and test accuracy (mean ± standard deviation), averaged across all patients within each care type.

CareType        CV Accuracy   Test Accuracy
CareType I      0.79 ± 0.12   0.76 ± 0.16
CareType II     0.79 ± 0.11   0.78 ± 0.13
CareType III/A  0.68 ± 0.13   0.66 ± 0.15
CareType III/B  0.70 ± 0.14   0.68 ± 0.17
CareType IIII   0.68 ± 0.10   0.67 ± 0.12

Since all predictive accuracy values exceed the 70% majority-class baseline, this is an excellent result. In multi-label classification, where multiple services are predicted simultaneously, it is important to focus not only on the overall accuracy but also on the accuracy of each individual task. By achieving 90% accuracy on the most common tasks, the model ensures that key services are reliably predicted.

The lower test accuracy compared to cross-validation can be explained by temporal changes in patient conditions, as the test set only included the last month of data. As patient care needs shift over time, predicting long-term patterns is more challenging than shorter-term cross-validation, where care remains more stable.

The test accuracy also reflected noticeable differences across care types. CareType I and CareType II showed higher accuracy rates, while more complex types, such as CareType III/A, III/B, and IIII, exhibited a drop in accuracy of around 10%. This is likely due to the more diverse and unpredictable care needs in these groups, making service prediction more challenging.

This approach, particularly with the strong performance of our K-Neighbors (KNN) model, demonstrated the potential of machine learning to enhance personalized planning in healthcare. In future work, including additional patient-specific features beyond time-based data, such as health-related attributes, could further improve accuracy, particularly for the more complex care types.
6 Conclusions

This is early work, and further improvements are underway. The whole ASR pipeline needs to be evaluated, and we expect noticeably worse performance compared to the information-extraction model alone, due to the larger complexity and the possibility of failure at each step. The information-retrieval model itself is not inefficient in terms of the computational time and memory required, but the diarization and transcription steps are. The required-service prediction should also be further improved; with the current dataset, an alternative approach that may improve performance is the use of sequence-modelling or event-prediction approaches. Finally, the two models could work in tandem: predicting the required actions and using that information in the ASR pipeline could be beneficial.

Based on the proof-of-concept study, we conclude that the suggested approach is in principle feasible and can be beneficial to healthcare providers. However, in view of regulations, special caution has to be paid during the implementation of any such system in a real-world setting. Recording and diarizing conversations between healthcare staff and patients is likely to include highly personal data, which falls under the relevant EU legislation, specifically the GDPR (General Data Protection Regulation, https://gdpr-info.eu/) and the EU AI Act (Regulation (EU) 2024/1689, https://artificialintelligenceact.eu/the-act/). Furthermore, indiscriminately recording conversations and feeding them into an LLM will likely be considered "high risk" in view of the AI Act. This means that implementing such services will require extensive screening, documentation, a clear division of ownership and access roles, and other compliance with legal requirements.

Acknowledgements

We thank the healthcare provider organization for the dataset and for insightful discussions.

References

[1] Ayo Adedeji, Sarita Joshi, and Brendan Doohan. 2024. The sound of healthcare: improving medical transcription ASR accuracy with large language models. arXiv preprint arXiv:2402.07658.
[2] AI@Meta. 2024. Llama 3 model card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[3] Shamil Ayupov and Nadezhda Chirkova. 2022. Parameter-efficient finetuning of transformers for source code. arXiv, abs/2212.05901. https://api.semanticscholar.org/CorpusID:254564456.
[4] Hervé Bredin and Antoine Laurent. 2021. End-to-end speaker segmentation for overlap-aware resegmentation. In Proc. Interspeech 2021, Brno, Czech Republic (Aug. 2021).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805. http://arxiv.org/abs/1810.04805.
[6] Senay A. Gebreab, Khaled Salah, Raja Jayaraman, Muhammad Habib ur Rehman, and Samer Ellaham. 2024. LLM-based framework for administrative task automation in healthcare. In 2024 12th International Symposium on Digital Forensics and Security (ISDFS), 1–7. doi: 10.1109/ISDFS60797.2024.10527275.
[7] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. doi: 10.48550/ARXIV.2212.04356.
[8] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. doi: 10.48550/ARXIV.2212.04356.
[9] Prakasam S, N. Balakrishnan, Kirthickram T R, Ajith Jerom B, and Deepak S. 2023. Design and development of AI-powered healthcare WhatsApp chatbot. In 2023 2nd International Conference on Vision Towards Emerging Trends in Communication and Networking Technologies (ViTECoN), 1–6. https://api.semanticscholar.org/CorpusID:259280109.
[10] Raja Vavekanand, Pinja Karttunen, Yue Xu, Stephanie Milani, and Huao Li. 2024. Large language models in healthcare decision support: a review.
[11] Markus Vogel, Wolfgang Kaisers, Ralf Wassmuth, and Ertan Mayatepek. 2015. Analysis of documentation speed using web-based medical speech recognition technology: randomized controlled trial. Journal of Medical Internet Research, 17, 11, e247.
Performance Comparison of Axle Weight Prediction Algorithms on Time-Series Data

Žiga Kolar (ziga.kolar@ijs.si), Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
David Susič (david.susic@ijs.si), Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
Martin Konečnik (martin.konecnik@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Domen Prestor (domen.prestor@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Tomo Pejanovič Nosaka (tomo.pejanovic@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Bajko Kulauzović (bajko@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Jan Kalin (jan.kalin@zag.si), Zavod za gradbeništvo Slovenije, Dimičeva ulica 12, Ljubljana, Slovenia
Matjaž Skobir (matjaz.skobir@cestel.si), Cestel Cestni Inženiring d.o.o., Špruha 32, Trzin, Slovenia
Matjaž Gams (matjaz.gams@ijs.si), Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.4752

Abstract

Accurate vehicle axle weight estimation is essential for the maintenance and safety of transportation infrastructure. This study evaluates and compares the performance of various algorithms for axle weight prediction using time-series data. The algorithms assessed include traditional machine learning models (e.g., random forest) and advanced deep learning techniques (e.g., convolutional neural networks). The evaluation utilized datasets comprising time-series data from 10 sensors positioned on a single lane of a bridge, with the goal of predicting each vehicle's axle weights based on the signals from these sensors. Each algorithm's performance was measured against the OIML R-134 recommendation, where a prediction was classified as accurate if the error was within ±4 percent for two-axle vehicles and within ±8 percent for vehicles with more than two axles. Tests were conducted on several bridges, with this paper presenting detailed results from the Lopata bridge. The findings indicate that deep learning models, particularly convolutional neural networks, significantly outperform traditional methods in terms of accuracy and their ability to adapt to complex patterns in time-series data.
This study provides a valuable reference for researchers and practitioners aiming to enhance axle weight prediction systems, thereby contributing to more effective infrastructure management and safety monitoring.

Keywords

time-series data, axle weight, machine learning, neural network

1 Introduction

Accurate axle weight prediction plays a pivotal role in the maintenance and safety of transportation infrastructure [7]. The precise estimation of axle weights is essential for various applications, including road maintenance planning, traffic management, and the prevention of overloading, which can lead to premature road wear and increased accident risks [8]. Traditional methods for axle weight measurement often rely on static scales or weigh-in-motion (WIM) systems. While these methods provide direct measurements, they are susceptible to limitations such as high installation and maintenance costs, potential measurement inaccuracies due to environmental factors, and the need for frequent calibration.

In recent years, the advent of advanced computational techniques has opened new avenues for improving axle weight prediction. Machine learning (ML) and deep learning (DL) algorithms, in particular, offer promising alternatives by leveraging time-series data to model the complex, non-linear relationships inherent in vehicular weight patterns. These methods can enhance prediction accuracy, handle large volumes of data, and adapt to varying conditions, making them suitable for real-world applications where traditional methods may fall short.

This study systematically evaluates and compares the performance of various axle weight prediction algorithms using time-series data. We focus on a diverse set of algorithms, including machine learning models like random forests (RF) [6] and advanced deep learning techniques such as convolutional neural networks (CNN) [4].

The objective of this research is to explore the potential of combining traditional WIM systems with advanced ML and DL models to enhance axle weight predictions. By comparing the performance of different methodologies, including the SIWIM traditional model, a random forest (IJS RF), a hybrid approach (AVERAGE(IJS, SIWIM traditional)), and a CNN-based model, this study aims to identify the most effective strategies for accurate and reliable axle weight estimation. Additionally, it examines the impact of synthetic data generation on the performance of these models, providing a comprehensive evaluation of their practical applicability in real-world scenarios.

The study aimed to predict the axle weights of vehicles using ten input signals from sensors placed under the Lopata bridge. Each predictive algorithm's performance was evaluated according to the OIML R-134 recommendation, under which a prediction is deemed accurate if the error margin for the axle weight is within ±4% for vehicles with two axles and within ±8% for vehicles with more than two axles.

The dataset comprised 1478 samples, i.e., passings of a vehicle, each containing 10 signals per vehicle. For each sample, a static weight for each axle was assigned as the target value. Static weight refers to the weight measured by a scale when the vehicle is stationary.

This paper is structured as follows. Section 2 reviews several state-of-the-art approaches. Section 3 details the preprocessing steps necessary before applying machine learning methods. In Section 4, the algorithms used for predicting axle weights are presented. Section 5 presents the final results of the axle weight predictions. Finally, Section 6 summarizes the findings and proposes ideas for future research.

2 Related Work

The prediction of axle weights using time-series data has often been studied in recent years, resulting in a substantial body of related work. Below, several state-of-the-art (SOTA) approaches are described.

Zhou et al. [10] differentiated between high-speed and low-speed weigh-in-motion (WIM) systems and analyzed the characteristics of axle weight signals. They proposed a nonlinear curve-fitting algorithm, detailing its implementation. Numerical simulations and field experiments assessed the method's performance, demonstrating its effectiveness with maximum weighing errors for the front axle, rear axle, and gross weights recorded at 4.01%, 5.24%, and 3.92%, respectively, at speeds of 15 km/h or lower.

Wu et al. [8] introduced a modified encoder-decoder architecture with a signal-reconstruction layer to identify vehicle properties (velocity, wheelbase, axle weight) using the bridge's dynamic response. This unsupervised encoder-decoder method extracts higher-level features from the original data. A numerical bridge model based on vehicle-bridge coupling vibration theory demonstrated the method's applicability. The results indicated that the proposed approach accurately predicts traffic loads without additional sensors or vehicle weight labels, achieving better stability and reliability even with significant data pollution.
Xu et al. [9] applied a wavelet transform for denoising and reconstructing the WIM signal, and used a back-propagation (BP) neural network optimized by the brain storm optimization (BSO) algorithm to process the WIM signal. Comparing the predictive abilities of BP neural networks optimized by different algorithms, they found the BSO-BP WIM model to exhibit fast convergence and high accuracy, with a maximum gross weight relative error of 1.41% and a maximum axle weight relative error of 6.69%.

Kim et al. [5] developed signal analysis algorithms using artificial neural networks (ANN) for bridge weigh-in-motion (B-WIM) systems. Their procedure involved extracting information on vehicle weight, speed, and axle count from time-domain strain data. ANNs were selected for their effectiveness in incorporating dynamic effects and bridge-vehicle interactions. Vehicle experiments with various load cases were conducted on two bridge types: a simply supported pre-stressed concrete girder bridge and a cable-stayed bridge. High-speed and low-speed WIM systems were used to cross-check and validate the algorithms' performance.

Bosso et al. [1] proposed a method using weigh-in-motion (WIM) data and regression trees to identify patterns in overloaded truck weights and travel. The analysis reveals that truck type is the key predictor of overloading, while the time of day is crucial for axle overloading, with most incidents occurring late at night or early in the morning. These insights can enhance enforcement strategies and inform pavement management and design, optimizing infrastructure longevity and safety.

He et al. [2] introduced a new method that uses only the flexural strain signals from weighing sensors to identify axle spacing and weights, reducing installation costs and expanding BWIM applications. The method's accuracy is validated through numerical simulations and laboratory experiments with a scaled vehicle-bridge interaction model, showing promising results for accurate axle spacing and weight identification.

3 Data Preprocessing

Before applying various algorithms to the dataset, several preprocessing steps were necessary. Due to the differing lengths of the signals from each sample, padding was performed to standardize them to the length of the longest signal. Samples with a gross weight below 5 kN were excluded from both the training and test datasets. Each signal was cropped by removing data to the left of the leftmost peak value minus 100 and to the right of the rightmost peak value plus 100; the peak values were calculated in advance.

To address the limited availability of data required for deep learning, which typically necessitates tens of thousands of samples for effective training, synthetic data generation was employed. The original dataset comprised 1,478 samples (from January 2022 to December 2023), i.e., passings of a vehicle, each containing 10 signals per vehicle. An additional 20,000 synthetic samples were generated with the following algorithm: in each of 20,000 iterations, a random training sample and a random strain factor were selected, where the strain factor is a random value between 0.5 and 0.99. The signals of the selected training sample were then scaled by the chosen strain factor. This scaling models the property that doubling the amplitude of the signal corresponds to doubling the weight.
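A minimal sketch of this augmentation loop follows, assuming the signals are stored as an array of shape (n_samples, 10, length) and that the target axle weights are scaled by the same strain factor, which is how we read the stated amplitude-weight proportionality; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def synthesize(signals, axle_weights, n_new=20_000):
    """Generate synthetic samples by amplitude scaling.

    signals: array (n_samples, 10, length); axle_weights: per-axle targets.
    """
    synth_x, synth_y = [], []
    for _ in range(n_new):
        i = rng.integers(len(signals))      # pick a random training sample
        factor = rng.uniform(0.5, 0.99)     # random strain factor
        # Scaling the amplitude by `factor` models a vehicle whose axle
        # weights are scaled by the same factor.
        synth_x.append(signals[i] * factor)
        synth_y.append(axle_weights[i] * factor)
    return np.stack(synth_x), np.stack(synth_y)
```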
A crucial aspect of the data preprocessing involved the normalization of the sensor signals to ensure uniformity across the dataset. Each signal was normalized to have a mean of zero and a standard deviation of one, which helps improve the convergence of machine learning algorithms by ensuring that each feature contributes equally to the learning process.

The selection of training and test data was conducted using a rolling-window approach [3]. Specifically, for each testing month, the training data comprised all available data up to, but not including, the testing month. For instance, if May 2023 was designated as the testing month, the training dataset consisted of data from January 2022 through April 2023. This process was systematically repeated for each testing month from March 2022 to December 2023.

4 Methodology

Four methods were identified as applicable for predicting vehicle axle weights. The first method, known as SIWIM traditional [11], calculated the number of axles, the axle lengths, and the axle weights by utilizing influence lines to model the signal and determine the correct output. For validation purposes, each predicted output was stored in a separate file alongside the signal data, enabling direct comparison with the actual values.

The second method used a random forest [6] (named IJS RF) to predict the vehicle axle weights. The model relied on accurately identifying the positions of the peaks to function correctly. Peak values were determined using the find_peaks method from the SciPy library, which identifies peaks based on a specified minimum height. Once the peaks were identified, the algorithm extracted the values within a ±5 range of each peak. These extracted values were then used as input features for the random forest model. Additionally, the random forest model incorporated temperature, axle distances, and gross weight as input features. Random forest algorithms are not inherently suited to time-series data; however, they perform effectively with numerical data such as temperature, axle distance, and gross weight, which is why this algorithm was chosen for this type of input.
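A minimal sketch of the peak-window feature extraction follows, using SciPy's find_peaks as named in the text; the minimum height and the way the auxiliary features are appended are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_features(signal, min_height=0.1, half_window=5):
    """Collect signal values within +/-5 samples of each detected peak."""
    peaks, _ = find_peaks(signal, height=min_height)
    feats = []
    for p in peaks:
        lo = max(p - half_window, 0)
        hi = min(p + half_window + 1, len(signal))
        feats.extend(signal[lo:hi])
    return np.array(feats)

# The auxiliary numeric features are appended to the peak-window values:
# features = np.concatenate(
#     [peak_features(sig), [temperature, gross_weight, *axle_distances]])
```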
The third method integrated the first two approaches by averaging the outputs of the SIWIM traditional and IJS RF models (named AVERAGE(IJS, SIWIM traditional)). This approach is motivated by the observation that combining multiple models can often yield more accurate results than relying on a single model alone [12].

The final method employed a convolutional neural network (CNN) to predict axle weights. The CNN utilized synthetic data, as detailed in Section 3, during the training phase. This method processed all 10 signals as input to calculate the axle weights. The detailed architecture of the CNN is shown in Figure 1. 2D convolutional layers (Conv2D) were used instead of 1D convolutional layers because the input data consist of 10 sensor signals. The number of filters and the kernel size are specified within the parentheses of each Conv2D layer, while the pooling size is defined in the parentheses of each 2D max-pooling layer (MaxPooling2D). The last dense layer has 100 neurons. To mitigate overfitting, a dropout layer was added after the final dense layer, and batch normalization was applied after each 2D convolutional layer.

Although long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks could be used for this task, a CNN was chosen instead because of its strengths in capturing spatial hierarchies and local patterns within the data. CNNs are highly effective at extracting local features and detecting patterns, while LSTMs and GRUs are better suited to handling temporal dependencies, which are less relevant to this specific task.

Figure 1: Architecture of the CNN for predicting axle weights.
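Since the exact filter counts and kernel sizes are only given in Figure 1, the following Keras sketch reproduces just the general shape of the described network (Conv2D plus batch-normalization blocks, 2D max pooling, a 100-neuron dense layer, and dropout); the specific layer sizes, the signal length, and the fixed number of axle outputs are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

SIGNAL_LEN, N_SENSORS, MAX_AXLES = 512, 10, 5   # illustrative values

model = keras.Sequential([
    layers.Input(shape=(N_SENSORS, SIGNAL_LEN, 1)),
    layers.Conv2D(32, (3, 7), padding="same", activation="relu"),
    layers.BatchNormalization(),                 # after each Conv2D block
    layers.MaxPooling2D((1, 2)),                 # pool along the time axis
    layers.Conv2D(64, (3, 5), padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((1, 2)),
    layers.Flatten(),
    layers.Dense(100, activation="relu"),        # 100-neuron dense layer
    layers.Dropout(0.5),                         # dropout after final Dense
    layers.Dense(MAX_AXLES),                     # one regression output per axle
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```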
5 Results

The results of each method described in Section 4 are illustrated in Figure 2. Among the methods evaluated, SIWIM traditional exhibited the poorest performance, with fluctuating trends observed throughout the entire two-year period. The CNN began to outperform the other three approaches after December 2022. Conversely, the AVERAGE(IJS, SIWIM traditional) method showed superior performance during the initial testing months, from March 2022 to June 2022.

Figure 2: Accuracies of all algorithms for each testing month.

The performance of the CNN improved with an increasing amount of data, whereas the IJS RF and AVERAGE(IJS, SIWIM traditional) methods were more effective during the initial phase, when less training data was available. However, the improvement in the CNN's accuracy was not linear. This non-linear trend can be attributed to the random initialization of the CNN's weights before each training session, which occasionally leads to suboptimal convergence.

An additional analysis compared the performance of the models under varying environmental conditions, such as temperature fluctuations and differing traffic patterns. This analysis revealed that the CNN model maintained its accuracy more consistently across different conditions, indicating its robustness and adaptability. Furthermore, the inclusion of synthetic data in training the CNN model contributed to its superior performance, as it allowed the model to learn from a more diverse set of examples. Future research should focus on expanding the range of synthetic data and exploring additional ensemble techniques to further enhance prediction accuracy.

Despite the high accuracy of the CNN model, with the highest accuracy reaching 0.94, this most accurate method still falls short of meeting the OIML R-134 recommendation, by 4.4%. Furthermore, the results show that more static data may be needed for the learning phase: 1,000 static samples, even augmented, might not be sufficient to reach the OIML R-134 recommendation.

In summary, the results indicate that while traditional methods such as IJS RF and AVERAGE(IJS, SIWIM traditional) perform well with limited data, convolutional neural networks (CNNs) demonstrate superior performance as more data becomes available, despite some variability in their convergence. In addition, a sufficient number of training examples is needed to approach the desired OIML R-134 recommendation.

6 Conclusion and Discussion

In this study, a performance comparison of various axle weight prediction algorithms was conducted using time-series data collected from 10 sensors positioned on the Lopata bridge. The algorithms evaluated encompassed traditional machine learning models, such as random forests, and advanced deep learning techniques, notably convolutional neural networks.

The major findings reveal that CNNs achieved significantly better results in predicting axle weights during the latter months of the experiment. The CNNs' ability to adapt to and learn from complex patterns within the time-series data was a key factor in their superior performance. Despite achieving a peak accuracy of 0.94, the CNN model still falls short of meeting the OIML R-134 recommendation by 4.4%.

Overall, there are three implications of this study. First, it demonstrates the potential of deep learning techniques to enhance the accuracy of axle weight predictions where sufficient data is available, thereby facilitating more reliable infrastructure management. Second, for smaller datasets, it is more effective to use classical machine learning systems in combination with methods like SIWIM traditional. Third, it provides a valuable benchmark for researchers and practitioners, guiding the development and implementation of more effective axle weight prediction systems.

To achieve the OIML R-134 recommendation, two options are possible:

• Add more data: if the trend continues, adding another half a year of measurements would enable achieving the standard. Another option would be to take measurements on a bridge with more traffic.
• Improve the methods by incorporating advanced ensemble techniques.
To introduce the ensemble approaches, one potential improvement involves modeling each sensor individually. This approach entails building a separate CNN model for each of the ten sensors, allowing for more specialized and potentially more accurate predictions from each sensor's data. By focusing on the unique characteristics and data patterns of each sensor, the models can be better tailored to capture the specific nuances in the time-series data.

After developing individual models for each sensor, the next step would be to combine the predictions of these models into a single final prediction. This can be achieved using an ensemble method, such as a random forest, which would take the ten individual predictions (one from each sensor model) as input features and produce a consolidated final axle weight prediction.

This method not only holds the potential to improve the accuracy and robustness of the axle weight predictions but also provides a scalable framework that can be adapted to different datasets and sensor configurations. Future work should explore the implementation of this approach, including the optimization of the individual sensor models and the integration of their predictions through an ensemble method.

By advancing the CNN model in this manner, it is anticipated that the performance gap relative to the OIML R-134 recommendation could be further reduced, bringing the predictions closer to the required accuracy levels with a smaller amount of data and enhancing the overall efficacy of the axle weight prediction system.

Acknowledgements

This study received funding from the company Cestel. The authors acknowledge funding from the Slovenian Research and Innovation Agency (ARIS), Grant PR-10495, and basic core funding P2-0209. The authors made use of ChatGPT to assist with this article; it was employed as a tool for enhancing the language of the initial draft, without altering the length of the text. ChatGPT-4 was accessed at chatgpt.com and used with modification in July 2024.

References

[1] Mariana Bosso, Kamilla L Vasconcelos, Linda Lee Ho, and Liedi LB Bernucci. 2020. Use of regression trees to predict overweight trucks from historical weigh-in-motion data. Journal of Traffic and Transportation Engineering (English Edition), 7, 6, 843–859.
[2] Wei He, Tianyang Ling, Eugene J OBrien, and Lu Deng. 2019. Virtual axle method for bridge weigh-in-motion systems requiring no axle detector. Journal of Bridge Engineering, 24, 9, 04019086.
[3] Hamed Kalhori, Mehrisadat Makki Alamdari, Xinqun Zhu, Bijan Samali, and Samir Mustapha. 2017. Non-intrusive schemes for speed and axle identification in bridge weigh-in-motion systems. Measurement Science and Technology, 28, 2, 025102.
[4] Teja Kattenborn, Jens Leitloff, Felix Schiefer, and Stefan Hinz. 2021. Review on convolutional neural networks (CNN) in vegetation remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing, 173, 24–49.
[5] Sungkon Kim, Jungwhee Lee, Min-Seok Park, and Byung-Wan Jo. 2009. Vehicle signal analysis using artificial neural networks for a bridge weigh-in-motion system. Sensors, 9, 10, 7943–7956.
[6] Steven J Rigatti. 2017. Random forest. Journal of Insurance Medicine, 47, 1, 31–39.
[7] Mohhammad Sujon and Fei Dai. 2021. Application of weigh-in-motion technologies for pavement and bridge response monitoring: state-of-the-art review. Automation in Construction, 130, 103844.
[8] Yuhan Wu, Lu Deng, and Wei He. 2020. BwimNet: a novel method for identifying moving vehicles utilizing a modified encoder-decoder architecture. Sensors, 20, 24, 7170.
[9] Suan Xu, Xing Chen, Yaqiong Fu, Hongwei Xu, and Kaixing Hong. 2022. Research on weigh-in-motion algorithm of vehicles based on BSO-BP. Sensors, 22, 6, 2109.
[10] ZF Zhou, P Cai, and RX Chen. 2007. Estimating the axle weight of vehicle in motion based on nonlinear curve-fitting. IET Science, Measurement & Technology, 1, 4, 185–190.
[11] A Žnidarič, J Kalin, M Kreslin, M Mavrič, et al. 2016. Recent advances in bridge WIM technology. In Proc. 7th International Conference on WIM.
[12] Hui Zou and Yuhong Yang. 2004. Combining time series models for forecasting. International Journal of Forecasting, 20, 1, 69–84.
Comparison of Feature- and Embedding-based Approaches for Audio and Visual Emotion Classification

Sebastijan Trojer (st5804@student.uni-lj.si), Jožef Stefan Institute and Faculty of Computer and Information Science, Ljubljana, Slovenia
Zoja Anžur (zoja.anzur@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Mitja Luštrek (mitja.lustrek@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Gašper Slapničar (gasper.slapnicar@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.6883

Abstract

This paper presents a comparative analysis of feature- and embedding-based approaches for audio-visual emotion classification. We compared the performance of traditional handcrafted features (MediaPipe for visual features and Mel-frequency cepstral coefficients (MFCCs) for audio features) against neural network (NN)-based embeddings obtained from pretrained models suitable for emotion recognition (ER). The study employs separate uni-modal datasets for the audio and visual modalities to rigorously assess the performance of each feature set on each modality. The results demonstrate that, in the case of visual data, NN-based embeddings significantly outperform handcrafted features in terms of accuracy and F1 score when training a traditional classifier; for audio data, however, the performance is similar across all feature sets. Handcrafted features, such as facial blendshapes computed from MediaPipe keypoints and MFCCs, remain relevant in resource-constrained settings due to their lower computational demands. This research provides insights into the trade-offs between traditional feature extraction methods and modern deep learning techniques, offering guidance for the development of future emotion classification systems.

Keywords

emotion recognition, embeddings, hand-crafted features

1 Introduction

Automated emotion recognition (ER) often focuses on two modalities: video and audio.
This is akin to human emotion recognition, as we heavily rely on audio-visual characteristics, such as facial expressions and audio cues, to deduce emotional state [7]. Both audio and video are relatively simple to obtain, as the required sensors are unobtrusive and easily available (e.g., web-cameras), and they can be used to train machine learning (ML) models for emotion recognition.

In the past decade, deep-learning (DL) approaches have achieved state-of-the-art (SOTA) results in many domains, including emotion recognition [16]. However, despite the superior performance of such models, many doubts have been cast on their black-box nature, which lacks explainability and interpretability of the internally derived features [9]. Furthermore, while some research suggests superior performance of embeddings compared to traditional features [20], this is not universally agreed upon [8], especially when taking into account the potentially much higher computational complexity of deriving embeddings with deep artificial neural networks (ANNs).

Our research question is thus whether it is better to compute embeddings using SOTA pretrained DL models instead of using hand-crafted features, as ANN embeddings promise to increase detection accuracy at the cost of interpretability and computational complexity. In this work, we compared the performance of hand-crafted features and embeddings obtained with pretrained SOTA models for the downstream task of emotion recognition. We independently compared ER performance on the audio and video modalities, using established benchmark datasets for each. The hand-crafted features were chosen based on the literature, and the embeddings were computed with task-suitable pretrained models available in existing Python libraries. Both were formatted in a way that allowed us to train a set of traditional ML models, listed in Section 3.3, for ER, using the hand-crafted features, the embeddings, or a union of both as inputs.

2 Related Work

The performance comparison of hand-crafted features and learned embeddings has been discussed in depth in the computer-vision domain. Schonberger et al. [15] demonstrated that hand-crafted features (e.g., SIFT) still perform on par with or better than learned embeddings in image reconstruction. They warned of high variance across datasets when using learned embeddings as features. Similarly, Antipov et al. [2] reported similar performance of hand-crafted features (e.g., HOG) and learned embeddings when classifying pedestrian gender from images using small datasets. They also highlighted the superior generalization performance of embeddings across (unseen) datasets. In emotion recognition from audio, Papakostas et al. [13] compared hand-crafted MFCC-based features with embeddings from a custom convolutional neural network (CNN) trained on spectrograms. The latter slightly outperformed the hand-crafted features, by 1% on average in terms of F1 score, again showing similar performance. Ye et al. [21] recently showed that using a union of both hand-crafted features and learned embeddings achieves superior performance in user identification, compared to using each input individually.

There is moderate (but not universal) agreement in recent literature that the performance of hand-crafted features and learned embeddings is similar; however, most work comparing their performance is limited to a single modality or task. We compared performance between two different modalities for the task of ER and investigated potential performance improvements from feature-level fusion (hand-crafted + embeddings).

3 Methodology

Our task consisted of two parts: hand-crafted feature and embedding computation, and ER model training for classification. Both were done on the (separate) audio and visual modalities and will be described per modality in the following sections.

3.1 Datasets

As mentioned previously, the ER task is most often audio-visual, so we decided to use an audio and a visual dataset to independently evaluate the performance of the different feature sets. While many datasets exist that contain both modalities, they often suffer from imprecise, coarse emotion labelling [18], as the labels are video-based, while emotions can be exhibited and changed in much shorter windows. Splitting a video into frames yields a large number of (different) instances with the same label, so we wanted datasets with individual labels. As our focus was on comparing the performance of hand-crafted and embedding-based features, we chose two well-established benchmark datasets dedicated to audio and visual emotion classification. These datasets contain short audio clips and individual images with precise short-term and per-frame labels, circumventing the mentioned per-video labelling problem.

3.1.1 Audio Dataset. For the evaluation on audio data, we decided to use the crowd-sourced emotional multimodal actors dataset (CREMA-D) [4]. It contains short clips of 91 actors between the ages of 20 and 74, coming from a variety of races and ethnicities, who exhibited six different emotions (Anger, Disgust, Fear, Happy, Neutral, Sad). Each actor produced about 80 clips (with small variation), saying specific sentences while exhibiting different emotions. The distribution of labels was balanced, with each class representing approx. 16% of the data. The intended emotions were verified with 2,443 crowd-sourced human raters as a baseline. These raters predicted emotions based on audio only, video only, or both, achieving 40.9%, 58.2%, and 63.6% recognition of the intended (acted) emotion, respectively.

3.1.2 Visual Dataset. For visual data, we chose the extended Cohn-Kanade dataset (CK+) [11], a staple dataset in ER evaluation from facial expressions. It contains images of 118 adults, aged between 18 and 50, again of different ethnicities. Participants were instructed to perform a series of 23 facial displays relating to one of seven emotions (Anger, Contempt, Disgust, Fear, Happy, Sad, Surprise). The distribution of classes in CK+ is not balanced: Surprise is the majority class at 25% and Contempt the minority class at 6%, with the others in between. This distribution also changes between subjects. As part of preprocessing, the CK+ images were reshaped to 48x48 pixels, converted to grayscale, and cropped using a frontal-face Haar cascade classifier [1]. The emotion labels were validated by experts via facial action unit rules (e.g., Happy requires Action Unit 12, the lip corner puller, to be active).

3.2 Feature Computation

For the selection of hand-crafted features, we relied on the literature and previous work in ER for each modality. For the embeddings, on the other hand, we chose SOTA pretrained models trained for related tasks. We extracted embeddings at a model-specific point before the learning layers and formatted them using principal component analysis (PCA) in order to reduce their dimensionality while maintaining the relevant information.

3.2.1 Audio Features. MFCCs are historically well-established in ER from audio [10], as they give a good approximation of the human auditory system's response. For each audio clip, we computed a common set of statistical aggregate features (averages, standard deviations) for MFCCs, Root Mean Square (RMS) energy (volume), Zero-Crossing Rate, Spectral Bandwidth, Spectral Contrast, and Spectral Roll-off, using the librosa Python library.
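A minimal sketch of this feature set using the librosa library named above; the number of MFCCs and the file name are illustrative assumptions.

```python
import librosa
import numpy as np

def audio_features(path):
    """Clip-level means and standard deviations of frame-level descriptors."""
    y, sr = librosa.load(path, sr=None)
    feats = {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        "rms": librosa.feature.rms(y=y),
        "zcr": librosa.feature.zero_crossing_rate(y),
        "bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr),
        "contrast": librosa.feature.spectral_contrast(y=y, sr=sr),
        "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr),
    }
    # Aggregate each frame-level descriptor into clip-level statistics.
    return np.concatenate(
        [np.r_[f.mean(axis=1), f.std(axis=1)] for f in feats.values()])

x = audio_features("clip.wav")  # one fixed-length feature vector per clip
```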
3.1.1 Audio Dataset. For evaluation on audio data we used the crowd-sourced emotional multimodal actors dataset (CREMA-D) [4]. It contains short clips of 91 actors between the ages of 20 and 74, coming from a variety of races and ethnicities, who exhibited six different emotions (Anger, Disgust, Fear, Happy, Neutral, Sad). Each actor produced about 80 clips (with small variation), saying specific sentences while exhibiting different emotions. The distribution of labels was balanced, with each class representing approx. 16% of the data. The intended emotions were verified with 2,443 crowd-sourced human raters as a baseline. These raters predicted emotions based on audio only, video only, or both, achieving 40.9%, 58.2% and 63.6% recognition of the intended (acted) emotion, respectively.

3.1.2 Visual Dataset. For visual data we chose the extended Cohn-Kanade dataset (CK+) [11], which is a staple dataset in ER evaluation from facial expressions. It contains images of 118 adults, aged between 18 and 50, again of different ethnicities. Participants were instructed to perform a series of 23 facial displays, relating to one of seven emotions (Anger, Contempt, Disgust, Fear, Happy, Sad, Surprise). The distribution of classes in CK+ is not balanced – Surprise is the majority class at 25% and Contempt the minority class at 6%, with the others in between. This distribution also changes between subjects. As part of preprocessing, CK+ images were reshaped to 48×48 pixels, converted to grayscale and cropped using a frontal-face Haar cascade classifier [1]. The emotion labels were validated by experts via facial action unit rules (e.g., Happy: action unit 12, the lip corner puller, must be active).

3.2 Feature Computation

For the selection of hand-crafted features we relied on the literature and previous work in ER for each modality. For embeddings, on the other hand, we chose SOTA pretrained models trained for related tasks. We extracted embeddings at a model-specific point before the learning layers, and formatted them using principal component analysis (PCA) in order to reduce their dimensionality while maintaining the relevant information.

3.2.1 Audio Features. MFCCs are historically well-established in ER from audio [10], as they give a good approximation of the human auditory system's response. For each audio clip, we computed a common set of statistical aggregate features (averages, standard deviations) for MFCCs, Root Mean Square (RMS) energy (volume), Zero-Crossing Rate, Spectral Bandwidth, Spectral Contrast, and Spectral Roll-off, using the librosa Python library.

For embeddings, we investigated models pretrained on similar audio tasks (e.g., emotion recognition) and used them up to the point where embeddings are available, which typically means the upper part of the ANN architecture, responsible for computing the embeddings that represent the features. Three pretrained models were investigated in our evaluation, all based on the wav2vec2 architecture, a self-supervised model for learning speech representations proposed by Facebook AI Research (FAIR) [3]. The full wav2vec2 pretraining framework comprises a latent feature encoder, a context network using the transformer architecture, a quantization module and a contrastive loss (the pre-training objective). For our purposes the feature encoder is important: a 7-layer 1D CNN reducing the dimensionality of audio inputs into a sequence of feature vectors. The initial model version was pretrained on the LibriSpeech dataset, another version was fine-tuned on the IEMOCAP dataset specifically for ER, and finally a large general cross-lingual model (XLSR) was trained on millions of hours of unlabeled audio data in 53 (later extended) languages [5]. These three variants were used to extract their corresponding embeddings. Since the input data from CREMA-D is of inconsistent shape (varying by < 1 sec), we had to employ an additional adaptive average pooling layer to ensure consistently shaped outputs. We designed this pooling layer so that we lost minimal information (a short segment length for pooling), and the outputs were then flattened. PCA was employed to subsequently reduce them to 10 dimensions. The number of dimensions was chosen arbitrarily and could be changed; however, we believe that 10 dimensions offer a good balance between retained information and computational (and spatial) requirements. Moreover, this number of PCA components is on the same order of magnitude as the number of hand-crafted features, making the two more comparable.
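To make the audio pipeline concrete, the following is a minimal sketch of both feature types. It is illustrative rather than the authors' code: the number of MFCCs (13), the pooled sequence length (64) and the specific Hugging Face checkpoint are our assumptions, as the paper does not state them.

```python
import librosa
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

def handcrafted_audio_features(path):
    """Statistical aggregates of the spectral descriptors listed in Sec. 3.2.1."""
    y, sr = librosa.load(path, sr=16000)
    descriptors = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),   # 13 MFCCs is an assumption
        librosa.feature.rms(y=y),                      # RMS energy (volume)
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
    ]
    # Mean and standard deviation over time for every descriptor row.
    return np.concatenate([np.hstack([d.mean(axis=1), d.std(axis=1)])
                           for d in descriptors])

# LibriSpeech-pretrained wav2vec2 as one of the three investigated variants.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
pool = torch.nn.AdaptiveAvgPool1d(64)  # fixed output length; 64 is an assumption

def wav2vec2_embedding(path):
    """Variable-length clip -> fixed-size flattened embedding."""
    y, sr = librosa.load(path, sr=16000)
    inputs = extractor(y, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, T, 768)
    pooled = pool(hidden.transpose(1, 2))                      # (1, 768, 64)
    return pooled.flatten().numpy()

# PCA to 10 components, fitted on training clips only, e.g.:
# from sklearn.decomposition import PCA
# emb = np.stack([wav2vec2_embedding(p) for p in train_paths])
# emb10 = PCA(n_components=10).fit_transform(emb)
```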
3.2.2 Visual Features. For visual features, we focused on the movement of specific facial keypoints, such as the corners of the mouth and the eyebrows, which form the basis of the Facial Action Coding System (FACS) – a taxonomy that categorizes human facial expressions based on muscle movements [6]. We employed the MediaPipe (MP) framework [12] to extract values representing the activation of various facial blendshapes, which correspond approximately to the regions defined in FACS. In this paper, we classify MediaPipe features as "handcrafted" because, despite being neural network-based, they quantify predefined facial areas with human-interpretable metrics. This contrasts with CNN-based embeddings, which capture patterns without direct interpretability.

For comparison, we used embeddings from two pretrained models: FaceNet [17] and EfficientNet [19] from the HSEmotion library [14]. The FaceNet architecture is based on GoogleNet, a variant of deep CNN, and is trained using a triplet loss. It was optimized for facial recognition, verification, and clustering. EfficientNet comprises several inverted-bottleneck convolutional residual blocks. It achieved SOTA results on the AffectNet ER dataset, while being relatively light-weight. Again, PCA was used to reduce the embeddings to 10 dimensions.
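A minimal sketch of the blendshape extraction with the MediaPipe Tasks API follows. It is one plausible way to obtain these values under the current Tasks interface; the model bundle path and input image are placeholders, and the paper does not specify which MediaPipe interface the authors used.

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

# The face_landmarker.task bundle is downloaded separately (placeholder path);
# with output_face_blendshapes=True it yields ~52 named activation scores.
options = vision.FaceLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("face.png")  # placeholder input image
result = landmarker.detect(image)

# One interpretable activation per named blendshape, e.g. 'mouthSmileLeft'.
blendshapes = {b.category_name: b.score for b in result.face_blendshapes[0]}
```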
3.2.3 Computational and Spatial Requirements. In order to have a clear overview of the trade-off between the computational and spatial requirements of each feature computation method and its classification performance, discussed in the next section, we first report the average time to compute and the disk size of the output (per instance) for each method in Table 1.

Table 1: Average time and disk space needed for feature computation using each method.

Modality | Feature method       | Avg. time | Avg. space
---------+----------------------+-----------+-----------
Audio    | MFCC stats           | 19 ms     | < 1 kB
Audio    | wav2vec2 LibriSpeech | 99 ms     | 194 kB
Audio    | wav2vec2 XLSR        | 274 ms    | 258 kB
Audio    | wav2vec2 IEMOCAP     | 101 ms    | 5 kB
Video    | MediaPipe            | 10 ms     | < 1 kB
Video    | FaceNet              | 29 ms     | 2 kB
Video    | EfficientNet         | 2 ms      | 5 kB

When interpreting the results in Table 1, it must also be considered that the DL-based methods require additional computational time for the PCA applied on top of the raw embeddings.

3.3 Emotion Classification

Data splitting is a crucial step in the evaluation of ML models, as it must be done in a way that avoids overfitting and provides a robust evaluation of a model's generalization capabilities. The aim of this research was primarily not to evaluate the absolute performance of ER, but rather to compare the performance of hand-crafted vs. embedding features. It was therefore crucial to ensure that the same data splits and models were used in each experiment, for each of the compared inputs. We opted for the most robust option, leave-one-subject-out (LOSO) evaluation, always using default model hyperparameters. Such an experimental setup minimized overfitting and also gave a good overview of the generalization performance of the emotion classifiers.

4 Experiments and Results

The outputs of the previous step were used as inputs (features) to train a traditional ML model for emotion classification. We evaluated several options: taking the 10 PCA components of the embeddings obtained from each pretrained model as inputs, taking only hand-crafted features as inputs, and taking the union of both as input. Each of these cases was evaluated for the audio and visual modality separately, using the LOSO experimental setup. Several popular ML models were compared (with default hyperparameters), including k-Nearest Neighbours (kNN), Random Forest (RF), Support Vector Machines (SVM) with linear kernel, and eXtreme Gradient Boosting (XGB). We monitored classification accuracy and macro F1 score as metrics of model performance. All results were compared with a baseline majority classifier and are reported as averages across all iterations of LOSO, where the majority class was always taken from the training data (all subjects except the left-out one).

4.1 Audio Emotion Classification

As mentioned in Section 3, we investigated the following options as feature inputs:

(1) Hand-crafted statistical features relating to MFCCs
(2) 10-component PCA of wav2vec2 embeddings from a model trained on LibriSpeech
(3) 10-component PCA of wav2vec2 embeddings from a model trained on IEMOCAP
(4) 10-component PCA of wav2vec2 embeddings from a cross-lingual XLSR model
(5) Union of hand-crafted features and the best-performing embeddings (from above)

These were compared in experiments as described in Section 3.3, using a set of four ML models. Results of the best-performing model for each feature set in terms of accuracy and F1 are given in Table 2. Fused data was acquired by concatenating the feature sets.

Table 2: Best-performing models for each feature set and corresponding accuracy and F1 scores for audio data. Note that embeddings were represented with 10 components obtained from PCA.

Feature set          | Best model | Accuracy  | F1 score
---------------------+------------+-----------+----------
N/A                  | Majority   | 0.17±0.00 | 0.05±0.00
MFCC stats           | RF         | 0.46±0.08 | 0.43±0.09
wav2vec2 LibriSpeech | SVM        | 0.47±0.08 | 0.45±0.09
wav2vec2 XLSR        | SVM        | 0.30±0.05 | 0.27±0.05
wav2vec2 IEMOCAP     | SVM        | 0.47±0.08 | 0.42±0.09
MFCC + best wav2vec2 | SVM        | 0.52±0.09 | 0.50±0.10
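The evaluation protocol just described maps directly onto scikit-learn's LeaveOneGroupOut splitter. The sketch below is a minimal illustration, assuming a feature matrix X, integer-encoded labels y and a per-instance array of subject IDs; feature-level fusion is plain concatenation, as noted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

def loso_evaluate(X, y, subjects, model):
    """LOSO evaluation: mean/std of accuracy and macro F1 across left-out subjects."""
    accs, f1s = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred, average="macro"))
    return np.mean(accs), np.std(accs), np.mean(f1s), np.std(f1s)

# Default hyperparameters throughout, as in Section 3.3.
models = {
    "kNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(kernel="linear"),
    "XGB": XGBClassifier(),
}
# Feature-level fusion by concatenation (hypothetical array names):
# X_fused = np.hstack([X_handcrafted, X_embeddings_pca])
# for name, m in models.items():
#     print(name, loso_evaluate(X_fused, y, subjects, m))
```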
4.2 Image Emotion Classification

To stay consistent with the audio experiments, we performed the same LOSO experiments described in Section 3.3. We compared model performances using the following features as inputs:

(1) MediaPipe blendshapes
(2) 10-component PCA of FaceNet embeddings
(3) 10-component PCA of EfficientNet embeddings
(4) Union of MP blendshapes and FaceNet embeddings
(5) Union of MP blendshapes and EfficientNet embeddings

Accuracy and F1 scores for the best-performing models for each set of features are again reported in Table 3.

Table 3: Best-performing models for each feature set and corresponding accuracy and F1 scores for visual data. Note that embeddings were represented with 10 components obtained from PCA.

Feature set              | Best model | Accuracy  | F1 score
-------------------------+------------+-----------+----------
N/A                      | Majority   | 0.25±0.00 | 0.40±0.00
MediaPipe                | RF         | 0.62±0.28 | 0.51±0.29
FaceNet                  | SVM        | 0.45±0.30 | 0.36±0.30
EfficientNet             | RF         | 0.93±0.16 | 0.90±0.20
MediaPipe + FaceNet      | XGB        | 0.70±0.28 | 0.60±0.29
MediaPipe + EfficientNet | XGB        | 0.93±0.17 | 0.90±0.21

4.3 Discussion

From Tables 2 and 3 we can observe that for audio the best performance is achieved with the union of hand-crafted and embedding features, while for visual ER the performance of embeddings alone and of the union is nearly identical. The improvement from the feature union is thus generally small: for visual data we get the same result as with only the best embeddings (a 1% difference in standard deviation), while for audio data the improvement in both metrics is about 5% compared to the individual feature sets. All results substantially outperform the baseline majority classifiers.

For audio data we can see that the best embedding set (wav2vec2 LibriSpeech) performs nearly the same as the hand-crafted features (MFCC stats), which is in agreement with some literature [13]. It is surprising that the LibriSpeech embeddings slightly outperform the IEMOCAP ones, since the latter were trained specifically for emotion recognition, while the former were not. The subpar performance of XLSR is expected, since it is a more general cross-lingual unsupervised model, while the investigated data is spoken English. For visual data, on the other hand, the best embeddings (EfficientNet) substantially outperform the hand-crafted facial expression features (MediaPipe) and those obtained from FaceNet. This is expected, as EfficientNet was trained specifically for emotion recognition, while FaceNet was trained for face recognition.
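For reference, FaceNet-style embeddings such as those compared above can be extracted as sketched below. The paper does not name its FaceNet implementation; we use the facenet-pytorch package as one plausible choice (512-dimensional embeddings from 160×160 face crops), followed by the same 10-component PCA.

```python
import numpy as np
import torch
from facenet_pytorch import InceptionResnetV1
from sklearn.decomposition import PCA

facenet = InceptionResnetV1(pretrained="vggface2").eval()  # 512-d face embeddings

def facenet_embedding(face_rgb):
    """face_rgb: float32 array of shape (160, 160, 3), roughly standardized."""
    tensor = torch.from_numpy(face_rgb).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        return facenet(tensor).squeeze(0).numpy()

# Fit PCA on training faces only, then transform both splits, e.g.:
# train_emb = np.stack([facenet_embedding(f) for f in train_faces])
# pca = PCA(n_components=10).fit(train_emb)
# X_train = pca.transform(train_emb)
```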
In terms of ML models, we consistently observed the best performance of SVM for ER from audio data, while for video data the best model was not as homogeneous. Importantly, the performance of different models (RF, SVM and XGB) was often within 1%.

Another important observation is the relative stability of results across subjects when classifying from audio, with standard deviations around 8%. The same was not observed in the evaluation from visual data, where much higher standard deviations indicate lower stability and greater variation between subjects.

To address our initial research question, we observed similar performance of hand-crafted features and embeddings from SOTA DL models for audio-based ER, with the union of both achieving the best results. Image-based visual ER achieved much better performance with learned embeddings as inputs, while the union of features showed no improvement. However, the cost of hand-crafted features and embeddings, in terms of the computational power required to compute them and the space required to store them, is not the same. While hand-crafted features are usually computed quickly and represented with a few numbers, as reported in Table 1, the embeddings require loading a (commonly large) pretrained ANN, which performs a large number of matrix multiplications, resulting in high-dimensional embeddings (e.g., 64×512). This in turn requires additional dimensionality reduction, such as the PCA employed in this work. Our results indicate that for image-based visual ER the additional cost is worthwhile, due to large improvements in performance, while audio-based ER achieved a much smaller improvement, making the use of embeddings from pretrained models less attractive.

Finally, hand-crafted features mostly offer direct interpretability (e.g., audio loudness), while embeddings are commonly black-box in nature, lacking explainability without suitable mechanisms on top. The clear meaning of hand-crafted features can be helpful when training traditional ML models, where feature importances can be compared and subsequently interpreted.

5 Conclusion

In summary, we compared using hand-crafted features, embeddings of pretrained SOTA models, or the union of both as inputs for ER models using audio and visual data. We found that the embedding-based approach is substantially superior with visual data, outweighing the computational cost – the latter is in fact the lowest when using EfficientNet. For audio data, an improvement was only seen with the union of inputs, and it was relatively small.

As future work it would be worthwhile to compare merged audio-visual features and embeddings in a single ER problem on the same dataset having both modalities. Furthermore, all data used here was simulated/acted, so the interpretation of these results must take that into account. The numbers are expected to decrease on a more realistic dataset, as emotions in everyday life are quite subtle [18]. It would thus make sense to run similar experiments on more realistic data as well, although such data is more scarce.

Acknowledgements

This work was supported by the bilateral Weave project, funded by the Slovenian Agency of Research and Innovation (ARIS) under grant agreement N1-0319, and by the Swiss National Science Foundation (SNSF) under grant agreement 214991.
References

[1] Shahad Salh Ali, Jamila Harbi Al' Ameri, and Thekra Abbas. 2022. Face detection using Haar cascade algorithm. In 2022 Fifth College of Science International Conference of Recent Trends in Information Technology (CSCTIT), 198–201. doi: 10.1109/csctit56299.2022.10145680.
[2] Grigory Antipov, Sid-Ahmed Berrani, Natacha Ruchaud, et al. 2015. Learned vs. hand-crafted features for pedestrian gender recognition. In Proceedings of the 23rd ACM International Conference on Multimedia, 1263–1266.
[3] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, et al. 2020. Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
[4] Houwei Cao, David G Cooper, Michael K Keutmann, et al. 2014. CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5, 4, 377–390.
[5] Alexis Conneau, Alexei Baevski, et al. 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979.
[6] Paul Ekman and Erika L Rosenberg. 1997. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, USA.
[7] Monica Gori, Lucia Schiatti, and Maria B. Amadeo. 2021. Masking emotions: face masks impair how we read emotions. Frontiers in Psychology, 12, 669432.
[8] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. 2022. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 35, 507–520.
[9] Xuhong Li, Haoyi Xiong, Xingjian Li, et al. 2022. Interpretable deep learning: interpretation, interpretability, trustworthiness, and beyond. Knowledge and Information Systems, 64, 12, 3197–3234.
[10] MS Likitha, Sri Raksha R Gupta, K Hasitha, et al. 2017. Speech based human emotion recognition using MFCC. In 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), 2257–2260.
[11] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, et al. 2010. The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Workshops. IEEE, 94–101.
[12] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, et al. 2019. MediaPipe: a framework for building perception pipelines. https://arxiv.org/abs/1906.08172.
[13] Michalis Papakostas, Evaggelos Spyrou, Theodoros Giannakopoulos, et al. 2017. Deep visual attributes vs. hand-crafted audio features on multidomain speech emotion recognition. Computation, 5, 2, 26.
[14] Andrey Savchenko. 2023. Facial expression recognition with adaptive frame rate based on multiple testing correction. In Proceedings of the 40th International Conference on Machine Learning (ICML), PMLR 202, 30119–30129. https://proceedings.mlr.press/v202/savchenko23a.html.
[15] Johannes L Schonberger, Hans Hardmeier, Torsten Sattler, et al. 2017. Comparative evaluation of hand-crafted and learned local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1482–1491.
[16] Liam Schoneveld, Alice Othmani, and Hazem Abdelkawy. 2021. Leveraging recent advances in deep learning for audio-visual emotion recognition. Pattern Recognition Letters, 146, 1–7.
[17] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. doi: 10.1109/cvpr.2015.7298682.
[18] Gašper Slapničar, Zoja Anžur, Sebastijan Trojer, et al. 2024. Contact-free emotion recognition for monitoring of well-being: early prospects and future ideas. In Intelligent Environments 2024: Combined Proceedings of Workshops and Demos & Videos Session. IOS Press, 58–67.
[19] Mingxing Tan and Quoc V. Le. 2019. EfficientNet: rethinking model scaling for convolutional neural networks. CoRR. http://arxiv.org/abs/1905.11946.
[20] Pawan Kumar Verma, Prateek Agrawal, Ivone Amorim, et al. 2021. WELFake: word embedding over linguistic features for fake news detection. IEEE Transactions on Computational Social Systems, 8, 4, 881–893.
[21] Cuicui Ye, Jing Yang, and Yan Mao. 2024. FDHFUI: fusing deep representation and hand-crafted features for user identification. IEEE Transactions on Consumer Electronics.
Multi-modal Data Collection and Preliminary Statistical Analysis for Cognitive Load Assessment

Ana Krstevska (ana.krstevska2001@gmail.com), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
Sebastjan Kramar (sebastjan.kramar@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
Hristijan Gjoreski (hristijang@feit.ukim.edu.mk), Faculty of Electrical Engineering and Information Technologies, Skopje, Macedonia
Martin Gjoreski (martin.gjoreski@usi.ch), Università della Svizzera italiana (USI), Lugano, Switzerland
Junoš Lukan (junos.lukan@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia
Sebastijan Trojer (st5804@student.uni-lj.si), Department of Intelligent Systems, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Mitja Luštrek (mitja.lustrek@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Gašper Slapničar (gasper.slapnicar@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

To mitigate distractions during complex tasks, ubiquitous computing devices should adapt to the user's cognitive load. However, accurately assessing cognitive load remains a significant challenge. This study presents a sophisticated, multi-modal data collection that can enable accurate estimation of cognitive load using wearable and contact-free devices. A total of 25 participants took part in six cognitive load-inducing tasks, each presented at two levels of difficulty. Simultaneously, physiological and behavioral data were collected from a multi-modal sensory setup including an Empatica E4 wristband, Emteq OCOsense glasses, an eye tracker, a thermal camera, a depth camera and an RGB video camera. Additionally, participants provided subjective measures of cognitive load by completing the standardized NASA Task Load Index (NASA TLX) and Instantaneous Self-Assessment (ISA) questionnaires following each cognitive task. Preliminary statistical analyses were conducted on participant demographics, performance metrics, and the perceived difficulty of tasks, as reported in the completed questionnaires.

Keywords

cognitive load inference, wearable sensors, contact-free unobtrusive sensors

1 Introduction

Human attention is a critical resource that is increasingly targeted by mobile applications, online services, and other forms of digital engagement. In an era of constant connectivity, capturing and retaining user attention has become a primary objective for many technologies. However, as users engage in cognitively demanding tasks, distractions can lead to performance degradation and increased stress. Therefore, to minimize interruptions and maintain productivity, ubiquitous computing systems must become capable of recognizing and adapting to the user's cognitive load in real time.

Cognitive load, defined as the mental effort required to process information and perform tasks, triggers a series of physiological responses in the human body. These responses are largely governed by the activation of the sympathetic nervous system. When cognitive load increases, measurable changes can be observed in physiological markers, including blood pressure, brain activity, eye movements, electrodermal activity (EDA), respiration rate, heart rate variability, etc. Furthermore, changes are also reflected in facial expressions, posture, and other behavioural patterns.
This study seeks to offer a unique multi-modal dataset with a rich set of wearable and unobtrusive sensors to capture the subtle changes that occur with the gradual activation of the sympathetic nervous system. Rather than solely focusing on maximizing accuracy through the use of numerous devices, this approach also aims to identify the minimum set of sensors required to achieve reliable cognitive load assessment. To that end, rich multi-modal data was collected from a myriad of sensors, including wearables (OCOsense glasses and an Empatica E4 wristband) and contact-free unobtrusive sensors such as an advanced eye tracker, a thermal camera, a depth camera, and an RGB video camera. To the best of our knowledge, no prior dataset exists containing such rich multi-modal data obtained with such an elaborate sensory setup.

2 Related Work

The challenge of cognitive load estimation has been extensively studied across various fields. Significant emphasis has been placed on reducing cognitive load in dynamic environments, such as aviation [1]. Recent research has increasingly focused on transitioning from direct measurements, such as electroencephalography (EEG), to indirect methods of cognitive load assessment. For instance, ocular metrics, including pupil diameter and blink rate, have been shown to accurately estimate cognitive load [2, 3, 4]. Additionally, facial temperature variations have been widely correlated with cognitive workload, providing another non-invasive means of assessment [5, 6]. Novak et al. demonstrated that biometric indicators, such as galvanic skin response and skin temperature, can signal increased cognitive load; however, these measures are insufficient to distinguish between varying levels of cognitive load [7]. Wang et al. demonstrated that visual cues, including face pose, eye gaze, eye blinking, and yawn frequency, can serve as indicators of cognitive load [8]. This research aims to address the complexities of cognitive load estimation by integrating a wide range of psychophysiological signals, offering a more comprehensive approach to this task.

3 Experimental Setup

The objective of our data collection was to capture participants' cognitive load under varying levels of difficulty imposed by cognitive load-inducing tasks. The study was conducted in a quiet, temperature-controlled room, with participants tested individually. At the beginning of each session, participants were seated in a comfortable chair in front of a 24" monitor and given instructions about the experiment and their expected role. The Empatica E4 wristband was then fitted to the participant's non-dominant hand, and the OCOsense glasses for emotion recognition were equipped in line with the product instructions.

Data collection was further enriched through the use of unobtrusive sensing technologies, including a Tobii Spark eye tracker (60 frames per second), an Intel RealSense Depth Camera D455 (providing depth data at 30 fps), a Logitech BRIO 4K streaming webcam at 10 fps with HDR and noise-canceling microphones, and a FLIR Lepton 3 thermal camera delivering a full 160×120-pixel thermal resolution at 8 fps. We used this set of devices to continuously monitor participants throughout the recording session. The experimental setup can be observed in Figure 1.

[Figure 1: Experimental setup]

4 Data Collection Protocol

Prior to the experiment, participants completed a brief sleep questionnaire to gather information about their sleep patterns (e.g., hours slept the night before and usual sleep duration) and rated their levels of fatigue and focus on a scale of 1 to 10.

Calibration data for the OCOsense glasses was then recorded by having participants replicate four facial expressions (smiling, frowning, brow raising, and eye squeezing) three times each. Calibration for the eye tracker followed, during which participants tracked a moving dot with their eyes. This calibration aimed to optimize the participant's seating position for accurate eye tracking.
The experiment's main phase involved participants completing cognitive load-inducing tasks that tested three aspects of cognition: attention, memory, and visual perception. For each cognitive domain, two widely recognized tasks were presented, each with two levels of difficulty (easy and difficult). This design allowed for the differentiation of cognitive load levels. Following each category of cognitive tasks, participants engaged in relaxation tasks that were not expected to induce cognitive load, such as meditation with open eyes, listening to music to relieve stress, and passive viewing of aesthetically pleasing images. These tasks provided baseline data for periods of minimal cognitive load.

In summary, each recording session included six cognitive load-inducing tasks (each with two levels of difficulty) and three relaxation tasks, totaling 15 tasks. After each task, participants completed the NASA Task Load Index (NASA TLX) questionnaire, a validated instrument for assessing cognitive load across six dimensions: mental demand, physical demand, temporal demand, performance, effort, and frustration [9]. Each question was rated on a scale of 0 to 100. In this study, the unweighted version of the NASA TLX, known as the Raw NASA TLX, was used. Additionally, participants completed a single-item Instantaneous Self-Assessment (ISA) of workload, providing a subjective measure of the cognitive load induced by the task [10]. These questionnaires served as subjective assessments of cognitive load and as reference points for the difficulty of each task [11].

The tasks were implemented using PsychoPy, an open-source software package commonly used in neuroscience and experimental psychology research [12]. For the attention-related tasks, participants completed the N-back and Stroop tests. In the N-back task, participants were presented with a sequence of letters and asked to determine whether the current letter matched the one presented N trials earlier, with task difficulty increasing as N increases [13]. Participants completed both a 2-back and a 3-back task. In the Stroop test, participants identified whether the word matched the color in which it was written, with the easier version involving two colors (red and blue) and the more difficult version incorporating five colors [14].
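To illustrate how such a task is implemented in PsychoPy, the following is a minimal 2-back sketch; the letter set, stimulus timing, sequence length and response key are our illustrative assumptions rather than the study's exact parameters.

```python
import random
from psychopy import core, event, visual

N = 2                                    # 2-back; the study also used N = 3
letters = [random.choice("BCDFGHJKL") for _ in range(30)]
win = visual.Window(fullscr=False, color="black")
stim = visual.TextStim(win, height=0.2)

hits = 0
for i, letter in enumerate(letters):
    stim.text = letter
    stim.draw()
    win.flip()
    core.wait(1.5)                       # stimulus duration (assumed)
    keys = event.getKeys(keyList=["space"])
    is_target = i >= N and letter == letters[i - N]
    if is_target and keys:
        hits += 1                        # space pressed on an N-back match

print(f"hits: {hits}")
win.close()
core.quit()
```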
Memory-related tasks included a memory game and a question-answering task based on a previously shown image. In the memory game, participants recalled as many words as possible from a set, with the easier version comprising seven words and the more difficult version consisting of 15 words. In the question-answering task, participants focused on an image and then answered questions about it (e.g., remembering the number of particular objects in the image), with the hard version using an image with greater detail.

The visual perception tasks included a "spot the difference" task and a pursuit test. In the "spot the difference" task, participants were presented with two images and asked to identify as many differences as possible within a one-minute time frame. The difficulty of this task varied, with the more challenging version involving an image that contained greater detail compared to the simpler, easier version. The pursuit test required participants to visually track irregularly curved, overlapping lines. As with the "spot the difference" task, the pursuit test was administered at two levels of difficulty. The more difficult version featured a more intricate image, with longer and more tangled lines, as opposed to the less complex image used in the easier version of the task.

5 Statistics

In this section, we present some descriptive demographic and task-related statistics for the participants involved in the experiment. The average age of participants was 29.28 years, with a standard deviation of 8.31. In terms of educational background, the majority of participants (44%) had obtained a Bachelor's degree (BSc), followed by those with a Master's degree (MSc) at 28%. A smaller portion had completed only high school (16%) or had earned a PhD (12%). Additionally, 60% of the participants were male.

We then looked at the descriptive statistics derived from the performance of the participants in each task. These indicate that participants performed consistently well on tasks such as the 2-back task, both versions of the Stroop test, the easy memory task (where participants recalled an average of 5 out of 7 words), the easy version of the "spot the difference" task (with an average detection rate of approximately 90% of all the differences), and both versions of the pursuit test. Notably, participants performed slightly better on the difficult version of the Stroop test, likely due to their increased familiarity with the task. However, performance was lower on the 3-back test (which most participants perceived as highly or extremely difficult), the difficult memory task (with an average recall rate of 39%), and both the easy and difficult question-answering tasks. The difficult version of the "spot the difference" task also showed lower performance, with participants detecting only 25% of the differences on average. Consistent performance among subjects (with low standard deviation) was observed across all tasks except the N-back tasks. Notably, the N-back tasks were always presented first to participants, suggesting that they may have required additional time to adjust to the testing environment and fully engage with the task.
Next, an inferential statistical analysis was performed on the relationship between task scores and various variables of the sleep pattern. To investigate the potential influence of tiredness on performance, responses from the sleep patterns questionnaire were analyzed. A non-parametric Kruskal-Wallis test was performed to determine whether there was a statistically significant difference in overall scores across different levels of tiredness (low, medium, and high). The resulting p-value (0.91) indicated no significant difference in performance between these groups. Thus, tiredness levels did not show a statistically significant impact on performance within a 95% confidence interval.

Similarly, the effect of focus level (low vs. high) on overall performance was examined using a non-parametric Mann-Whitney test. The p-value was 0.12, indicating no statistically significant difference in performance between the low- and high-focus groups at the 5% significance level.

Furthermore, the relationship between hours of sleep the night before the experiment and participant performance was examined using Spearman's correlation. The p-value was 0.42, indicating no statistically significant correlation between overall performance scores and hours of sleep the night before the experiment.

The potential influence of participants' highest education level on overall performance was also investigated. To assess this, a non-parametric Kruskal-Wallis test was conducted. The results (p-value of 0.33) indicated no statistically significant difference in performance scores across different education levels among the participants.

Overall, the small sample size may have constrained the ability to detect significant effects. The limited variability in the sample's educational background and other factors likely contributed to the lack of observed differences, emphasizing the need for a larger, more diverse sample to better understand the impact of these variables on cognitive load performance.

[Figure 2: Reported perceived difficulty per cognitive task]

As shown in Figure 2, participants consistently perceived the difficulty of the two N-back tasks and the difficult version of the "spot the difference" task as somewhat high or high. This suggests a general consensus regarding the difficulty of these tasks. In contrast, the NASA TLX-based perceived difficulty of the remaining tasks exhibited significant variability among participants.

To assess differences in performance across task difficulties and evaluate the potential for differentiating cognitive load using machine learning models, we conducted additional inferential statistical analyses. The Wilcoxon signed-rank test was used to compare participant performance on the easier and more difficult versions of each cognitive task. Statistically significant differences in performance were found between the two difficulty levels for all tasks. For the N-back, "spot the difference", and pursuit tasks, participants performed significantly better on the easier versions, indicating that increased difficulty negatively impacted performance. Conversely, for the Stroop, memory, and question-answering tasks, participants performed better on the more difficult versions.

The statistical analysis conducted in this study provides initial evidence supporting the validity of the data collection protocol, particularly with respect to the selection of tasks and task difficulty levels. The tasks chosen for this experiment varied significantly in terms of their cognitive demands, as reflected by the substantial differences in performance between the easier and more difficult versions of each task. These results indicate that cognitive load and performance are task-specific, and the significant differences observed support the feasibility of using machine learning models to differentiate between varying levels of cognitive load.
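All four reported tests map directly onto scipy.stats. The sketch below shows this mapping on a hypothetical per-participant summary table; the file and column names are invented for illustration.

```python
import pandas as pd
from scipy.stats import kruskal, mannwhitneyu, spearmanr, wilcoxon

df = pd.read_csv("participant_summary.csv")  # hypothetical aggregated results

# Kruskal-Wallis: overall score across tiredness levels (low/medium/high).
groups = [g["score"].values for _, g in df.groupby("tiredness_level")]
print(kruskal(*groups))

# Mann-Whitney U: overall score for low- vs. high-focus participants.
print(mannwhitneyu(df[df.focus == "low"]["score"],
                   df[df.focus == "high"]["score"]))

# Spearman correlation: hours slept the night before vs. overall score.
print(spearmanr(df["hours_slept"], df["score"]))

# Wilcoxon signed-rank: paired easy vs. difficult scores for one task.
print(wilcoxon(df["nback_easy_score"], df["nback_hard_score"]))
```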
6 Conclusion and Future Work

This study employs a novel approach to data collection for cognitive load inference by combining psychophysiological data obtained from a multi-modal sensory setup, including wearable and unobtrusive contact-free sensors. The decision to utilize a diverse set of devices was motivated by the hypothesis that integrating data from multiple sources could provide a more accurate assessment of cognitive load, while also aiming to identify the minimal sensor configuration required to achieve reliable results. This is particularly relevant in dynamic and high-stakes environments, such as driving, where accurate cognitive load assessment could have life-saving implications. To the best of our knowledge, no prior research has incorporated such a comprehensive and multifaceted setup for cognitive load evaluation.

The statistical analyses conducted thus far offer promising validation for the data collection protocol. The selection of tasks and task difficulty levels proved effective in eliciting a range of cognitive load levels, as evidenced by the significant performance differences between task difficulties.

To further enhance the validity of the data collection protocol, several changes could be implemented in potential subsequent collections. Refining task difficulty levels could offer more granularity in cognitive load differentiation, ensuring a clearer distinction between varying levels of cognitive load. Furthermore, increasing the diversity of participants in terms of age, educational background, and other demographic factors is desirable to enhance the generalizability of the findings.

In future work, the collected data will be processed and utilized to train machine learning models aimed at estimating cognitive load. Ground truth for the machine learning models can be derived from various sources, including the perceived task difficulty reported through the standardized questionnaires, the designed difficulty level of the tasks, or the participants' performance on the tasks. These machine learning models will leverage sophisticated ML techniques to effectively integrate and analyze multi-modal data, aiming to enhance the accuracy of cognitive load predictions. We also plan to further expand the dataset with another phase of data collection, offering a rich dataset both in terms of modalities and in terms of participants. The collected dataset will serve as a stepping stone towards robust multi-modal cognitive load assessment, allowing for the creation and benchmarking of ML models, and will be made available to the general public after the collection is finalized.

Acknowledgements

This work was supported by the Jožef Stefan Institute and Università della Svizzera italiana (funded by SNSF through the project XAI-PAC (PZ00P2_216405)).

References

[1] Jonathan Mead, Mark Middendorf, Christina Gruenwald, Chelsey Credlebaugh, and Scott Galster. 2017. Investigating Facial Electromyography as an Indicator of Cognitive Workload. In 19th International Symposium on Aviation Psychology, 377–382.
[2] Muneeb Imtiaz Ahmad, Ingo Keller, David A. Robb, and Katrin S. Lohan. 2020. A framework to estimate cognitive load using physiological data. Personal and Ubiquitous Computing, 27, 2027–2041.
[3] Tobias Appel, Christian Scharinger, Peter Gerjets, and Enkelejda Kasneci. 2018. Cross-subject workload classification using pupil-related measures. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, 4, 1–8.
[4] Tobias Appel, Natalia Sevcenko, Franz Wortha, Katerina Tsarava, Korbinian Moeller, Manuel Ninaus, Enkelejda Kasneci, and Peter Gerjets. 2019. Predicting Cognitive Load in an Emergency Simulation Based on Behavioral and Physiological Measures. In Proceedings of the 21st ACM International Conference on Multimodal Interaction (ICMI), 154–163.
[5] Fangqing Zhengren, George Chernyshov, Dingding Zheng, and Kai Kunze. 2019. Cognitive load assessment from facial temperature using smart eyewear. In Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 657–660.
[6] Yomna Abdelrahman, Eduardo Velloso, Tillman Dingler, Albrecht Schmidt, and Frank Vetere. 2017. Cognitive Heat: Exploring the Usage of Thermal Imaging to Unobtrusively Estimate Cognitive Load. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 33, 1–20.
[7] Klemen Novak, Kristina Stojmenova, Grega Jakus, and Jaka Sodnik. 2017. Assessment of cognitive load through biometric monitoring. In 7th International Conference on Information Society and Technology (ICIST).
[8] Zixuan Wang, Jinyun Yan, and Hamid Aghajan. 2012. A framework of personal assistant for computer users by analyzing video stream. In Proceedings of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, 1–3.
[9] Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Advances in Psychology, 52, 139–183.
[10] Andrew J. Tattersall and Penelope S. Foord. 2007. An experimental evaluation of instantaneous self-assessment as a measure of workload. Ergonomics, 39, 740–748.
[11] Thomas Kosch, Jakob Karolus, Johannes Zagermann, Harald Reiterer, Albrecht Schmidt, and Paweł W. Woźniak. 2023. A Survey on Measuring Cognitive Workload in Human-Computer Interaction. ACM Computing Surveys, 55, 1–39.
[12] Jonathan Peirce, Rebecca Hirst, and Michael MacAskill. 2022. Building Experiments in PsychoPy. Sage Publications.
[13] Michael J. Kane and Andrew Conway. 2016. The invention of n-back: an extremely brief history. The Winnower.
[14] John Ridley Stroop. 1992. Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 121, 15–23.

Predicting Health-Related Absenteeism with Machine Learning: A Case Study

Aleksander Piciga (ap7377@student.uni-lj.si), Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Matjaž Kukar (matjaz.kukar@fri.uni-lj.si), Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia

Abstract

Health-related absenteeism, or sick leave, is a complex issue with significant financial and operational implications for businesses. We present a machine learning approach to predict employee absenteeism in a Slovenian company. The study involved preprocessing and augmenting the dataset by incorporating domain knowledge, and evaluating various machine learning models. Gradient Boosted Regression Trees emerged as the most effective model, significantly outperforming the baseline model, which merely predicted the previous year's absenteeism rate. Key attributes influencing absenteeism were identified, notably including current absenteeism, performance evaluations, and various job-type- and location-related features. The results highlight the potential of machine learning in proactively managing absenteeism and offer recommendations for future research, such as modeling absenteeism as a time series and incorporating additional data sources. We also show that the current data is not detailed and granular enough to further improve the results.
Keywords

absenteeism, data analysis, data augmentation, machine learning

[Figure 1: The increase in absenteeism rate in Slovenia between 2014 and 2022 [8]. We can observe a steady increase throughout the years.]

1 Introduction

Absenteeism — temporary absence from work due to health reasons — is a widespread issue. In Slovenia, it has been on the rise since 2014 (Figure 1), with an average of 56,128 individuals absent daily in 2022, representing approximately 5.91% of the workforce [8]. This carries substantial financial burdens: direct costs like sick pay, and indirect costs from overstaffing, reduced productivity and lower service quality [2]. The complexity of absenteeism, rooted in personal and organizational factors, makes it challenging to predict and manage effectively [10].

Recent years have witnessed a growing interest in leveraging artificial intelligence (AI) and machine learning (ML) to address the absenteeism challenge [5]. Various machine learning techniques, including neural networks, decision trees, random forests, and gradient boosting, have been employed to predict absenteeism and identify its underlying causes [3, 9]. These studies have demonstrated the potential of machine learning in providing valuable insights for proactive absenteeism management.

This paper presents a case study conducted in collaboration with a Slovenian IT company¹ aiming to improve absenteeism prediction and management. The study includes preprocessing and augmenting the company's employee data by incorporating domain knowledge, and evaluating various machine learning models. The findings highlight key attributes influencing absenteeism and offer recommendations for future research and interventions.

The significance of our work extends beyond Company X, offering a blueprint for organizations tackling absenteeism. By showcasing machine learning's efficacy in predicting absenteeism and revealing its drivers, we contribute to the broader field and pave the way for data-driven interventions promoting a healthier, more productive workforce. This aligns with the growing trend of using AI and ML to address complex organizational challenges. Insights from such analyses can aid strategic workforce planning, optimize resource allocation, and ultimately contribute to a more sustainable and resilient organization.

In Section 2 we detail the data and preprocessing, Section 3 outlines the methodology, Section 4 presents the results, and Section 5 discusses the findings and concludes the study.

¹ The company asked to remain anonymous, so it is referred to as Company X.

2 Materials

The data used in our work spanned six years, from 2017 to 2022, and initially comprised 13,798 instances (aggregated employee records) with up to 49 attributes each. They include demographic details, work-related factors, performance evaluations and the current year's absenteeism rate for each employee, but no particulars about sick leave and other personal data.
The initial dataset required substantial preprocessing to prepare it for analysis and machine learning [6]. The data cleaning process involved addressing inconsistencies in attribute values, such as removing extraneous spaces and converting text to lowercase for uniformity. A significant challenge in the dataset was the presence of missing values, denoted by '/'. Their meaning and handling were discussed with a company representative to determine their origins and ensure appropriate treatment. In some cases, missing values were imputed based on the average values of similar instances. For example, missing values in 'Kilometers to work' were attributed to errors in data entry and were imputed using the average value for employees living in the same location and working at the same place. On the other hand, missing values in performance evaluations were due to the employee's absence on evaluation days.

The target variable — the health-related absenteeism rate in the following year — is a continuous variable ranging from 0 to 1. It signifies the proportion of workdays an employee is absent due to health reasons compared to the total number of workdays. The distribution of this target variable is heavily skewed to the right, with most values clustered near zero, indicating that the majority of employees have low absenteeism rates. However, there exist some outliers with extremely high absenteeism rates (Figure 2).

[Figure 2: Log-distribution of the target variable (median 0.0125, 95th percentile 0.2750). Most workers have very little absence, causing a right-tailed distribution with an "outlier" spike on the right.]

The skewed distribution of the target variable has implications for the statistical analysis and machine learning modeling. Therefore, non-parametric statistical tests, such as Spearman's rank correlation and the Kruskal-Wallis test, were employed in EDA and data preprocessing. Additionally, the presence of outliers necessitates careful consideration during model building and evaluation.

The final dataset, comprising 10,347 instances and 42 attributes, serves as the foundation for the subsequent machine learning, where various models are trained to predict absenteeism rates.

3 Methods

The research methodology encompassed a multi-faceted approach, integrating exploratory data analysis, feature engineering, and the application of diverse machine learning models. The ultimate goal was to establish a robust predictive framework for health-related absenteeism, while also ensuring model interpretability to obtain actionable insights.

3.1 Exploratory Data Analysis (EDA)

The initial phase involved a thorough EDA to understand the underlying data distribution, identify potential outliers, and uncover preliminary relationships between attributes and the target variable (absenteeism in the following year). Given the skewed nature of the target variable, visualizations like histograms and box plots were complemented by non-parametric statistical tests. Spearman's rank correlation coefficient was employed to assess monotonic relationships between continuous attributes and the target variable, while the Kruskal-Wallis test was utilized to discern statistically significant differences across groups defined by categorical attributes.

3.2 Data Augmentation/Feature Engineering

The original dataset underwent a series of transformations to enhance its suitability for machine learning. This included data cleaning, handling missing values, and the creation of new attributes based on domain knowledge, statistical analysis, and insights from the EDA. The new attributes included indicators for elevated absenteeism, receipt of bonuses or awards, high and low performance evaluations, and absenteeism rates within the employee's team and job type. External factors, such as average absenteeism rates in the employee's residential and work locations, were also incorporated. The feature engineering process was iterative, involving close collaboration with domain experts to ensure the derived attributes were meaningful and captured relevant aspects of employee behavior and organizational dynamics.
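A minimal pandas sketch of the group-based imputation and two of the derived attributes follows; the file and column names are hypothetical, since the actual schema is not public.

```python
import pandas as pd

df = pd.read_csv("employees.csv", na_values="/")  # '/' marks missing values

# Impute 'Kilometers to work' with the mean over employees who share both
# the residential and the work location, as described in Section 2.
group_km = df.groupby(["home_location", "work_location"])["km_to_work"]
df["km_to_work"] = df["km_to_work"].fillna(group_km.transform("mean"))

# Derived attribute: mean absenteeism within the employee's team.
df["team_absenteeism"] = df.groupby("team")["absenteeism"].transform("mean")

# Derived indicator: absenteeism elevated relative to the company median.
df["elevated_absenteeism"] = (
    df["absenteeism"] > df["absenteeism"].median()
).astype(int)
```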
3.3 Machine Learning Models

Several well-known machine learning models were employed for absenteeism prediction, including Decision Trees, Linear Regression with L1 regularization, K-Nearest Neighbors (KNN), Support Vector Regression (SVR), Gradient Boosted Regression Trees (GBRT), and Random Forest. Hyperparameter optimization was conducted using the Optuna toolkit [1].

3.4 Model Evaluation and Selection

Model evaluation was performed using the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), and the coefficient of determination (R²). The models were trained on past years' data and tested on the subsequent year, with the training set size increasing each year. The MAE provided an intuitive measure of the average prediction error, while the RMSE penalized larger errors more severely. The R² quantified the proportion of variance in the target variable explained by the model. The models were compared against a baseline model that simply predicted the previous year's absenteeism rate, to gauge the added value of the machine learning approach.
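The Optuna-based search of Section 3.3 can be sketched as below, using scikit-learn's GradientBoostingRegressor as a stand-in for the GBRT model (the paper does not name the implementation); the searched hyperparameters and ranges are illustrative, and the random data merely makes the snippet runnable.

```python
import numpy as np
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Stand-in data with the shape of the real problem (42 attributes).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 42)), rng.random(800) * 0.2
X_valid, y_valid = rng.normal(size=(200, 42)), rng.random(200) * 0.2

def objective(trial):
    model = GradientBoostingRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 500),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 2, 8),
    )
    # Train on past years, validate on the held-out subsequent year.
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```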
3.5 Model Interpretation

SHAP (SHapley Additive exPlanations) values [4, 7] were calculated to interpret model predictions and assess attribute importance. SHAP values provide a unified framework for interpreting any machine learning model, quantifying the contribution of each feature to the model's prediction for a given instance. By analyzing the SHAP values, it was possible to identify the most influential attributes and their directional impact on absenteeism.

3.6 Data Splitting

To ensure robust model evaluation and mitigate the risk of overfitting, the dataset was split into training and testing sets in a prequential manner (year after year). The models were trained on the training set and their performance was assessed on the unseen testing set for the subsequent year. This comprehensive methodological framework enabled a systematic exploration of the factors influencing health-related absenteeism and the development of a predictive model to proactively manage this critical issue.

4 Results

The primary objective of our work was to develop machine learning models capable of predicting health-related absenteeism in the subsequent year. The models were evaluated using three key metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²). The baseline model, which simply predicted the previous year's absenteeism, served as a benchmark for comparison (Table 1).

Table 1: Model performance averaged year-over-year.

Model                    | RMSE  | MAE   | R²
-------------------------+-------+-------+------
Random Forest            | 0.107 | 0.052 | 0.344
GBRT                     | 0.108 | 0.051 | 0.333
Linear Regression        | 0.108 | 0.051 | 0.331
Regression Decision Tree | 0.112 | 0.051 | 0.281
KNN                      | 0.121 | 0.057 | 0.173
SVR                      | 0.117 | 0.075 | 0.215
Baseline Model           | 0.121 | 0.051 | 0.156

As we can see, all machine learning models outperform the baseline model in terms of RMSE and R². This indicates their superior ability to explain the variance in the target variable (absenteeism in the following year). While the MAE remains relatively consistent across models, the improvement in RMSE and R² suggests that the models are particularly effective in handling larger deviations in absenteeism predictions.
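The prequential protocol of Section 3.6 and the three metrics can be expressed compactly as below; the data-frame layout, with a year column and a next-year absenteeism target, is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def prequential_evaluate(df, feature_cols, model_factory, target="target_next_year"):
    """Train on all years before t, test on year t, for every year but the first."""
    rows = []
    for t in sorted(df["year"].unique())[1:]:
        train, test = df[df["year"] < t], df[df["year"] == t]
        model = model_factory()
        model.fit(train[feature_cols], train[target])
        pred = model.predict(test[feature_cols])
        rows.append({
            "year": t,
            "MAE": mean_absolute_error(test[target], pred),
            "RMSE": np.sqrt(mean_squared_error(test[target], pred)),
            "R2": r2_score(test[target], pred),
        })
    return pd.DataFrame(rows)

# Hypothetical usage on the prepared frame:
# results = prequential_evaluate(df, feature_cols, GradientBoostingRegressor)
```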
3.6 Data Splitting

To ensure robust model evaluation and mitigate the risk of overfitting, the dataset was split into training and testing sets in a prequential manner (year after year). The models were trained on the training set and their performance was assessed on the unseen testing set for the subsequent year. This comprehensive methodological framework enabled a systematic exploration of the factors influencing health-related absenteeism and the development of a predictive model to proactively manage this critical issue.

4 Results

The primary objective of our work was to develop machine learning models capable of predicting health-related absenteeism in the subsequent year. The models were evaluated using three key metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²). The baseline model, which simply predicted the previous year's absenteeism, served as a benchmark for comparison (Table 1).

Table 1: Model performance averaged year-over-year.

Model                    | RMSE  | MAE   | R²
Random Forest            | 0.107 | 0.052 | 0.344
GBRT                     | 0.108 | 0.051 | 0.333
Linear Regression        | 0.108 | 0.051 | 0.331
Regression Decision Tree | 0.112 | 0.051 | 0.281
KNN                      | 0.121 | 0.057 | 0.173
SVR                      | 0.117 | 0.075 | 0.215
Baseline Model           | 0.121 | 0.051 | 0.156

As we can see, all machine learning models outperform the baseline model in terms of RMSE and R². This indicates their superior ability to explain the variance in the target variable (absenteeism in the following year). While the MAE remains relatively consistent across models, the improvement in RMSE and R² suggests that the models are particularly effective in handling larger deviations in absenteeism predictions.

To establish the statistical significance of the model improvements, we conducted a paired t-test comparing the predictions of each model against the baseline model. All the selected models demonstrated statistically significant improvements (p < 0.05) in RMSE and R²; this ensures that their superior performance is statistically substantiated and not due to chance.
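The sketch below outlines this evaluation protocol: training on all past years, testing on the subsequent year, and pairing the per-employee absolute errors of a model with those of the baseline for the t-test. The column names and the synthetic data are illustrative assumptions, not the study's actual setup.

```python
# Sketch of the prequential evaluation with a paired t-test against the
# previous-year baseline; column names are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def prequential_eval(df, feature_cols):
    for test_year in sorted(df["year"].unique())[1:]:
        train, test = df[df["year"] < test_year], df[df["year"] == test_year]
        model = GradientBoostingRegressor(random_state=0)
        model.fit(train[feature_cols], train["target"])
        pred = model.predict(test[feature_cols])
        truth = test["target"].to_numpy()
        base = test["current_absenteeism"].to_numpy()  # baseline prediction
        # Paired t-test on per-employee absolute errors vs. the baseline.
        _, p = stats.ttest_rel(np.abs(pred - truth), np.abs(base - truth))
        print(test_year, mean_absolute_error(truth, pred),
              mean_absolute_error(truth, base), p)

rng = np.random.default_rng(0)
df = pd.DataFrame({"year": rng.integers(2017, 2022, 600),
                   "current_absenteeism": rng.random(600),
                   "age": rng.integers(20, 60, 600)})
df["target"] = 0.6 * df["current_absenteeism"] + 0.1 * rng.random(600)
prequential_eval(df, ["current_absenteeism", "age"])
```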
4.1 Performance Trends and Impact of Additional Data per Employee

To gain deeper insights into model behavior, we examined the performance trends of the models over the years. Figure 3 illustrates the evolution of MAE for each model.

Figure 3: MAE trend over time (2018–2021) with additional training data from past years, for Random Forest, GBRT, Linear Regression, Regression Decision Tree and the baseline model.

Among the evaluated models, GBRT exhibited the best performance, achieving an MAE of 0.045, RMSE of 0.10, and R² of 0.40 on the latest year's data. These results were statistically significantly better than the baseline model, demonstrating the effectiveness of GBRT in capturing the complex patterns underlying absenteeism.

Figure 3 reveals a general trend of MAE improvement for most models in later years, surpassing the baseline model in the final year. This suggests that the models benefit from the increasing amount of training data available in later years. The RMSE and R² charts (not shown) exhibit almost identical properties. It is clear that ML models profit tremendously from increasing amounts of data, as can be expected.

Given the observed performance gains in later years with larger training sets, we explored the impact of incorporating data from previous years. Figure 4 showcases the change in MAE for the final year when models were trained on data from the past year and the past three years, respectively.

Figure 4: Impact of additional attributes from past years on MAE (base dataset vs. dataset with the past year vs. dataset with the past three years).

The GBRT model exhibited notable improvement with the inclusion of additional data, achieving an MAE of 0.044, RMSE of 0.093, and R² of 0.36. This underscores the value of historical data in enhancing the predictive capabilities of machine learning models for absenteeism and suggests that including even more historical data per employee would be beneficial.

4.2 Interpretability and Additional Insights

Analysis of SHAP values yielded the following key attributes influencing absenteeism:

• Current absenteeism rate
• Performance evaluations
• With respect to the employee's job type and location:
  – Absenteeism rate
  – Proportion of employees with elevated absenteeism
  – Proportion of employees without bonuses

Our findings suggest that absenteeism is influenced by a combination of individual factors (current absenteeism, performance evaluations) and organizational factors (job type, location, bonuses).

Additionally, a rather simple EDA visualisation of the functional grouping of employees was quite surprising (Figure 5). Its interpretation can be quite speculative, possibly related to increased job satisfaction or engagement in certain groups. Another, somewhat surprising finding from EDA is that the COVID-19 pandemic did not significantly influence absenteeism rates in 2020, but it may have in 2021 (Figure 6).

Figure 5: Target variable according to functional partitioning within the company (work fields: Support, Commercial, Technology).

Figure 6: Target variable by year (2017–2021). Note the sharp increase in 2021, possibly attributable to the COVID-19 pandemic.

Finally, a t-SNE visualization of the full dataset shows that employees cannot easily be separated into clusters with similar absenteeism (Figure 7). We can identify some distinct subgroups (like the cluster of red dots on the left); however, most data points are quite intermingled. This suggests that, with our current set of attributes, we should not anticipate a significant improvement in predictive performance.

Figure 7: Data visualized in 2D space with a t-SNE projection. Red dots represent examples with absenteeism in the next year above 0.25. Blue shades depict examples with absenteeism between 0 (light blue) and 0.25 (dark blue).

5 Discussion and Conclusion

Our work successfully demonstrates the application of machine learning to predict health-related absenteeism. The GBRT model's superior performance highlights its ability to capture complex data relationships, outperforming simpler models and the baseline. Also, identifying the key attributes influencing absenteeism, such as current absenteeism, denied bonuses, work type and location, and performance evaluations, provides valuable insights.

The findings align with existing literature highlighting the multifactorial nature of absenteeism. The strong influence of current absenteeism on future absenteeism emphasizes its predictive power, suggesting that past behavior can be a significant indicator of future trends. The negative correlation between performance evaluations and absenteeism suggests that employees with higher evaluations tend to be less absent, potentially due to increased job satisfaction or engagement. The impact of denied bonuses on absenteeism points to the potential role of financial incentives and recognition in influencing employee attendance.

The limitations of our work include the relatively short time span and the potential influence of unmeasured external factors. Future research could address these limitations by: modeling absenteeism as a time series to capture its dynamic nature; incorporating additional data sources such as employee surveys, participation in wellness programs, and (within legal limits) health and personal circumstances data; analyzing absenteeism at a finer granularity (e.g., monthly or daily); exploring the inclusion of employee health records and workplace environmental factors in predictive models; and conducting longitudinal studies to track absenteeism patterns over extended periods.

While the quantitative improvements of the ML model predictions are not overwhelming, the gained insights can enable targeted interventions to reduce absenteeism and promote a healthier workforce. By leveraging ML and data-driven insights, organizations can proactively manage absenteeism, thus improving productivity, financial stability, and employee well-being.

Acknowledgements

The authors sincerely thank Company X for providing the data, domain expertise and several fruitful discussions. The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. P2-209).

References

[1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, 2623–2631. isbn: 9781450362016. doi: 10.1145/3292500.3330701.
[2] M. Bregant, E. Boštjančič, J. Buzeti, M. Ceglar Ključevšek, A. Hiršl, M. Klun, T. Kozjek, N. Tomaževič, and J. Stare. 2012. Izboljševanje delovnega okolja z inovativnimi rešitvami. Združenje delodajalcev Slovenije.
[3] B. Hu. 2021. The application of machine learning in predicting absenteeism at work. In 2021 2nd International Conference on Computing and Data Science (CDS), 270–276. doi: 10.1109/CDS52072.2021.00054.
[4] Y. Meng, N. Yang, Z. Qian, and G. Zhang. 2021. What makes an online review more helpful: An interpretation framework using XGBoost and SHAP values. Journal of Theoretical and Applied Electronic Commerce Research, 16, 3, 466–490. doi: 10.3390/jtaer16030029.
[5] I. H. Montano, G. Marques, S. G. Alonso, M. López Coronado, and I. de la Torre Díez. 2020. Predicting absenteeism and temporary disability using machine learning: A systematic review and analysis. Journal of Medical Systems, 44, 9, (Aug. 2020), 162. doi: 10.1007/s10916-020-01626-2.
[6] A. Piciga. 2024. Napovedovanje zdravstvenega absentizma s strojnim učenjem. Bachelor's thesis. Univerza v Ljubljani, Fakulteta za računalništvo in informatiko. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=160413.
[7] E. Štrumbelj and I. Kononenko. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41, 3, (Dec. 2014), 647–665. doi: 10.1007/s10115-013-0679-x.
[8] M. Zaletel, D. Vardič, and M. Hladnik. 2024. Zdravstveni statistični letopis Slovenije 2022. Retrieved June 5, 2024 from https://nijz.si/publikacije/zdravstveni-statisticni-letopis-2022/.
[9] W. Zaman, S. Zaidi, A. I. Abdullah, and B. Touhid. 2019. Predicting absenteeism at work using tree-based learners. In Proceedings of the 3rd International Conference on Machine Learning and Soft Computing. Association for Computing Machinery, 7–11. doi: 10.1145/3310986.3310994.
[10] S. Zupanc. 2011. Absentizem, kolegialnost in obremenjenost posameznikov. Bachelor's thesis. Univerza v Ljubljani. http://www.cek.ef.uni-lj.si/UPES/zupanc1175.pdf.
Puzzle Generation for Ultimate Tic-Tac-Toe

Maj Zirkelbach, mz5153@student.uni-lj.si, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Aleksander Sadikov, aleksander.sadikov@fri.uni-lj.si, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia

Abstract

Ultimate Tic-Tac-Toe is an interesting and popular variant of Tic-Tac-Toe that lacks available resources for improving gameplay skills. In this paper, we present a semi-automatic system for generating puzzles as a part of a larger tutorial application aimed at teaching Ultimate Tic-Tac-Toe. The puzzles are designed to enhance players' tactical and strategic understanding by presenting game scenarios where they must identify correct continuations. To ensure the quality of the generated puzzles, we tested the application with a group of volunteers. The results have shown that the number of solved puzzles positively impacted users' ability to reach higher strength levels but had less of an effect on lower levels.

Keywords

Ultimate Tic-Tac-Toe, puzzle generation, minimax algorithm, tutor application
1 Introduction

For centuries, people have enjoyed playing board games like chess and Go. Over time, these games have led to the development of extensive theory and the accumulation of knowledge, helping players navigate their complexity. Today, advanced artificial intelligence (AI) programs such as AlphaZero [14] surpass even the strongest human players, offering new insights into strategies. However, many lesser-known games have yet to be thoroughly explored, despite their popularity. One such game is Ultimate Tic-Tac-Toe, an advanced version of the classic Tic-Tac-Toe. This game is played on a 3x3 grid of local Tic-Tac-Toe boards, creating a global board (Figure 1a). The goal is to win three local boards in a row, while players must make their moves within specific local boards determined by their opponent's previous move. For example, if a player moves in the top-left corner of a local board, the next player must play on the top-left local board. If the designated board is full or decided, the player can choose any other available board (a minimal code sketch of this forced-board rule is given at the end of this introduction). Despite its apparent simplicity, the game has enough spatial complexity that it cannot currently be solved using brute-force methods.

While there are several online implementations of the game, most focus on building strong AI agents; there is a noticeable lack of resources aimed at teaching and helping players understand the deeper strategies of the game, which could make the learning curve more manageable for new and aspiring players. Thus, we have created an application that addresses the lack of learning tools available for Ultimate Tic-Tac-Toe. This article places particular emphasis on the puzzle generation aspect of our application, which is designed to enhance players' tactical and strategic thinking.

In Section 2 we present the related work, and in Section 3 we detail the technical aspects of the developed application. In Section 4 we present the implemented agents and their approximate strength. In Section 5 we provide a description of the different types of puzzles and the methodology for their construction. In Section 6 we present the evaluation, and we discuss the results in Section 7. Finally, in Section 8 we present the conclusions and give possible extensions and enhancements for future work.
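For readers who prefer code to prose, the following small sketch spells out the forced-board rule described above. The encoding, with local boards and cells both indexed 0–8 in row-major order, is a hypothetical choice for illustration and not the application's internal representation.

```python
# Sketch of the Ultimate Tic-Tac-Toe forced-board rule; the 0-8 row-major
# indexing of boards and cells is an illustrative assumption.
def playable_boards(last_move, closed_boards):
    """Boards the next player may play on.

    last_move:     (board, cell) of the opponent's previous move,
                   or None at the start of the game.
    closed_boards: set of local boards that are already decided or full.
    """
    if last_move is None:
        return set(range(9)) - closed_boards
    _, cell = last_move
    # The cell index of the last move designates the next local board.
    if cell in closed_boards:
        return set(range(9)) - closed_boards  # free choice
    return {cell}

# A move in the top-left cell (0) of any board sends the opponent to the
# top-left local board, unless that board is already decided or full.
print(playable_boards((4, 0), closed_boards=set()))   # -> {0}
print(playable_boards((4, 0), closed_boards={0, 4}))  # free choice among the rest
```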
2 Related Work

There are many implementations of Ultimate Tic-Tac-Toe available online, mostly appearing as mobile games aimed primarily at entertainment and lacking advanced playing agents [12, 9, 10], as well as web and desktop applications developed to create the strongest possible programs [15, 7, 13]. An example of the latter is an agent based on the ideas of the AlphaZero program [14], currently considered one of the strongest players of this game [13]. During the development of this agent, significant strategies were discovered, which were also useful in developing our application. Some researchers have attempted to solve the game theoretically, but the spatial complexity proved too great to allow for a complete solution [5].

It is important to differentiate between the various versions of Ultimate Tic-Tac-Toe. One variant allows the game to continue playing on already-won local boards, which drastically changes the game's dynamics. In this variant, researchers have demonstrated an optimal strategy for the starting player, who can win in 43 moves [1]. Further research has focused on enabling a more balanced game by introducing random opening moves, which reduces the predictability of forced wins [4]. Despite these interesting findings, research on these variants is not so relevant for us, as it does not contribute to the understanding of the main game.

While there is a lack of educational material specific to our game, much can be learned from related fields, such as chess, which has been extensively researched. The paper by Gobet and Jansen [8] describes a scientific approach to learning chess, which includes methods to improve memory, perception, and problem-solving skills in players. In this context, it focuses on the acquisition and organization of knowledge, including both explicit and implicit learning of tactics and strategies. This approach facilitates a deep understanding of games and the development of more effective learning methods.

Chess also offers highly sophisticated practical solutions from which we can learn a great deal. Platforms such as chess.com [2] and lichess.org [11] offer extensive resources and tools for learning chess, especially in the areas of tactics and openings. These platforms allow players to learn through interactive lessons, solving puzzles, and studying various openings, which contribute to a deeper understanding of the game and improve playing skills. This approach has proven extremely effective in helping players master complex strategic and tactical concepts in chess.

On the mentioned platforms, the methods for learning tactics are designed to allow players to solve problems based on concrete game situations, which improves pattern recognition and decision-making abilities in real games. Similarly, learning openings involves demonstrating optimal opening moves and their continuations, helping players develop effective strategies at the beginning of the game.

We have applied similar methods in our Ultimate Tic-Tac-Toe application. For example, adapting approaches for learning tactics can help users improve their recognition and solving of complex situations in the game, while learning openings helps to understand key opening moves and their impact on the further course of the game. By incorporating these methods into our application, we ensured more effective learning processes and improved the overall gaming experience.

3 Application Details

In addition to puzzle-solving, the app offers a comprehensive learning experience through various other features. It includes AI opponents of different difficulty levels, game analysis, and exploration of effective opening strategies, allowing players to refine their understanding in all phases of the game. The user interface ensures smooth navigation between these modes, making the app a versatile tool for both playing and learning Ultimate Tic-Tac-Toe. By integrating these elements, the app serves as a resource for players at all levels, helping them to deepen their understanding and improve their skills.

To reach a broader audience, the application was developed for both Android and Windows, the dominant operating systems in the market [15]. It uses Flutter components to deliver a responsive and user-friendly interface. Local data storage is utilized for user settings, progress, and puzzle data, ensuring efficient performance and data management.

We employed modern technologies and mobile development practices, including state management patterns, to create an easily expandable app for future updates and enhancements. The entire project is hosted on GitHub, though it is not open-source. Test versions of the app for Android and Windows are available on Google Drive: https://drive.google.com/drive/folders/1SnO_mN_ZVa2wXd0OGI07kLiYKQTDHuEe?usp=drive_link, while the Android production version is accessible on the Google Play Store: https://play.google.com/store/apps/details?id=com.uttt_tutor.
4 AI Agents and Rating System

Playing against intelligent agents allows users to refine their skills by competing against various virtual opponents. The application includes nine different agents, each varying in difficulty and gameplay strategies. These agents are designed using Minimax and Monte Carlo Tree Search [3] algorithms, which provide different levels of complexity and depth in move analysis. The agents and their approximate strengths are shown in Table 1.

To better understand the quality of the agents and evaluate user progress, we need to establish a system for measuring their strength. Since Ultimate Tic-Tac-Toe is not widely popular, there is no established system for rating player abilities. Therefore, we decided to use the chess rating system as an approximation for our agents.

The chess rating system is used to measure the playing strength of chess players. The most commonly used system is the Elo rating [6], which predicts the likelihood of one player winning against another based on their ratings:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}},$$

where $E_A$ represents the expected score for player A, $R_A$ is the rating of player A, and $R_B$ is the rating of player B.

Table 1: Approximate agent strengths. Each agent played 100 games (50 as X and 50 as O) against the agent one level lower. The result columns show the number of points each agent earned with each symbol, as well as the total score. A win awarded 1 point, while a draw awarded 0.5 points. The last line shows the results of the strongest freely available agent against level 9; it had the same amount of time to think, and they played 30 games.

Agent                | X    | O    | Combined | Estimated Rating
Confused Chimp - 1   | –    | –    | –        | 1
Goofy Goblin - 2     | 49   | 49   | 98       | 620
Casual Carl - 3      | 41.5 | 35.5 | 77       | 835
Average Joe - 4      | 37   | 25   | 62       | 926
Hustling Hugo - 5    | 39.5 | 34.5 | 74       | 1114
Witty Walter - 6     | 43   | 30   | 73       | 1293
Thinking Tiffany - 7 | 35   | 24   | 59       | 1361
Brainy Bob - 8       | 42.5 | 26.5 | 69       | 1506
Bossman - 9          | 36.5 | 22.5 | 59       | 1574
UTTT AI              | 14.5 | 12.5 | 27       | 1948
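As a quick sanity check of Table 1, the expected-score formula can be evaluated directly; the helper below is a straightforward transcription of the equation above.

```python
# The Elo expected-score formula from above as a small helper.
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Example: Casual Carl (835) vs. Goofy Goblin (620) yields about 0.78,
# in line with the observed 77 points out of 100 in Table 1.
print(round(expected_score(835, 620), 2))
```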
5 Puzzle Description and Methodology

In this section, we describe the different types of puzzles and the methodology employed to generate them for our game.

5.1 Puzzles

The puzzles in the application are divided into tactical and strategic, with each type of puzzle covering different aspects of the game and helping players improve specific skills.

Tactical puzzles are useful for understanding tactical ideas and are particularly applicable in the endgame and middlegame phases. They focus on specific situations that require precise and thoughtful moves, helping players develop the ability to think quickly and effectively. In total, we generated 1,263 tactical puzzles, distributed across five levels. The number of puzzles for each level is shown in Table 2.

Table 2: Number of tactical puzzles on each level.

Level | Puzzle Depth | Quantity
1     | 1            | 273
2     | 3            | 493
3     | 5            | 231
4     | 7            | 176
5     | 9            | 90

Unlike tactical puzzles, strategic puzzles aim to understand the position and long-term plans. They are instrumental in the opening and middlegame, where it is crucial to recognize strategic ideas and develop plans that provide an advantage as the game continues. There are 50 strategic puzzles available, currently arranged in one level, with the possibility of expansion in the future.

5.2 Tactical Puzzle Generation

To generate tactical puzzles, we developed a specialized minimax agent that builds a tree of all possible moves leading to victory from the solver's perspective. A key step in this process is the selection of tree branches to retain only relevant and correct solutions. It is essential to preserve all of the winner's possibilities while limiting the loser's responses to those that make finding a solution as difficult as possible. Therefore, we select the continuation that allows the longest possible game for the loser while leading to the fewest continuations for the winner.

From the tree, we extract all the correct solutions for the given position. For a high-quality puzzle, it must not have too many solutions. The criterion we set is that the number of solutions must be less than the depth of the puzzle. We also decided to discard all puzzles that have multiple correct continuations for the first move. This way, we avoid trivial puzzles that would be too simple. An example of a level 3 tactical puzzle with its generated solution tree is shown in Figure 1.

Figure 1: An example of a tactical puzzle and its generated solution tree. (a) Level 3 tactical puzzle. (b) Solution tree.

The generation of tactical puzzles for different difficulty levels was automated by conducting matches between agents of equal strength, with the search depth of both agents corresponding to the depth of the puzzle we wanted to find. We chose this approach to ensure that the resulting positions were interesting and balanced, as otherwise the stronger side would usually have an overly obvious advantage at the start of the puzzle, which would make it boring to solve.
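The two quality criteria can be summarized as a small filter over the solutions extracted from the tree; the list-of-move-sequences representation below is an illustrative assumption, not the generator's actual data structure.

```python
# Sketch of the tactical-puzzle quality filter described above; the
# move-sequence representation is an illustrative assumption.
def is_quality_puzzle(solutions, depth):
    """Accept a candidate puzzle extracted from the minimax solution tree.

    solutions: all correct move sequences for the winning side.
    depth:     puzzle depth (plies to the forced win).
    """
    if not solutions:
        return False
    if len(solutions) >= depth:       # too many solutions: ambiguous
        return False
    first_moves = {seq[0] for seq in solutions}
    return len(first_moves) == 1      # several correct first moves: too trivial

# A depth-3 puzzle with a single winning line is accepted.
print(is_quality_puzzle([("E5", "A1", "E4")], depth=3))  # True
```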
5.3 Strategic Puzzle Generation

Automating the creation of strategic puzzles is impossible without a program that could interpret the given position and simultaneously provide a human-understandable explanation. Additionally, generating strategic puzzles requires an agent with an advanced strategic understanding of the position, which our agents, using relatively simple heuristics, are incapable of. Therefore, we resorted to the most powerful freely available agent [13], which is based on the ideas of the AlphaZero program.

Thus, we generated the strategic puzzles manually. We searched for interesting and instructive positions that arose in games between the aforementioned agent and our stronger programs. We focused on moments when there was a significant deviation in the position evaluation between the two agents. When the agent with better strategic understanding detected an important change, we saved the given position, studied it more closely, and, based on our understanding of the game, formulated a solution. The most common examples of such situations involved sacrificing the edge board to gain control over the central board. A basic example of this can be seen in Figure 2.

Figure 2: Different interpretations of the same position, based on which we built the strategic puzzle. (a) User interface of the most powerful freely available agent: for the given position, it ran 1000 simulations and assessed the move F2 as the best with an 82% probability; it evaluates the position with a value of +16.85, which means it assigns approximately a 58.4% win probability to player X (a value of 0 means a draw, 100 a win, and -100 a loss). (b) Minimax agent with a search depth of 12: it marks the move F2 as the worst, as it does not recognize the long-term advantage.

6 Evaluation and Results

We conducted a quality analysis of the application with 14 volunteers. Their task was to use the app for an extended period to improve their knowledge of the game. We were interested in determining whether using the app had a positive impact on the development of their Ultimate Tic-Tac-Toe playing skills and whether progress was dependent on motivation or the time spent learning.

To assess individual progress, participants played against the agent at the start of testing to determine their initial skill level. The application then tracked the highest level each user defeated, providing an estimate of their improvement over time. This progress, in relation to the number of puzzles solved, is illustrated in Figure 3. For a more concrete interpretation of the obtained level strengths, refer to Section 4.

Figure 3: Progress in relation to the number of solved puzzles (number of solved puzzles vs. AI level beaten). Each arrow represents a human tester and indicates the change in the achieved level from the beginning to the end of the application's use.
7 Discussion

The results in Figure 3 indicate that solving more puzzles impacted users' ability to reach higher levels, but had less of an effect on lower levels. This is likely due to the fact that beginners can improve relatively quickly by simply playing the game, whereas advanced players require more effort to progress (e.g., it is a lot easier to gain 100 rating points when you are rated 500 than when you are rated 1500).

The reason for this is that at lower ratings there is generally more room for rapid improvement, because the skill gap between players tends to be more pronounced, and beginners can quickly benefit from fundamental knowledge and tactical awareness. As a result, achieving a higher rating is initially easier, as players can fix obvious mistakes and exploit weaker opponents' errors. However, as players reach higher levels, the competition becomes tougher and the differences in skill become more nuanced. Players at this level are more consistent and less likely to make blunders, so improving further requires mastering advanced strategies, pattern recognition, and deeper positional understanding, making progress slower and more challenging. This reflects the diminishing returns on improvement as you climb the rating ladder.

It must also be mentioned that users were free to use any tools within the app during testing, and solving more puzzles did not correlate with longer app usage. For a clearer assessment of puzzle significance, a controlled test focusing solely on puzzle-solving would be more appropriate.

8 Conclusion

In this work, we presented methods for generating puzzles for the game of Ultimate Tic-Tac-Toe. To evaluate the quality of these puzzles, we tracked how the number of solved puzzles impacted individual user progress. Our results indicate a correlation between the number of puzzles solved and the ability to reach stronger AI levels.

However, the evaluation could be refined by focusing exclusively on the puzzle-solving component, isolating it from other functionalities of the application. Additionally, the automation of tactical puzzle generation could be expanded to cover the middlegame phase, rather than being limited to endgame scenarios. Another area of improvement is providing clearer assessments of puzzle difficulty. This could be achieved by implementing a rating system that ranks puzzles based on completion rates, offering a more accurate measure of challenge for each puzzle.

Acknowledgements

The author would like to thank the family and friends who participated in testing the application.

References

[1] Guillaume Bertholon, Rémi Géraud-Stewart, Axel Kugelmann, Théo Lenoir, and David Naccache. 2020. At most 43 moves, at least 29: Optimal strategies and bounds for ultimate tic-tac-toe. doi: 10.48550/ARXIV.2006.02353.
[2] Chess.com. 2024. Chess.com. (June 2024). https://www.chess.com/.
[3] Rémi Coulom. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings Computers and Games 2006. Springer-Verlag.
[4] Justin Diamond. 2022. A practical method for preventing forced wins in ultimate tic-tac-toe. doi: 10.48550/ARXIV.2207.06239.
[5] Nelson Elhage. 2020. Solving ultimate tic tac toe. (July 2020). https://minimax.dev/docs/ultimate/.
[6] Arpad E. Elo. 1978. The Rating of Chessplayers, Past and Present. Arco Pub.
[7] Ofek Gila. 2019. Ultimate tic tac toe. (2019). https://www.theofekfoundation.org/games/UltimateTicTacToe/.
[8] Fernand Gobet and Peter J. Jansen. 2006. Training in chess: A scientific approach. Education and chess.
[9] Henryk. 2023. Ultimate tic tac toe. (Dec. 2023). https://play.google.com/store/apps/details?id=com.henrykvdb.sttt.
[10] HPStudios. 2024. Ultimate tic tac toe. (June 2024). https://play.google.com/store/apps/details?id=com.MertTaylan.UltimateTicTacToe.
[11] Lichess. 2024. Lichess. (June 2024). https://lichess.org/.
[12] Levi Moore. 2020. Ultimate tic tac toe. (Nov. 2020). https://play.google.com/store/apps/details?id=com.ZeroRare.UltimateTicTacToe.
[13] Arkadiusz Nowaczynski. 2021. ar-nowaczynski/utttai: AlphaZero-like AI solution for playing ultimate tic-tac-toe in the browser. (Dec. 2021). https://github.com/ar-nowaczynski/utttai.
[14] David Silver et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362, 6419, 1140–1144. doi: 10.1126/science.aar6404.
[15] Michael Xing. 2019. Ultimate tic-tac-toe. (Oct. 2019). https://www.michaelxing.com/UltimateTTT/v3/.

Ethical Consideration and Sociological Challenges in the Integration of Artificial Intelligence in Mental Health Services¹

Saša Poljak Lukek, sasa.poljaklukek@teof.uni-lj.si, University of Ljubljana, Faculty of Theology, Ljubljana, Slovenia

Abstract

This article explores the transformative potential of artificial intelligence (AI) in the field of mental health, with a particular focus on ethical considerations and social challenges. As AI tools become increasingly sophisticated, their ability to support mental health interventions presents both opportunities and challenges. We discuss the importance of a human-centered approach to AI development and the need for comprehensive ethical guidelines to ensure patient safety and well-being. In addition, this paper explores key social trends, such as the evolving dynamics of modern families, aging populations and migration, and considers how AI can be integrated into these contexts to improve mental health care.

Keywords:
Artificial Intelligence, Mental Health, Human-Centered Approach, Ethics, Modern Family Dynamics, Aging Populations, Migration
1 Introduction

1.1 Artificial intelligence in mental health services

Research on the application of AI in mental health care has shown some positive effects on the treatment of mental health problems [1], including early detection [2,3], providing feedback and personalized treatment plans [4], and the development of novel diagnostic tools [2].

AI in mental health services is implemented through models like chatbots, digital platforms, and avatar therapy, enhancing accessibility and treatment options. Chatbots provide therapy via natural language processing [5], while digital platforms support mostly online cognitive behavioral therapeutic interventions [6]. Avatar therapy uses AI to help patients manage conditions like dementia, autism spectrum disorder, and schizophrenia [7].

1.2 The prospect of artificial intelligence in mental health services

The future orientation underlines the importance of digital health in overcoming challenges such as limited access to services, especially in underserved regions, and outlines measures to ensure equitable access to digital health solutions across the European region [8]. The use of AI in mental health services raises questions about the role of non-human interventions, transparency in the use of algorithms, and the long-term impact on the understanding of illness and the human condition [9]. There are also concerns about potential bias, gaps in ethical and legal frameworks, and the possibility of misuse [10,11].

However, there are at least two potentially positive effects of the use of AI in healthcare: accessibility and personalization of services. AI offers new mechanisms to reach those who might not otherwise be served. AI-supported tools can improve the early detection and diagnosis of mental disorders [12]. AI chatbots have shown promise in increasing referrals to mental health services, especially for minority groups who are blocked from accessing traditional care [13]. These technologies can provide initial assessments, psychoeducation and even treatment, expanding access to mental health support [12]. AI-driven virtual assistants and wearable devices enable continuous monitoring and personalized care, which could improve patient outcomes [11,14].

The integration of artificial intelligence into mental health services represents a promising avenue for the development of personalized treatment plans through the sophisticated analysis of large datasets, enabling the identification of optimal therapeutic strategies tailored to specific client profiles [15,16]. This data-driven methodology enables the dynamic adaptation of therapy to the evolving needs of the client.

¹ This publication is a part of the research program The Intersection of Virtue, Experience, and Digital Culture: Ethical and Theological Insights, financed by the University of Ljubljana.
2 Overcoming Sociological Challenges through the Integration of Artificial Intelligence in Mental Health Services

2.1 Modern Family Dynamics

Modern family trends show that family structures and attitudes have changed significantly in recent decades [17]. There is a growing acceptance of different family forms, including unmarried cohabitation, same-sex relationships and joint custody arrangements [18]. These changes reflect an expansion of developmental idealism and increasing support for individual freedom in family choice [17].

On the other hand, there is a growing need for mental health services for families [19]. As the most vulnerable members of the family, the children, are usually also at risk, quick and effective action in family mental health is of great importance. Many families are struggling with various psychological problems. Together with the changing family structure, this means a great burden for every family member. In addition, access to psychologists, psychiatrists and therapists is limited, leading to an acute shortage of mental health professionals worldwide.

The accessibility of services is probably the strongest argument for the integration of AI in healthcare [12]. AI-powered conversational agents can improve the accessibility of mental health services: they are available online at all times and in underserved areas; they are scalable, reliable, fatigue-free and provide consistent support; they can adapt in a culturally sensitive way; and they can help with education and symptom management.

2.2 Aging Populations

AI offers promising solutions for supporting an aging population, particularly in addressing cognitive decline and mental health challenges. AI applications can monitor vital signs, health indicators, and cognition, as well as provide support for daily activities [20]. With an increasing number of elderly individuals, AI can support mental health care by providing companionship through intelligent animal-like robots (e.g., Paro, a harp seal robot) and assisting in monitoring and managing conditions like dementia [21,22]. AI can also help in tracking cognitive health and providing timely interventions to maintain mental well-being in older adults. These technologies have the potential to enhance independent living and quality of life for older adults and their families.

2.3 Migration

Migrants often face mental health challenges due to displacement, cultural adjustment and language barriers. AI can help migrants access mental health services by providing culturally and linguistically relevant resources and support. Chatbots and AI-driven platforms can bridge gaps in care by providing immediate help and continuity of care across different regions [23].

Recent research highlights the increasing role of digitalization and artificial intelligence (AI) in migration and mobility systems, especially in the context of the COVID-19 pandemic [24]. While these technologies offer opportunities for improving human rights and supporting international development, they also bring challenges that require careful consideration of design, development and implementation aspects. The integration of AI into migration processes requires a focus on human rights at all stages that goes beyond technical feasibility and companies' claims of inclusivity [24].
3 Ethical Consideration in the Integration of Artificial Intelligence in Mental Health Services

One of the main caveats to the use of AI in mental health is the introduction of new ethical standards to ensure user safety. The approach to integrating AI into services should therefore be human-centered [25]. Any innovation should focus on people in their most vulnerable position. It is important to assess all the risks with sufficient accuracy and to avoid misuse of AI as much as possible. The most important areas for ethical consideration when integrating AI into mental health services are privacy, bias, transparency and security.

Data privacy and security are critical in digital healthcare and require robust measures to protect sensitive information and prevent unauthorized access. Protecting privacy rights and ensuring informed consent are critical to maintaining trust and ethical standards in the use of personal health data [11]. Combining multiple data streams increases the risk of unauthorized use, which exacerbates privacy issues. Ensuring informed consent and maintaining transparency, especially in emergency operations, are critical to addressing these ethical concerns and protecting the rights of participants [26].

The use of AI in mental health treatment raises concerns about bias, particularly among marginalized populations who are already discriminated against and lack access to mental health care. It is uncertain whether AI-assisted psychotherapy can effectively address cultural differences and close treatment gaps in diverse populations [27]. In addition, populations that are traditionally marginalized in fields such as psychology and psychiatry are most vulnerable to algorithmic biases in AI and machine learning [27,28]. These biases limit the ability of AI to provide culturally and linguistically appropriate mental health resources, exacerbating existing inequalities. The persistence of such biases in AI systems not only risks increasing health inequalities, but also exacerbates existing social inequalities and raises critical ethical considerations [9].

The future of artificial intelligence in clinical settings is affected by a significant ethical dilemma concerning the trade-off between the performance and interpretability of machine learning models [29]. The lack of transparency in AI models makes it difficult to detect and correct biases. This underscores the need for greater transparency to ensure ethical and fair clinical decision-making.

In summary, the integration of AI into mental health services requires the establishment of strict ethical standards to protect the safety and privacy of users. A human-centered approach is essential, with a focus on dealing with potential bias, especially among marginalized groups, the risks associated with data privacy and security, and the challenges posed by the lack of transparency of AI models.

4 Conclusion

We propose to define AI as a new ethical entity in the field of mental health [30]. AI represents a novel artifact that changes interactions, concepts, epistemic fields and normative requirements. This change requires a redefinition of the role of AI, which lies on a spectrum between a tool and an agent. This shift underscores the need for new ethical standards and guidelines that recognize the unique status of AI as a distinct and influential actor in the field of mental health.

The integration of AI into services can, on the one hand, provide more efficient and faster solutions to some of the sociological challenges of today's society, but, on the other hand, it requires a precise and correct definition of the limits within which these models can be used. These efforts aim to bridge the gap between technology and human-centered care and ensure that AI complements, rather than replaces, the therapeutic benefits of human interaction.

Literature

[1] Sandhya Bhatt. 2024. Digital Mental Health: Role of Artificial Intelligence in Psychotherapy. Annals of Neurosciences, 0, 0, 1–11. doi: 10.1177/09727531231221612.
[2] Sijia Zhou, Jingping Zhao and Lulu Zhang. 2022. Application of Artificial Intelligence on Psychological Interventions and Diagnosis: An Overview. Frontiers in Psychiatry, 13 (March), 1–7. https://doi.org/10.3389/fpsyt.2022.811665.
[3] Klaudia Kister, Jakub Laskowski, Agata Makarewicz and Jakub Tarkowski. 2023. Application of artificial intelligence tools in diagnosis and treatment of mental disorders. Current Problems of Psychiatry, 24, 1–18. https://doi.org/10.12923/2353-8627/2023-0001.
[4] Rachel L. Horn and John R. Weisz. 2020. Can Artificial Intelligence Improve Psychotherapy Research and Practice? Administration and Policy in Mental Health and Mental Health Services Research, 47, 5, 852–855. https://doi.org/10.1007/s10488-020-01056-9.
[5] Kerstin Denecke, Alaa Abd-alrazaq and Mowafa Househ. 2021. Artificial Intelligence for Chatbots in Mental Health: Opportunities and Challenges. In Househ, M., Borycki, E., Kushniruk, A. (eds), Multiple Perspectives on Artificial Intelligence in Healthcare, 115–128. https://doi.org/10.1007/978-3-030-67303-1_10.
[6] Elias Aboujaoude, Lina Gega, Michelle B. Parish and Donald M. Hilty. 2020. Editorial: Digital Interventions in Mental Health: Current Status and Future Directions. Frontiers in Psychiatry, 11, 111. doi: 10.3389/fpsyt.2020.00111.
[7] Kay T. Pham, Amir Nabizadeh and Salih Selek. 2022. Artificial Intelligence and Chatbots in Psychiatry. Psychiatric Quarterly, 93, 1, 249–253. https://doi.org/10.1007/s11126-022-09973-8.
[8] WHO. 2022. Regional digital health action plan for the WHO European Region 2023–2030 (RC72). (July 2022). Retrieved August 20, 2024 from https://www.who.int/europe/publications/i/item/EUR-RC72-5.
[9] Amelia Fiske, Peter Henningsen and Alena Buyx. 2019. Your Robot Therapist Will See You Now: Ethical Implications of Embodied Artificial Intelligence in Psychiatry, Psychology, and Psychotherapy. Journal of Medical Internet Research, 21, 5, e13216. https://doi.org/10.2196/13216.
[10] Elizabeth C. Stade, Shannon Wiltsey Stirman, Lyle Ungar, Cody L. Boland, H. Andrew Schwartz, David B. Yaden, Joao Sedoc, Robert J. DeRubeis, Robb Willer and Johannes C. Eichstaedt. 2024. Large Language Models Could Change the Future of Behavioral Healthcare: A Proposal for Responsible Development and Evaluation. Mental Health Research, 3, 12. https://doi.org/10.1038/s44184-024-00056-z.
[11] David B. Olawade, Ojima Z. Wada, Aderonke Odetayo, Aanuoluwapo Clement David-Olawade, Fiyinfoluwa Asaolu and Judith Eberhardt. 2024. Enhancing mental health with Artificial Intelligence: Current trends and future prospects. Journal of Medicine, Surgery, and Public Health, 3, 100099. https://doi.org/10.1016/j.glmedi.2024.100099.
[12] Koki Shimada. 2023. The Role of Artificial Intelligence in Mental Health: A Review. Science Insights, 43, 5, 1119–1127. doi: 10.15354/si.23.re820.
[13] Max Rollwage, Johanna Habicht, Keno Juechems, Ben Carrington, Sruthi Viswanathan, Mona Stylianou, Tobias U. Hauser and Ross Harper. 2023. Using Conversational AI to Facilitate Mental Health Assessments and Improve Clinical Efficiency Within Psychotherapy Services: Real-World Observational Study. JMIR AI, 2, e44358. https://doi.org/10.2196/44358.
[14] David D. Luxton. 2020. Ethical implications of conversational agents in global public health. Bulletin of the World Health Organization, 98, 4, 285–287. https://doi.org/10.2471/BLT.19.237636.
[15] Leonard Bickman. 2020. Improving Mental Health Services: A 50-Year Journey from Randomized Experiments to Artificial Intelligence and Precision Mental Health. Administration and Policy in Mental Health, 47, 795–843. https://doi.org/10.1007/s10488-020-01065-8.
[16] Silvan Hornstein, Valerie Forman-Hoffman, Albert Nazander, Kristian Ranta and Kevin Hilbert. 2021. Predicting therapy outcome in a digital mental health intervention for depression and anxiety: A machine learning approach. Digital Health, 7, 1–11. doi: 10.1177/20552076211060659.
[17] Josef Ehmer. 2021. A historical perspective on family change in Europe. In Norbert F. Schneider and Michaela Kreyenfeld (eds), Research Handbook on the Sociology of the Family, 143–161. https://doi.org/10.4337/9781788975544.00018.
[18] Keera Allendorf, Linda Young-Demarco and Arland Thornton. 2023. Developmental Idealism and a Half-Century of Family Attitude Trends in the United States. Sociology of Development, 9, 1, 1–32. https://doi.org/10.1525/sod.2022.0003.
[19] WHO. 2022. World mental health report: Transforming mental health for all. (June 2022). Retrieved August 20, 2024 from https://www.who.int/publications/i/item/9789240049338.
[20] Sara J. Czaja and Marco Ceruso. 2022. The Promise of Artificial Intelligence in Supporting an Aging Population. Journal of Cognitive Engineering and Decision Making, 16, 4, 182–193. https://doi.org/10.1177/15553434221129914.
[21] Maria R. Lima. 2024. Home Integration of Conversational Robots to Enhance Ageing and Dementia Care. In ACM/IEEE International Conference on Human-Robot Interaction, 115–117. https://doi.org/10.1145/3610978.3638378.
[22] Wendy Moyle. 2019. The promise of technology in the future of dementia care. Nature Reviews Neurology, 15, 6, 353–359. https://doi.org/10.1038/s41582-019-0188-y.
[23] Zahra Abtahi, Miriam Potocky, Zarin Eizadyar, Shanna L. Burke and Nicole M. Fava. 2022. Digital Interventions for the Mental Health and Well-Being of International Migrants: A Systematic Review. Research on Social Work Practice, 33, 5, 518–529. doi: 10.1177/10497315221118854.
[24] Marie McAuliffe, Jenna Blower and Ana Beduschi. 2021. Digitalization and artificial intelligence in migration and mobility: Transnational implications of the COVID-19 pandemic. Societies, 11, 4, 135. https://doi.org/10.3390/soc11040135.
[25] Luke Balcombe and Diego de Leo. 2022. Human-Computer Interaction in Digital Mental Health. Informatics, 9, 1, 14. https://doi.org/10.3390/informatics9010014.
[26] Nicholas C. Jacobson and Matthew D. Nemesure. 2021. Using Artificial Intelligence to Predict Change in Depression and Anxiety Symptoms in a Digital Intervention: Evidence from a Transdiagnostic Randomized Controlled Trial. Psychiatry Research, 295, 113618. https://doi.org/10.1016/j.psychres.2020.113618.
[27] Bennett Knox, Pierce Christoffersen, Kalista Leggitt, Zeia Woodruff and Matthew H. Haber. 2023. Justice, Vulnerable Populations, and the Use of Conversational AI in Psychotherapy. American Journal of Bioethics, 23, 5, 48–50. https://doi.org/10.1080/15265161.2023.2191040.
[28] Zoha Khawaja and Jean C. Bélisle-Pipon. 2023. Your robot therapist is not your therapist: Understanding the role of AI-powered mental health chatbots. Frontiers in Digital Health, 5, 1278186. doi: 10.3389/fdgth.2023.1278186.
[29] Danilo Bzdok and Andreas Meyer-Lindenberg. 2018. Machine Learning for Precision Psychiatry: Opportunities and Challenges. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3, 3, 223–230. https://doi.org/10.1016/j.bpsc.2017.11.007.
[30] Jana Sedlakova and Manuel Trachsel. 2023. Conversational Artificial Intelligence in Psychotherapy: A New Therapeutic Tool or Agent? American Journal of Bioethics, 23, 5, 4–13. https://doi.org/10.1080/15265161.2022.2048739.
Optimization Problem Inspector: A Tool for Analysis of Industrial Optimization Problems and Their Solutions

Tea Tušar, Jordan N. Cork, Andrejaana Andova, Bogdan Filipič
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
{tea.tusar, jordan.cork, andrejaana.andova, bogdan.filipic}@ijs.si

Abstract

This paper presents the Optimization Problem Inspector (OPI) tool for assisting researchers and practitioners in analyzing industrial optimization problems and their solutions. OPI is a highly interactive web application requiring no programming knowledge to be used. It helps the users to better understand their problem by: 1) comparing the landscape features of the analyzed problem with those of some well-understood reference problems, and 2) visualizing the values of solution variables, objectives, constraints and any other user-specified solution parameters. The features of OPI are presented using a bi-objective pressure vessel design problem as an example.

Keywords

optimization, black-box problems, sampling, problem characterization, visualization

1 Introduction

Industrial optimization problems often require simulations to evaluate solutions. For example, in electrical motor design [18, 19], assessing the efficiency and electromagnetic performance of a proposed design is done by running a simulator that analyzes the motor magnetic field and flux distribution. Such evaluations are black boxes to the user and the optimization algorithm alike, i.e., the underlying functions cannot be explicitly expressed, which makes the problem hard to understand and solve.

The established way to gain a better understanding of industrial problems is through the analysis of their solutions. Depending on the problem at hand, this can be a challenging task, as industrial problems often have a large number of variables, multiple objectives and constraints [20].

The Optimization Problem Inspector (OPI) presented in this paper is a tool conceived to ease this task for both problem experts and optimization algorithm developers. OPI provides two ways to further the understanding of an optimization problem:

(1) It computes a set of landscape features of the analyzed problem and compares them to those of well-understood reference problems.
(2) It provides visualizations of solutions through the values of their variables, objectives, constraints and any other user-specified solution parameters.

OPI is a web application, implemented by a Python library called optimization-problem-inspector included in the PyPI Python package index¹. It is highly interactive and requires no programming knowledge to be used.

¹ https://pypi.org/project/optimization-problem-inspector/

Freely available contemporary software tools for multiobjective optimization, such as DESDEO [12], jMetal [7] (and jMetalPy [2]), the MOEA Framework [8], ParadisEO-MOEO [10], platEMO [17], pygmo [3], pymoo [4], and Scilab [15], provide the implementation of various optimization algorithms and test problems. While the majority of them include some visualization of solutions, the plots are mostly focused on showing algorithm results for the purpose of comparing algorithm performance and not on increasing problem understanding. In addition, none of these tools compute additional problem features as OPI does. Therefore, OPI brings a unique perspective to optimization problem analysis and understanding.

Next, Section 2 presents the real-world problem that will be used to showcase the features of OPI in Section 3. The paper concludes with some remarks in Section 4.

2 Real-World Use Case

Our chosen real-world problem is a version of the well-known pressure vessel design problem, first proposed more than 30 years ago [16]. In this work, we adapt the formulation from [5] to handle the pressure vessel volume as a constraint, as well as an objective. We also remove one unnecessary constraint and use the original boundary constraints for the first two variables.

A pressure vessel is a tank, designed to store compressed gasses or liquids. It consists of a cylindrical middle part capped at both ends by hemispherical heads.
The pressure vessel has four design variables (see Figure 1): the shell thickness, x1 = Ts, the head thickness, x2 = Th, the inner radius, x3 = R, and the length of the cylindrical section of the vessel, x4 = L. The two thickness variables are integer multiples of 0.0625 inches, which correspond to the available thicknesses of rolled steel plates, while the length and the radius are continuous. The problem has three constraints, two on the search variables and one on the volume. Its two objectives are to minimize the total costs, including the costs of the material, forming and welding, and to maximize the volume. The problem is formally defined as follows:

$$
\begin{aligned}
\min\; f_1(\mathbf{x}) &= 0.6224\,z_1 x_3 x_4 + 1.7781\,z_2 x_3^2 + 3.1661\,z_1^2 x_4 + 19.84\,z_1^2 x_3\\
\max\; f_2(\mathbf{x}) &= \pi x_3^2 x_4 + \tfrac{4}{3}\pi x_3^3\\
\text{subject to}\quad g_1(\mathbf{x}) &= 0.0193\,x_3 - z_1 \le 0\\
g_2(\mathbf{x}) &= 0.00954\,x_3 - z_2 \le 0\\
g_3(\mathbf{x}) &= f_2(\mathbf{x}) \ge 1\,296\,000\\
x_1 &\in \{18, \ldots, 32\}\\
x_2 &\in \{10, \ldots, 32\}\\
x_3, x_4 &\in [10, 200],
\end{aligned}
$$

where $z_1 = 0.0625\,x_1$ and $z_2 = 0.0625\,x_2$.

Figure 1: Pressure vessel design variables (x1 = Ts, x2 = Th, x3 = R, x4 = L).
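Written out in code, the formulation reads as follows. This is a minimal sketch of the kind of external evaluation script that produces the data OPI consumes (see Section 3.3), not part of the OPI library; rewriting g3 in the "feasible when ≤ 0" form is a convention chosen here for uniformity.

```python
# The pressure vessel problem as an external evaluation function; a sketch
# for producing OPI-style data, not code from the OPI library itself.
import math

def evaluate(x1, x2, x3, x4):
    z1, z2 = 0.0625 * x1, 0.0625 * x2                    # thicknesses in inches
    f1 = (0.6224 * z1 * x3 * x4 + 1.7781 * z2 * x3**2
          + 3.1661 * z1**2 * x4 + 19.84 * z1**2 * x3)    # total cost (minimize)
    f2 = math.pi * x3**2 * x4 + 4.0 / 3.0 * math.pi * x3**3  # volume (maximize)
    return {
        "f1": f1,
        "f2": f2,
        "g1": 0.0193 * x3 - z1,    # feasible when <= 0
        "g2": 0.00954 * x3 - z2,   # feasible when <= 0
        "g3": 1_296_000 - f2,      # volume constraint rewritten as <= 0
    }

print(evaluate(x1=20, x2=12, x3=50.0, x4=100.0))
```

3 Optimization Problem Inspector Features

OPI is a web application, organized into five functional sections and a help section providing guidance to the user. OPI expects the user to provide the problem specification and its data (evaluated problem solutions). Then, it generates and visualizes comparisons to artificial reference problems and visualizes the provided data. Next, we describe the main features of OPI through its five content sections: problem specification, sample generation, data, comparison to reference problems, and data visualization.

3.1 Problem Specification

In the first OPI section, the user can provide the specification of the industrial problem to be studied. The tool needs this information to properly generate the samples, described in Section 3.2, and to set up the visualisations.

The problem specification must be given in the yaml file format and needs to contain some basic information about the problem parameters (variables, objectives, constraints) to be included in the analysis. OPI can handle one or more objectives and zero or more constraints. In addition to variables, objectives and constraints, the user can specify any number of other parameters that they want analyzed and visualized, for example, the name of the algorithm that found a solution or the time required to evaluate a solution.

For each of the parameters, the user needs to specify its name and its grouping (whether it is a variable, objective, constraint or something else). For variables, their type (continuous, integer or categorical) and the upper and lower bounds (for non-categorical types) are also required. An example yaml file, specifying a constrained multiobjective problem with several variables, is already provided within the tool to guide the user.

For the pressure vessel design problem, we can input four variables (the first two are integer and the last two are continuous), two objectives and three constraints. Alternatively, we can decide to skip the individual constraints and only use the total constraint violation instead.

3.2 Sample Generation

In OPI, a sample is a set of x-values, corresponding to the variables set in the problem specification section. In other words, a sample is a set of non-evaluated solutions.

If needed, the sample can be generated by the tool itself, based on the variable information provided in the problem specification step. However, this is not a required step in using OPI. A user that already has a set of (evaluated) solutions to work with can skip it and input the data directly (see Section 3.3).

Sample generation requires one to choose the number of desired samples, set to a default of 100, and the sample generation method. Three sample generation methods are supported: random, Sobol and Latin Hypercube, with random sampling being the default. The user may alter the settings of these sampling methods, such as the random generator seed. Selecting the button to generate and download the sample will download it in a csv-formatted file.

In the pressure vessel use case, OPI warns the user that not all sample generation methods are appropriate. In fact, the Sobol sampler and the Latin Hypercube sampler are not compatible with non-continuous parameters. If used nevertheless, they may produce unexpected results.

For completeness, the stand-alone sketch below generates a comparable random sample for the pressure vessel variables and writes it to a csv file. It is not the OPI implementation; the integer variables are drawn uniformly from their admissible levels, which is exactly what makes the Sobol and Latin Hypercube samplers problematic here.

```python
# Stand-alone sketch of random sample generation for the pressure vessel
# variables (not the OPI implementation); integer variables are drawn
# uniformly from their admissible levels.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n = 100  # OPI's default sample size

sample = pd.DataFrame({
    "x1": rng.integers(18, 33, size=n),      # integer, {18, ..., 32}
    "x2": rng.integers(10, 33, size=n),      # integer, {10, ..., 32}
    "x3": rng.uniform(10.0, 200.0, size=n),  # continuous, [10, 200]
    "x4": rng.uniform(10.0, 200.0, size=n),  # continuous, [10, 200]
})
sample.to_csv("pressure_vessel_sample.csv", index=False)  # file name is arbitrary
```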
3.3 Data

In OPI, the data is essentially a set of evaluated solutions, where each solution must contain a value for all objectives, constraints and other parameters included in the problem specification. The evaluation is conducted externally to the tool.

The data needs to be uploaded in a file in csv format. If any parameters from the problem specification are missing from the data, the tool displays a warning message. Any data parameters that are not included in the problem specification are ignored without raising any warnings. When the data is correctly input, the user can view and inspect it in tabular format.

Inputting the data completes the setup stage of the process. The user may then begin generating visualisations to assist them in understanding their problem.

3.4 Comparison to Reference Problems

The first visualization mechanism provided by OPI visually compares the problem to a set of artificial reference problems with known properties. This is done by displaying the landscape features of the user-defined problem alongside the same features of each of the reference problems in a parallel coordinates plot. The plot is interactive: the user can highlight some of the problems by brushing along one of the parallel axes. In addition, the feature values can be viewed in a table and downloaded to a file in csv format.

The reference problems can be set by the user; however, they are confined to the collection labelled here as GBBOB, i.e., generalised BBOB, where BBOB stands for the well-known suite of 24 Black-Box Optimization Benchmarking problems with diverse properties [9]. OPI provides a generator of GBBOB problems that match the analyzed problem in terms of the number of variables and objectives and the presence or absence of constraints. For the objectives and (optionally) the constraint, any single-objective BBOB problem instance can be used. The user can specify the desired GBBOB problems in the yaml format. OPI already contains five GBBOB problems to start with.

A problem can be characterized by a large number of features, most of which are hard for a human to interpret. In OPI, we included the following problem landscape features that are understandable to an expert user [1, 11, 13, 14]: CorrObj, MinCV, FR, constr_obj_corr, H_MAX, UPO_N, PO_N and a set of neighborhood features. CorrObj is a feature that shows the correlation between the objectives. MinCV represents the minimum constraint violation among all solutions in the population. FR represents the proportion of feasible solutions in the population. constr_obj_corr presents the maximum correlation between the constraints and all the problem objectives. H_MAX is the maximum information content among all objectives. UPO_N is the proportion of unconstrained non-dominated solutions, while PO_N is the proportion of the constrained non-dominated solutions. The neighbourhood features, denoted by neighbourhood_feats, are a collection of features explaining the neighborhood of solutions, e.g., how many neighbors of a solution dominate the solution, how many neighbors are dominated by the solution, how many are incomparable to the solution, how close the neighboring solutions are, etc. OPI offers a total of 16 features, but the user can choose which to compute and visualize.

Figure 2: The initial part of the parallel coordinate plot visualizing feature values for the analyzed problem and the chosen set of artificial test problems.

Figure 2 shows the initial part of the parallel coordinates plot (as the entire plot would not fit the paper) for the pressure vessel problem. In the comparison, we use the default five GBBOB reference problems as well as a custom-created one. We notice that the pressure vessel problem is most similar to the custom GBBOB problem with the first objective equal to the step ellipsoid function $f_7$, the second to the multimodal peaks function $f_{22}$, and the constraint to the linear function $f_5$. This similarity might be due to our mixed-integer problem containing plateaus in the continuous landscape space in which the features are computed, which is similar to the step ellipsoid function, and having linear constraints.
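The exact feature definitions follow [1, 11, 13, 14]; to give a flavor of how simple some of them are, the sketch below computes CorrObj, MinCV and FR from a sample, assuming a two-objective problem and the common convention that $g \leq 0$ means a constraint is satisfied (for the pressure vessel, $g_3$ would first need its sign flipped):

```python
import numpy as np

def basic_features(F, G):
    """Compute a few of the listed landscape features from a sample.

    F: (n, 2) array of objective values; G: (n, k) array of
    constraint values with g <= 0 meaning satisfied.
    """
    corr_obj = np.corrcoef(F[:, 0], F[:, 1])[0, 1]  # CorrObj
    cv = np.maximum(G, 0).sum(axis=1)  # violation per solution
    min_cv = cv.min()                  # MinCV
    fr = np.mean(cv == 0)              # FR: share of feasible solutions
    return corr_obj, min_cv, fr

rng = np.random.default_rng(0)
print(basic_features(rng.random((100, 2)), rng.random((100, 3)) - 0.5))
```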
3.5 Data Visualization

In the data visualization section of the web application, the supplied data can be visualized using either a scatter plot matrix or a parallel coordinates plot. In both cases, the user can choose which problem parameters to visualize among all those listed in the problem specification. Additionally, simple data filtering that limits any variable between the desired minimum and maximum values is also supported and can be manipulated via the OPI interface in yaml format. The parameter used for coloring the solutions, as well as the color map, can also be specified by the user. Both visualizations support interaction and can be downloaded in html or png format.

3.5.1 Scatter Plot Matrix. The scatter plot matrix consists of $n^2$ plots for $n$ chosen problem parameters, as it contains 2-D scatter plots for all possible parameter pairs. In OPI, the user can apply brushing and linking to select the desired solutions in one or more of the scatter plots. These are then highlighted in all scatter plots in the matrix.

Figure 3 shows such a scatter plot matrix for our pressure vessel problem. This visualization includes data from two sources. The first comes from a random sampling of the search space (shown in light blue) and the second from running the NSGA-II algorithm [6] on this problem for $2 \cdot 10^6$ function evaluations to achieve a good approximation of the Pareto front (shown in black). The two sources are set apart by a custom parameter that is then used for coloring the solutions. Some solutions from Figure 3 are highlighted; see the rectangle in the $(x_3, x_1)$ scatter plot (third from the left in the top row). These plots clearly show the linear relationship of the near-optimal solutions between $x_1$ and $x_2$ as well as $x_1$ and $x_3$. When only $f_1$ and $f_2$ are chosen, it is distinctly visible that the Pareto set approximation is piece-wise linear and disconnected.

Figure 3: Random (light blue) and near-optimal (black) solutions of the pressure vessel design problem visualized in OPI with a scatter plot matrix containing variables $x_1$ to $x_4$ and objectives $f_1$ and $f_2$.

3.5.2 Parallel Coordinates Plot. The parallel coordinates plot shows all chosen parameters as parallel coordinates and solutions as lines in the plot. Similarly to the scatter plot matrix, interaction via brushing and linking is supported to select solutions that fit the desired values.
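The paper does not disclose how OPI renders these plots; an equivalent interactive scatter plot matrix with html and png export can, for instance, be produced with plotly (all column names below, including the coloring parameter "source", are illustrative):

```python
import pandas as pd
import plotly.express as px

# Assumes a csv of evaluated solutions with a 'source' column that
# distinguishes random from near-optimal solutions.
df = pd.read_csv("solutions.csv")
fig = px.scatter_matrix(
    df,
    dimensions=["x1", "x2", "x3", "x4", "f1", "f2"],
    color="source",  # custom parameter used for coloring
)
fig.write_html("scatter_matrix.html")  # interactive export
fig.write_image("scatter_matrix.png")  # static export (needs kaleido)
```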
4 Conclusions

This work presented the features of the Optimization Problem Inspector – a web application to support problem experts and algorithm designers in gaining a better understanding of industrial optimization problems. The tool provides comparisons to well-understood reference problems and interactive and highly customizable visualizations, which can be exported in html and png formats. Samples can be exported and solutions imported using the standard csv format, which makes data exchange between OPI and various optimization software easy. OPI functionality is made to be simple and at the same time flexible. It is therefore usable by non-experts and experts alike, providing a wide range of angles from which to view the problems.

Acknowledgements

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P2-0209 "Artificial Intelligence and Intelligent Systems", and project No. N2-0254 "Constrained Multiobjective Optimization Based on Problem Landscape Analysis") and from the Jožef Stefan Innovation Fund (project "A Tool for Analysis of Industrial Optimization Problems and Their Solutions"). This publication is also based upon work from COST Action "Randomised Optimisation Algorithms Research Network" (ROAR-NET), CA22137, supported by COST (European Cooperation in Science and Technology). We are grateful to Jernej Zupančič for implementing the core functionalities of the Optimization Problem Inspector.

References

[1] Hanan Alsouly, Michael Kirley, and Mario Andrés Muñoz. 2023. An instance space analysis of constrained multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 27, 5, 1427–1439. doi: 10.1109/TEVC.2022.3208595.
[2] Antonio Benítez-Hidalgo, Antonio J. Nebro, José García-Nieto, Izaskun Oregi, and Javier Del Ser. 2019. jMetalPy: A Python framework for multi-objective optimization with metaheuristics. Swarm and Evolutionary Computation, 51, 100598. doi: 10.1016/J.SWEVO.2019.100598.
[3] Francesco Biscani and Dario Izzo. 2020. A parallel global multiobjective framework for optimization: pagmo. Journal of Open Source Software, 5, 53, 2338. doi: 10.21105/joss.02338.
[4] Julian Blank and Kalyanmoy Deb. 2020. Pymoo: Multi-objective optimization in Python. IEEE Access, 8, 89497–89509. doi: 10.1109/ACCESS.2020.2990567.
[5] Carlos A. Coello Coello. 2002. Theoretical and numerical constraint-handling techniques used with evolutionary algorithms: A survey of the state of the art. Computer Methods in Applied Mechanics and Engineering, 191, 11, 1245–1287. doi: 10.1016/S0045-7825(01)00323-1.
[6] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6, 2, 182–197. doi: 10.1109/4235.996017.
[7] Juan José Durillo and Antonio J. Nebro. 2011. jMetal: A Java framework for multi-objective optimization. Advances in Engineering Software, 42, 10, 760–771. doi: 10.1016/J.ADVENGSOFT.2011.05.014.
[8] David Hadka. 2024. MOEA Framework: A free and open source Java framework for multiobjective optimization. Computer software, version 4.4. https://github.com/MOEAFramework/MOEAFramework.
[9] Nikolaus Hansen, Steffen Finck, Raymond Ros, and Anne Auger. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions. Research Report RR-6829. INRIA. https://hal.inria.fr/inria-00362633v2.
[10] Arnaud Liefooghe, Laetitia Jourdan, and El-Ghazali Talbi. 2011. A software framework based on a conceptual unified model for evolutionary multiobjective optimization: ParadisEO-MOEO. European Journal of Operational Research, 209, 2, 104–112. doi: 10.1016/J.EJOR.2010.07.023.
[11] K. M. Malan, J. F. Oberholzer, and A. P. Engelbrecht. 2015. Characterising constrained continuous optimisation problems. In Proceedings of the 2015 IEEE Congress on Evolutionary Computation (CEC 2015), 1351–1358. doi: 10.1109/CEC.2015.7257045.
[12] Giovanni Misitano, Bhupinder Singh Saini, Bekir Afsar, Babooshka Shavazipour, and Kaisa Miettinen. 2021. DESDEO: The modular and open source framework for interactive multiobjective optimization. IEEE Access, 9, 148277–148295. doi: 10.1109/ACCESS.2021.3123825.
[13] Mario A. Muñoz, Michael Kirley, and Saman K. Halgamuge. 2015. Exploratory landscape analysis of continuous space optimization problems using information content. IEEE Transactions on Evolutionary Computation, 19, 1, 74–87. doi: 10.1109/TEVC.2014.2302006.
[14] Cyril Picard and Jürg Schiffmann. 2021. Realistic constrained multiobjective optimization benchmark problems from design. IEEE Transactions on Evolutionary Computation, 25, 2, 234–246. doi: 10.1109/TEVC.2020.3020046.
[15] Philippe Roux and Perrine Mathieu. 2016. Scilab: I. Fundamentals. In Scilab, from theory to practice. D-Booker Editions.
[16] E. Sandgren. 1990. Nonlinear integer and discrete programming in mechanical design optimization. Journal of Mechanical Design, 112, 2, 223–229.
[17] Ye Tian, Ran Cheng, Xingyi Zhang, and Yaochu Jin. 2017. PlatEMO: A MATLAB platform for evolutionary multi-objective optimization. IEEE Computational Intelligence Magazine, 12, 4, 73–87. doi: 10.1109/MCI.2017.2742868.
[18] Tea Tušar, Peter Korošec, and Bogdan Filipič. 2023. A multi-step evaluation process in electric motor design. In Slovenian Conference on Artificial Intelligence, Proceedings of the 26th International Multiconference Information Society (IS 2023). Vol. A. Jožef Stefan Institute, Ljubljana, Slovenia, 48–51.
[19] Tea Tušar, Peter Korošec, Gregor Papa, Bogdan Filipič, and Jurij Šilc. 2007. A comparative study of stochastic optimization methods in electric motor design. Applied Intelligence, 27, 2, 101–111. doi: 10.1007/S10489-006-0022-2.
[20] Koen van der Blom, Timo M. Deist, Vanessa Volz, Mariapia Marchi, Yusuke Nojima, Boris Naujoks, Akira Oyama, and Tea Tušar. 2023. Identifying properties of real-world optimisation problems through a questionnaire. In Many-Criteria Optimization and Decision Analysis: State-of-the-Art, Present Challenges, and Future Perspectives. Natural Computing Series. Dimo Brockhoff, Michael Emmerich, Boris Naujoks, and Robin C. Purshouse, editors. Springer, 59–80. doi: 10.1007/978-3-031-25263-1_3.
Multi-Agent System for Autonomous Table Football: A Winning Strategy

Marcel Založnik* (Jožef Stefan Institute, Ljubljana, Slovenia, marcel.zaloznik@gmail.com) and Kristjan Šoln* (Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia, ks4835@student.uni-lj.si)
* Both authors contributed equally to this research.

Abstract

This paper presents a multi-agent system (MAS) for autonomous table football, developed for the FuzbAI competition at the University of Ljubljana. Our system consists of four independent agents, each dynamically assigned specific roles (Goalkeeper, Defender, Midfielder, and Attacker) based on real-time game analysis. This role-based architecture enabled seamless coordination between offensive and defensive strategies, allowing our team to secure first place. We describe the simulation framework used, the processing of sensor data, and the control strategies that allowed the agents to execute precise actions in a dynamic environment. The results highlight the effectiveness of adaptive, role-based decision-making, demonstrating the potential of MAS in real-time, competitive settings.

Keywords

multi-agent system, autonomous table football, role-based strategy, real-time decision making, AI in robotics

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.8341

1 Introduction

The FuzbAI competition, held as part of the "Dnevi Avtomatike" event at the Faculty of Electrical Engineering, University of Ljubljana, is a premier contest for students specializing in automation and artificial intelligence [11]. This event challenges participants to develop intelligent autonomous agents capable of playing table football without human intervention. The competition not only serves as a platform for demonstrating technical skills but also fosters innovation in the application of AI and machine learning techniques in real-time environments. Figure 1 illustrates the table setup used in the competition.

Figure 1: Table setup for the FuzbAI autonomous football competition.
The FuzbAI competition is structured in a way that teams must design and implement a fully autonomous system capable of effectively competing against other AI-driven systems. Each match is a test of the participants' ability to integrate advanced algorithms and robotics, simulating the dynamics of a real football game on a miniature scale. The competitive format includes both qualification rounds and knockout stages, ensuring that only the most capable and innovative solutions advance to the final stages.

Our entry into the FuzbAI competition focused on the development of a multi-agent system (MAS), where each of our four rods functioned as an independent agent. These agents were designed to collaborate through a streamlined decision-making process, selecting roles that dictated their actions during gameplay. This strategic approach enabled our team to outperform competitors and ultimately secure first place in the competition.

This paper delves into the development and implementation of our multi-agent system. We explore the architectural choices, the role-based decision-making strategies employed by each agent, and the overall system's performance in the context of the FuzbAI competition.

2 Competition Setup and System Description

The FuzbAI competition required all participants to develop programs capable of playing table football autonomously. To facilitate this, the competition provided a standardized simulation environment and a set of initial tools that every team used as the foundation for their development. This section describes the simulation framework, the types of data available from the system, and the means by which agents could interact with both the simulated and real game environments.

2.1 Simulation Framework

Participants were provided with a Python-based simulation framework designed to emulate a real table football match, as shown in Figure 2. This simulator accurately replicated the physics of the game, including the movement of the ball and rods, and managed the interactions between the environment and the agents controlling the rods. The framework included fundamental functionalities such as ball tracking, rod positioning, and interaction rules, allowing all teams to concentrate on AI development without needing to construct the simulation infrastructure themselves.

Figure 2: Simulator interface.

One of the key features of the competition setup was that the interaction protocols for the simulator and the physical table were identical. The same signals and commands used to control the actuators in the simulator were also used for the real table without any modification. This feature ensured that teams could seamlessly transition their algorithms from the simulated environment to the physical table setup, which was used in the final rounds of the competition. As a result, the simulation provided a consistent testing ground that mirrored the actual physical setup, enabling teams to develop and refine their strategies under uniform conditions.
2.2 Sensor Data

Both the simulation environment and the real table provided each team with data from two cameras, one placed on each side of the table. Each camera captured different views of the game, and teams had to decide how to combine the information from both cameras. The data provided by each camera included:

• Ball position: the coordinates of the ball on the 2D plane of the table.
• Ball speed: the velocity of the ball.
• Ball size: the area of the ball in the captured image (in pixels).
• Rod positions: the calibrated positions of all rods (in the interval [0, 1]).
• Rod angles: the calibrated angles of all rods (in the interval [−32, +32]).

This camera data was streamed continuously, requiring teams to process and merge the inputs from both cameras to accurately interpret the game's state. The accuracy and frequency of the data were sufficient to enable real-time decision-making by the autonomous agents, whether interacting with the simulator or the physical table.

2.3 Actuator Outputs

To interact with the environment, each agent could send commands to the actuators that controlled the rods. The system allowed for two primary types of commands:

• Translatory movement: moving the rod left or right across the table.
• Rotational movement: rotating the rod to control the angle at which the players struck the ball.

Precise and timely commands were crucial for effective game control, as they enabled the agents to optimally position their figures, strike the ball accurately, and execute defensive or offensive strategies effectively.
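How the two camera streams were fused was left to each team. One simple heuristic, sketched below with illustrative field names (the paper does not prescribe this scheme), is to trust the camera that sees the ball best and to average the calibrated rod readings:

```python
from dataclasses import dataclass

@dataclass
class CameraFrame:
    """Per-camera measurements from Section 2.2 (names illustrative)."""
    ball_xy: tuple        # ball position on the 2D table plane
    ball_speed: float
    ball_area_px: float   # apparent ball size in the image
    rod_positions: list   # calibrated, in [0, 1]
    rod_angles: list      # calibrated, in [-32, +32]

def merge(frame_a: CameraFrame, frame_b: CameraFrame) -> CameraFrame:
    """Prefer the camera seeing the ball larger (closer, less occluded)
    for ball data; average the calibrated rod readings."""
    best = frame_a if frame_a.ball_area_px >= frame_b.ball_area_px else frame_b
    rods = [(a + b) / 2 for a, b in zip(frame_a.rod_positions,
                                        frame_b.rod_positions)]
    angles = [(a + b) / 2 for a, b in zip(frame_a.rod_angles,
                                          frame_b.rod_angles)]
    return CameraFrame(best.ball_xy, best.ball_speed, best.ball_area_px,
                       rods, angles)

a = CameraFrame((0.40, 0.50), 1.2, 90.0, [0.5] * 8, [0.0] * 8)
b = CameraFrame((0.41, 0.50), 1.1, 70.0, [0.52] * 8, [2.0] * 8)
print(merge(a, b).ball_xy)  # -> (0.40, 0.50), from the better camera
```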
3 Related Work

Research on multi-agent systems (MAS) and their application in robotic football has been extensively explored. This section reviews some contributions that have informed the development of autonomous systems for table football and real football.

Moos et al. (2024) [5] developed an automated football table as a research platform for reinforcement learning, highlighting the challenges of transferring learned behaviors from simulation to real-world environments and the need for robust algorithms to handle uncertainties. While reinforcement learning is a common approach in such studies, we did not achieve satisfactory results with it; we therefore decided to use multi-agent systems instead. Klančar et al. (2002) [4] investigated cooperative control in robot football (real football) using multi-agent systems, focusing on behavior-based control and dynamic role assignment among robots to optimize performance. Their approach emphasized effective communication for coordination in multi-agent settings. This work particularly inspired our approach to multi-agent systems, where we focused on behavior-based control and dynamic role assignment. Ribeiro et al. (2024) [6] proposed a probability-based strategy (PBS) for robotic football (real football), utilizing real-time data for centralized decision-making without relying heavily on pre-defined plays. Their approach demonstrated flexibility across different environments. Smit et al. (2023) [8] explored scaling multi-agent reinforcement learning (MARL) to a full 11v11 simulated football environment (real football), focusing on computational efficiency and the use of attention mechanisms to enhance scalability in large-scale multi-agent settings. Song et al. (2024) [9] conducted an empirical study on the Google Research Football platform (real football), introducing a population-based MARL training pipeline to quickly develop competitive AI players, highlighting the importance of scalable training frameworks. Scott et al. (2022) [7] examined end-to-end learning in RoboCup simulations (real football), optimizing both low-level skills and high-level strategies through competitive self-play, providing a comprehensive approach to multi-agent training in competitive environments.

4 MAS Approach to Autonomous Table Football Control

In this section, we describe the methodology of our approach. We describe the agent architecture and the different agent roles, and outline the actions they can take. Then, we discuss the conditions and priorities for role assignment during the game and evaluate the behavior of the system as a whole.

4.1 Agent Architecture

There exist several agent architectures commonly used in MAS. Approaches such as [4, 10, 12, 13] use a role-based approach for interaction between agents and with the environment. In the role-based approach, based on the concepts from role theory [1], the agents are assigned roles which affect their behavior. While the overall long-term goal of the system is typically predefined and does not change, e.g., to win a table football match, the current role of an individual agent defines the agent's short-term goals, which influence the agent's behavior, its decision-making process, and how it interacts with the rest of the system. Furthermore, separating agent functionality into independent roles can simplify and decouple individual tasks, leading to a more modular system, which can simplify and improve the extensibility of the implementation [3].

There exist several approaches to role and behavior implementation in MAS, such as merging different roles, role models and class members [2, 3, 4]. In our implementation, we simplify the architecture by allowing an agent to occupy only a single role at a time, and by defining the roles in a way that allows reassignment between iterations of the algorithm without regard to the previous role or state of the agent.

Each role defines a set of possible actions an agent can take. The agents decide which action to take based on their priority and the current environment. More complex roles can be implemented in a stateful manner, meaning the decision on which action to take depends on previous actions as well. An agent can only be assigned a single role at a time, but can switch between roles throughout iterations, regardless of whether the particular goal is fulfilled, when appropriate conditions arise. Additionally, every agent must have a role assigned at all times.

An action is a discrete, autonomous task that an agent can take on by making appropriate decisions and acting on the environment, e.g., by sending commands to the actuators. This advances the agent toward the goal imposed by the current role. An agent can only execute a single action at a time. Additionally, every agent must be actively executing an action at all times.

These concepts were implemented using an object-oriented approach, as suggested by the authors of the competition. In our implementation, each agent repeatedly executes a fast processing routine. In every iteration, the environment data is updated and role selection for the agent is performed. Then, as the agent decides on a role for that iteration, the appropriate role processing function is called, executing individual actions.
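A minimal object-oriented sketch of this iteration loop could look as follows; all class and method names are illustrative, not taken from our actual implementation:

```python
class Role:
    """Base role: defines the short-term goal and available actions."""
    name = "role"
    def act(self, agent, state):
        raise NotImplementedError

class Midfielder(Role):
    name = "midfielder"
    def act(self, agent, state):
        agent.command(rotation=30)  # raise figures to let the ball pass

class Defender(Role):
    name = "defender"
    def act(self, agent, state):
        agent.command(position=state["ball_x"])  # follow the ball

class Agent:
    """One rod acting as an independent agent."""
    def __init__(self, rod_id):
        self.rod_id = rod_id
        self.role = Midfielder()  # an agent always holds a role

    def select_role(self, state):
        # Simplified stand-in for the conditions of Section 4.3.
        return Defender() if state["ball_in_front"] else Midfielder()

    def command(self, position=None, rotation=None):
        print(self.rod_id, self.role.name, position, rotation)

    def step(self, state):
        """One iteration of the fast processing routine."""
        self.role = self.select_role(state)  # role selection
        self.role.act(self, state)           # execute the role's action

agent = Agent(rod_id=2)
agent.step({"ball_x": 0.4, "ball_in_front": True})
```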
4.2 Role Description

A typical table football setup consists of four rods per player, each with a number of mounted figures. In this implementation, each rod is considered an agent, resulting in a system with four agents, for which we define the following roles, typically associated with table football games.

Goalkeeper is the final line of defense, primarily responsible for intercepting the ball before it reaches the goal. Typically the left-most rod, which is nearest to the goal and has a single figure, the goalkeeper follows the ball position using two possible actions: follow and misaligned follow. The follow action simply tries to align the figure on the rod with the current ball position. However, if the velocity of the ball exceeds a predefined threshold, the agent instead attempts to estimate the ball trajectory based on its velocity vector. This estimation is simplified by assuming that the ball maintains a straight-line path. The figure is therefore positioned at the intersection of the rod and the estimated trajectory in an attempt to intercept a fast-moving ball.

The misaligned follow action is an augmented variant of the former action, designed to increase the overall defense surface of the defending agents. A common scenario in table football occurs when an attacker attempts to bypass the defenses by slightly pushing the ball parallel to the rod and striking it immediately after. Even though a human player might react fast enough to block such an attack, actuator response times are often insufficient. A defense strategy against such attacks is to misalign the goalkeeper and defender figures, increasing the defense surface. Here, this is implemented by the misaligned follow action, which is activated whenever the ball is relatively slow, in the possession of the opponent, and another agent in front of the Goalkeeper is currently in a Defender role. This decreases the chances of the opponent scoring even if the actuators fail to respond fast enough to block this style of attack. Here, communication between the two agents is performed implicitly, as each agent perceives the roles of other agents as a part of the overall environmental state.

Defender is an agent tasked with blocking opponent attacks by intercepting the ball when it is in the opponent's possession or moving towards the goal. This role utilizes a single follow action, similar to the Goalkeeper's follow action. Whenever the Defender role is active, the agent tracks the position and velocity of the ball, trying to match either its current coordinate or the estimated intersection with the trajectory of the ball. The agent identifies the figure closest to the intersection and attempts to move the rod using a minimal amount of movement. This approach allows for faster adjustments during the game, improving defensive efficiency.

Midfielder is an agent role with the primary task of raising the figures to allow passing the ball from behind the current agent. This role, although simple, is essential in order to avoid accidentally breaking a friendly attack by an Attacker agent behind the current rod.

Attacker is an agent with the task of kicking the ball towards the opponent goal in an attempt to score a point. Unlike other roles, the Attacker role is implemented in a stateful manner. Actions can only happen in a specified order, when the corresponding conditions are met. The role implements follow, kick and prevent back-kick actions.

Whenever the agent is assigned this role, the follow action is executed first. During the follow action, the agent slightly raises the figures in order to prepare for a kick. The figure closest to the ball is selected and the rod offset is adjusted in order to align the figure with the ball. Whenever the agent determines that the alignment with the ball is sufficient, the agent moves on to the next state, the kick action. Here, the rod is rotated in order to strike the ball. During this state, it is still necessary to track the position of the ball, as the ball can move significantly within a few iterations of the algorithm. As the rod completes the forward rotation, the agent monitors the position of the ball and assesses whether the figure successfully hit the ball. In that case, the next action is set back to follow, and the agent is usually assigned a new role according to the environment. However, if the figure missed the ball during the kick, the agent moves on to the prevent back-kick action. This final action is meant to prevent an accidental kick in the direction opposite of the intended one. The rod is translated sideways and slowly rotated into a neutral position, in order to circumvent the ball. While executing this action, role switching for the current agent is disabled as well.

During execution, the agent aligns the rod position with the ball; however, a perfectly aligned figure results in a straight shot, which is easily defended by maintaining alignment with the ball. A more effective strategy involves kicking at an angle to aim for the goal or create a rebound off the wall, which is harder to defend. This role achieves this by slightly misaligning the figure before and during the kick. The agent computes the angle between the ball's current position and the selected target, with the figure's required misalignment set proportionally to this angle and adjusted by a tunable parameter for fine-tuning. This attack strategy significantly increases the performance of the Attacker role.
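The straight-line interception used by the follow actions reduces to a one-line geometric computation. The sketch below assumes the figure moves along one axis of the table plane while the rod line is fixed on the other; the coordinate convention and the speed threshold are illustrative:

```python
def intercept_x(ball_pos, ball_vel, rod_y, speed_threshold=0.2):
    """Estimate where to place the defending figure on a rod.

    Assumes the ball keeps a straight-line path, as in the follow
    action of the Goalkeeper and Defender roles.
    """
    bx, by = ball_pos
    vx, vy = ball_vel
    speed = (vx**2 + vy**2) ** 0.5
    if speed < speed_threshold or vy == 0:
        return bx  # slow ball: simply mirror its coordinate
    t = (rod_y - by) / vy  # time until the ball reaches the rod line
    if t < 0:
        return bx  # ball moving away from the rod
    return bx + vx * t  # intersection of trajectory and rod line

print(intercept_x((0.3, 0.2), (0.1, 0.4), rod_y=0.6))  # -> 0.4
```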
4.3 Role Assignment

Individual roles are assigned to agents according to defined assignment conditions and rules. Some approaches use an objective function in order to select a role, often taking role priority into account [4]. In this approach, we instead define a simple set of conditions which, along with role priority, decide on the most appropriate role for a particular agent based on the current state of the environment.

If, at a particular instant, more than one role fulfills the assignment conditions for a particular agent, the role with the higher priority is selected. In this implementation, the highest priority belongs to the Attacker role, followed by the Goalkeeper, the Defender and finally the Midfielder with the lowest priority. This ordering is based on the strictness of the assignment conditions for each role and the importance of that particular role. For example, the Attacker role has the strictest selection conditions among all roles and is therefore assigned the highest priority, while the Midfielder role has a very broad assignment condition and is not as important compared to an Attacker agent.

We define the role selection conditions as follows. The Attacker role is selected whenever the ball speed drops below a specified threshold and the ball is within kicking clearance of the rod. The Goalkeeper role is selected if that particular agent belongs to the left-most rod, closest to the player's goal. The Defender role is selected whenever the ball is in front of the rod. Lastly, the Midfielder role is selected whenever the ball is behind the rod, as the role's only task is to raise the figures to allow the ball to pass forward.

This set of conditions, combined with the defined role priority, allows the agents to switch between roles effectively and covers the main functionality required to play the game. Role priority ensures that the agent works toward a correct goal based on the circumstances. For example, any rod, even the Goalkeeper, should attempt to kick the ball if it is close and slow enough, while only the left-most rod should attempt to be the goalkeeper. A minimal sketch of this priority-ordered selection is given below.
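Since roles are checked in strict priority order, the whole assignment rule fits in one function. The field names and the threshold value below are illustrative placeholders, not constants from our system:

```python
KICK_SPEED_THRESHOLD = 0.1  # illustrative value; tuned in practice

def assign_role(agent, state):
    """Return the highest-priority role whose condition holds.

    Priority order: Attacker > Goalkeeper > Defender > Midfielder.
    """
    if (state["ball_speed"] < KICK_SPEED_THRESHOLD
            and state["ball_in_kicking_range"]):
        return "attacker"
    if agent["is_leftmost_rod"]:
        return "goalkeeper"
    if state["ball_in_front_of_rod"]:
        return "defender"
    return "midfielder"  # broad fallback: ball is behind the rod

print(assign_role({"is_leftmost_rod": False},
                  {"ball_speed": 0.02, "ball_in_kicking_range": True,
                   "ball_in_front_of_rod": True}))  # -> attacker
```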
4.4 Behavior of the System as a Whole

The system's primary offensive strategy is for the Attacker agents to advance the ball as far forward as possible, ultimately aiming for the goal, while Midfielder agents ensure that they do not obstruct forward passes. During opponent attacks, the system's primary defensive strategy is for the Defender and Goalkeeper roles to intercept the ball. In certain situations, they collaborate to expand the defense surface, compensating for the limitations posed by actuator response times. Once the opponent's attack ends, agents detect the change in the environment and the roles are reassigned to shift the game towards offensive play.

The system's game strategy can be adjusted by modifying parameters such as role priority, assignment rules, or individual actions. For instance, a more defensive strategy can be achieved by tightening the conditions for assigning the Attacker role.

Overall, the implemented algorithm performs well, with the combination of discrete roles resulting in competent gameplay. However, delays and noise present in the measurements, and delays due to actuator response times, sometimes cause the system to miss, e.g., during attacks. The prevent back-kick action of the Attacker role proves essential in such situations, performing careful repositioning. Another surprisingly successful strategy is aiming at the goal or the wall during the attack action. Even if the ball does not follow the intended trajectory due to measurement noise and system delays, it still considerably increases the attack success rate. Additionally, even though there are no explicit, intentional passes between agents, the strategy of simply passing the ball as far forward as possible is enough for successful gameplay.

The system overall is sensitive to changes in parameters and requires precise tuning. The simulator, although effective, does not perfectly simulate the physical table, and additional parameter tuning is required when transitioning from the simulator to the real-world application.

5 Conclusion

This paper presented a multi-agent system (MAS) for autonomous table football, developed for the FuzbAI competition. Our role-based design allowed each rod to act as an independent agent, dynamically adapting to the game state. This approach enabled effective coordination between offense and defense, contributing to our system's first-place win.

The results demonstrate the effectiveness of a modular, adaptive architecture in dynamic environments, highlighting the importance of robust decision-making and quick role-switching. Future work could include machine learning to predict opponent behavior and optimize strategies, as well as expanding the system to more complex environments. Overall, our MAS showed strong performance in a competitive setting, offering valuable insights for future developments in autonomous systems.
References

[1] Bruce J. Biddle. 1986. Recent developments in role theory. Annual Review of Sociology, 12, 1, 67–92.
[2] G. Cabri, L. Ferrari, and L. Leonardi. 2004. Agent role-based collaboration and coordination: a survey about existing approaches. In 2004 IEEE International Conference on Systems, Man and Cybernetics. Vol. 6. IEEE. doi: 10.1109/ICSMC.2004.1401064.
[3] E. A. Kendall. 1999. Role modelling for agent system analysis, design, and implementation. In Proceedings of the 1st International Symposium on Agent Systems and Applications and 3rd International Symposium on Mobile Agents (ASA/MA 1999). IEEE, 204–218. doi: 10.1109/ASAMA.1999.805405.
[4] Gregor Klančar, Marko Lepetič, Boštjan Potočnik, Rihard Karba, and Drago Matko. 2002. Cooperative control of mobile agents in soccer game. Faculty of Electrical Engineering, University of Ljubljana, Slovenia.
[5] Janosch Moos, Cedric Derstroff, Niklas Schröder, and Debora Clever. 2024. Learning to play foosball: system and baselines. arXiv. doi: 10.48550/arxiv.2407.16606.
[6] António Fernando Alcântara Ribeiro, Ana Carolina Coelho Lopes, Tiago Alcântara Ribeiro, Nino Sancho Sampaio Martins Pereira, Gil Teixeira Lopes, and António Fernando Macedo Ribeiro. 2024. Probability-based strategy for a football multi-agent autonomous robot system. Robotics, 13, 1. doi: 10.3390/robotics13010005.
[7] Atom Scott, Keisuke Fujii, and Masaki Onishi. 2022. How does AI play football? An analysis of RL and real-world football strategies. In International Conference on Agents and Artificial Intelligence. Vol. 1, 42–52. doi: 10.5220/0010844300003116.
[8] Andries Smit, Herman A. Engelbrecht, Willie Brink, and Arnu Pretorius. 2023. Scaling multi-agent reinforcement learning to full 11 versus 11 simulated robotic football. Autonomous Agents and Multi-Agent Systems, 37, 1. doi: 10.1007/s10458-023-09603-y.
[9] Yan Song, He Jiang, Zheng Tian, Haifeng Zhang, Yingping Zhang, Jiangcheng Zhu, Zonghong Dai, Weinan Zhang, and Jun Wang. 2024. An empirical study on Google Research Football multi-agent scenarios. International Journal of Automation and Computing, 21, 3, 549–570. doi: 10.1007/s11633-023-1426-8.
[10] Manuela Veloso, Peter Stone, and Kwun Han. 1998. The CMUnited-97 robotic soccer team: perception and multiagent control. In Proceedings of the Second International Conference on Autonomous Agents, 78–85.
[11] Laboratorij za avtomatiko in kibernetiko. 2024. Dnevi avtomatike. Accessed: 2024-08-21. https://dnevi-avtomatike.si/?page_id=270.
[12] Franco Zambonelli, Nicholas R. Jennings, and Michael Wooldridge. 2003. Developing multiagent systems: the Gaia methodology. ACM Transactions on Software Engineering and Methodology, 12, 3, 317–370. doi: 10.1145/958961.958963.
[13] Xiaoqin Zhang, Haiping Xu, and Bhavesh Shrestha. 2007. An integrated role-based approach for modeling, designing and implementing multi-agent systems. Journal of the Brazilian Computer Society, 13, 2, 45–60. doi: 10.1007/bf03192409.
Towards a Decision Support System for Project Planning: Multi-Criteria Evaluation of Past Projects Success

Miha Hafner (Elea iC d.o.o., Department for Tunnels and Geotechnics, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, miha.hafner@elea.si) and Marko Bohanec (Jožef Stefan Institute, Department of Knowledge Technologies, Ljubljana, Slovenia, marko.bohanec@ijs.si)

Abstract

Project planning typically refers to the project management step in which project assets, timelines, budgets, milestones, subcontractors, etc., are determined before the new project starts. In this paper, we address infrastructure design projects in the context of a specific company (Elea iC) and explore the idea of using data about past finished projects to help project managers and project leaders in project planning. A crucial requirement in this context is the ability to evaluate/assess the success of finished/new projects. This paper proposes a solution using a multi-criteria model to evaluate finished projects. This way, we add project success information to the finished projects database, which we shall use in the decision support system being designed to extract knowledge for the new project plan.

Keywords

Project success evaluation, multi-criteria model, decision support systems, data analysis, data mining, project management, project leading tools

© 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.scai.8463

1 Introduction

Infrastructure, such as tunnels, bridges, schools, houses, sewage systems, roads, etc., and its design discipline play a vital role in society. Thus, infrastructure design must have properly and thoroughly defined requirements, objectives, scope and constraints concerning many expert fields such as civil engineering, architecture, geology, geotechnics, environmental engineering, urban planning, and other expert fields [1], [2], [3].

The term design is connected to the process that ends with technical documentation, technical approvals, models, and other deliverables prepared at the end of the design process. Each such process is referred to as the project [4]. The projects are expected to have clearly defined:

• Goals defining the project's desired result, e.g., a building permit for a bridge, static analysis of a retaining wall, architectural design for a subway station, geotechnical exploration for a tunnel, etc. [4].
• Objectives that support project goals, including concrete and measurable project characteristics such as deliverables, milestones, and other steps and strategies to achieve the goals [7].
• Scope and requirements concerning project boundaries, e.g., the need for experts, potential subcontractors, technical equipment and other requirements to finish the project.
• Constraints and limitations concerning project deadlines, costs, etc. [8].

Besides that, each project should finish with the client's and stakeholders' satisfaction [8]. To achieve the above for the new project, project planning is vital at the beginning of each new project [6], [8]. It is the task of project management and project leaders to recognize and include all these in the project plan so that the work and processes lead to successful project completion.

This study aims to support this process in the context of the Elea iC company, an interdisciplinary provider of engineering services and projects in Slovenia [5]. We wanted to include the knowledge obtained from past finished projects in the project planning process for new projects. The company has collected this data since 2001. The assumptions are as follows:

1. The finished projects in the database offer valuable information for the new project planning phase.
2. The project workflows and requirements established in the company remained similar over the years.

The main challenge related to this question is the new project success assessment and its consideration in light of the available finished project data [7].
Unfortunately, the actual finished projects database does not contain much information about the finished projects' success. To bridge this, we had to construct a project success evaluation model, evaluate the finished projects in the database and add this information to the database. The expected result of those steps is a database suitable for applying data-analysis and knowledge extraction methods, such as hierarchical clustering and machine learning [20].

This paper describes the finished project success evaluation component (hereafter called FPSE), which is part of the future decision support system (DSS, [12], [13], [14]) for project planning (hereafter called E-DSS). First, we present the general architecture of the E-DSS, explaining the role and integration of FPSE in its context. In section 3, we present the database of finished projects and its preparation for supporting the configuration of new projects. The evaluation model used and the experimental evaluation of FPSE are presented in sections 4 and 5, respectively. Section 6 concludes the paper.

2 E-DSS Architecture

E-DSS is a DSS under construction to support the project management and project leaders in the Elea iC company (hereafter called "the user") in configuring the new project plan parameters when a new project starts. The user is expected to define the E-DSS input as shown in Figure 1: the new project objectives, requirements, desires, and expectations. Practically, this means that the user collects all the available new project data by:

• Extracting the new project data from the new project assignment and contracts containing relevant information for the project planning.
• Checking the company's and potential subcontractors' state of the resources and assets needed to complete the new project.

Examples of those data include the projected monetary value, project scope and goals, project start and finish date, the expert fields needed for project completion, etc.

The E-DSS output (Figure 1) consists of the new project plan configuration together with the corresponding success scores (+S). The configuration comprises data such as the number of employees involved, the number of subcontractors, work distribution, work duration, the number of pauses, etc. Project success scores are assessed assuming these configuration settings.

Figure 1: E-DSS architecture

Accordingly, E-DSS is composed of the following components (Figure 1):

• NPPE (New Project Parameters Extraction) is the component that extracts the potential new project configuration parameters and corresponding data to support the decision-making. NPPE is currently under construction and is aimed to operate interactively with the user and support: searching for similar projects in FPD+S according to a predefined range, searching the projects by desired success score, project segmentation and project group identification (unsupervised descriptive analytics), and parameter prediction by supervised machine learning methods. The component NPPE+S inside NPPE evaluates the success of the potential new project's configuration parameters obtained. The evaluation is made by EM, which is part of FPSE.
• FPSE (Finished Project Success Evaluation) consists of:
  o EM (Project success Evaluation Model) for evaluating the new project configuration (described in section 4).
  o FPD+S, the database of finished projects with project success evaluations (section 3).

Figure 1 also shows the element E-DSS administrator, used to upgrade FPSE periodically by upgrading the database of finished projects or making changes in EM according to the system's operational requirements and expected results.

Figure 2: FPD+S development workflow

This paper focuses on the development of FPSE. The workflow is shown in Figure 2, consisting of the following steps:

Step A. Finished projects database preparation (FPD), as described in section 3.
Step B. Project success evaluation model (EM), as described in section 4.
Step C. Finished projects database with EM success scores (FPD+S): the result of the FPSE is the upgraded database of the finished projects with the finished projects' success scores (FPD+S).
3 Data Description

E-DSS is a data-driven system that operates on data from past finished projects. This data was collected in the Elea iC company from the year 2001 to 2023. At the beginning of the data collection, the number of observed variables was relatively small, but it has grown substantially over the years. At the time of this study, the database contained data on 4704 finished projects, described by 39 numeric variables; 6 of them were date/time/year variables, and 2 were categorical variables.

Data preparation (Step A, Figure 2) was carried out as follows:

1. Data cleaning: replacing "NaN" values and deleting erroneous data;
2. Outlier removal using the interquartile range approach [18];
3. Data imputation: replacing the missing values using a descriptive statistic (e.g., mean, median, or most frequent) along each column or using a constant value [19]. We employed the mean strategy.
4. Sensitive data and information removal. For this reason, all numeric data was scaled to a range between 0 and 1.

We ended up with the database FPD containing data on 3132 finished projects described by 27 numeric variables. The variables describe the main project management characteristics, such as project financial results, workload distribution, number of employees, subcontractors, etc. Table 1 shows the list of all variables together with their basic statistics.

Table 1: Basic statistics of the variables after data cleaning, outlier removal, and data scaling

This way, the finished project database (FPD) was prepared for the FPSE component. FPD is the main resource for exploratory data analysis for observing the data and its properties, such as variable correlations, variable information gain, etc. These operations are invoked interactively by the user in the context of NPPE and are not discussed further in this paper.
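The paper cites scikit-learn's SimpleImputer [19] for the imputation step. Assuming the raw data lives in a csv file (the file name and column handling below are illustrative), preparation steps 2 to 4 could be reproduced along these lines:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("finished_projects.csv")  # hypothetical file name
num = df.select_dtypes("number")

# Step 2: outlier removal with the interquartile range rule [18].
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
mask = ((num >= q1 - 1.5 * iqr) & (num <= q3 + 1.5 * iqr)).all(axis=1)
num = num[mask]

# Step 3: mean imputation of missing values [19].
imputed = SimpleImputer(strategy="mean").fit_transform(num)

# Step 4: scaling to [0, 1], which also masks sensitive absolute values.
fpd = pd.DataFrame(MinMaxScaler().fit_transform(imputed),
                   columns=num.columns)
```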
4 Evaluation of Projects' Success

The project success evaluation model (EM), developed in Step B (Figure 2), is aimed at:

• the evaluation of the projects in FPD, resulting in the FPD+S (Figure 2);
• the evaluation of potential new projects suggested through interaction between the user and NPPE+S (Figure 1).

Project success evaluation involves multiple criteria that have to be aggregated into a single evaluation score. Different criteria might be of different importance and affect the score differently, i.e., with different weights. For this purpose, we chose MAUT (Multi-Attribute Utility Theory) [11], a multi-criteria modelling approach that facilitates both hierarchical structuring of criteria and using weights for the aggregation of scores. Considering the above requirements, the available FPD data and other multi-criteria approaches to project evaluation ([15], [16], [17]), we developed the EM as presented in Figure 3.

EM consists of three components [10]: input parameters, evaluation parameters and aggregation functions.

Input parameters are variables in the leaf nodes of the model:

• Project Work Concentration: explains the distribution of the work on the project. If the value is closer to 0 or 1, the majority of the work is done at the beginning or at the end of the project, respectively.
• Time Reserve: explains if the project work ended earlier than defined in the contract.
• Number of Pauses: the number of times the work on the project stopped.
• Pauses Time Share: the ratio between the months the employees did not work and the total number of months.
• Hour Income: the ratio between the project value and the number of work hours necessary to finish the project.

Figure 3: Multi-criteria model for the projects' success evaluation

Evaluation parameters represent outputs of the model:

• WORK DISTRIBUTION: assesses the characteristics of the work distribution over the project duration.
• PROJECT PAUSES: assesses the work pauses.
• PROJECT WORKFLOW: combines the evaluation parameters WORK DISTRIBUTION and PROJECT PAUSES.
• PROJECT FINANCIAL RESULT: assesses the project's success from the financial point of view.
• PROJECT SUCCESS SCORE: the overall success score, determined by aggregation of all subordinate parameters.

Aggregation functions map subordinate EM parameters to the corresponding parent parameters. The weighted average function is employed, using the weights shown in Figure 3. Currently, the weights are chosen to make all parameters equally important.
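A bottom-up weighted-average evaluation of such a tree is straightforward to sketch. Note that the assignment of input parameters to parent parameters below is only our reading of Figure 3 and may not match the actual model; equal weights are assumed, as stated above, and inputs are assumed to be already normalized so that higher means better:

```python
def weighted_avg(values, weights=None):
    """Weighted average; equal weights when none are given."""
    weights = weights or [1 / len(values)] * len(values)
    return sum(v * w for v, w in zip(values, weights))

def evaluate_project(p):
    """Bottom-up evaluation of an EM-like tree (illustrative mapping)."""
    work_distribution = weighted_avg([p["work_concentration"],
                                      p["time_reserve"]])
    project_pauses = weighted_avg([p["number_of_pauses"],
                                   p["pauses_time_share"]])
    project_workflow = weighted_avg([work_distribution, project_pauses])
    financial_result = p["hour_income"]  # single financial input assumed
    return weighted_avg([project_workflow, financial_result])

# Values chosen to mirror the example scores in Section 5.
print(evaluate_project({"work_concentration": 0.8, "time_reserve": 0.7,
                        "number_of_pauses": 0.8, "pauses_time_share": 0.7,
                        "hour_income": 0.29}))  # -> 0.52
```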
5 Experimental Evaluation of FPSE

Figure 4 shows an example of evaluating a project from FPD. Input parameters' values (terminal nodes) were obtained from the database, while evaluation parameters' values (green nodes) were calculated by EM. The example project shows a good workflow score (0.75) but a poor financial score (0.29), both leading to an average success score (0.52) of the project. Several other projects of different types were evaluated in this way, confirming the appropriateness of the EM structure and its conformance with the requirements of potential users. In this way, the quality of EM was assessed on a sample of past projects. Further assessment is planned in the next stages while configuring new projects, where EM's results can be confronted with the opinions of project leaders actively involved in the process.

Figure 4: Example of evaluating a project using EM

EM already enables the evaluation of multiple finished projects. In Step C (Figure 2), FPD was extended by adding five variables corresponding to the five evaluation parameters of EM. All projects in FPD were evaluated by EM, resulting in FPD+S.

Basic statistics of FPD+S are presented by the distribution of the variables in Figure 5. The variables marked with red colour on the x-axis are E-DSS input parameters, the green uppercase variables are those corresponding to success scores, and the blue variables are potential new project parameters. The distribution of the final project evaluation, PROJECT_SUCCESS_SCORE (average = 0.52, min = 0.15, max = 0.94), indicates that it covers the range of possible outcomes well and enables the discrimination and sorting of projects.

Figure 5: Distribution of the FPD+S features, including EM project success assessments

6 Conclusions

E-DSS is a DSS under construction, aimed at supporting the project management and project leaders' process in the new infrastructure project planning phase. We presented the design and development of the FPSE component, consisting of a multi-criteria project success evaluation model EM and a database of projects extended with success evaluation scores, FPD+S.

EM has been developed using the MAUT approach and has turned out to be fit for purpose. It employs the data that is available in the projects' database. It meaningfully describes aspects of the projects' success and offers a practical and functional model for the evaluation of multiple projects in the database.

FPSE is a key decision-support resource for E-DSS. E-DSS will allow the user to interactively search for similar past projects, to filter them according to the success score and to simulate the effects of alternative project configurations, ultimately proposing the best one. Approaches based on unsupervised descriptive analytics (clustering) and supervised machine learning methods for the prediction of E-DSS output parameters are foreseen for this purpose. We have, in fact, already tested hierarchical clustering and decision tree classification methods on FPD+S, and the first results are encouraging: we obtained meaningful clusters of past projects and created decision trees for the prediction of individual output parameters that may lead to high new project success scores.

Future work will primarily continue with further data analysis and data mining of FPD+S, attempting to design effective algorithms for interactive exploration of past projects and suggesting as good as possible configurations of new projects. On this basis, we shall make a detailed functional specification of the NPPE+S component and design/implement the E-DSS. Although the E-DSS considered here is tailor-made for a specific business environment and bound to a specific database, the approach seems general enough to be applied to similar environments, projects and processes [9]. This work is a showcase of the substantial effort needed to prepare a corporate database for decision support, which is often neglected in the literature. The main contribution is a combination of data processing with MAUT-based multi-criteria decision modelling.
References

[1] Fransje L. Hooimeijer, Jeremy D. Bricker, Adam J. Pel, A. D. Brand, Frans H. M. Van de Ven, and Amin Askarinejad. 2022. Multi- and interdisciplinary design of urban infrastructure development. In Proceedings of the Institution of Civil Engineers: Urban Design and Planning. Vol. 175. TU Delft, 153–168.
[2] Simon Christian Becker and Philip Sander. 2023. Development of a Project Objective and Requirement System (PORS) for major infrastructure projects to align the interests of all the stakeholders. In Expanding Underground – Knowledge and Passion to Make a Positive Impact on the World. CRC Press, London, UK, 3369–3376. doi: 10.1201/9781003348030-408.
[3] Michel-Alexandre Cardin, Ana Mijic, and Jennifer Whyte. 2023. Data-driven infrastructure systems design for uncertainty, sustainability, and resilience. In Life-Cycle of Structures and Infrastructure Systems. CRC Press, London, UK, 2565–2572. doi: 10.1201/9781003323020-312.
[4] Saša Žagar. 2016. Organizacijski model v projektivnem podjetju Elea iC d.o.o. B.Sc. Thesis, Maribor. Retrieved July 12, 2024 from https://dk.um.si/IzpisGradiva.php?id=58799&lang=eng.
[5] Elea iC webpage. https://www.elea.si/en/.
[6] Jürg Kuster, Eugen Huber, Robert Lippmann, Alphons Schmid, Emil Schneider, Urs Witschi, and Roger Wüst. 2015. Project Management Handbook. Springer-Verlag, Berlin Heidelberg, Germany.
[7] Anton Hauc. 2007. Projektni management (2nd ed.). GV Založba, Ljubljana, Slovenija.
[8] Harvey A. Levine. 2002. Practical Project Management: Tips, Tactics, and Tools. John Wiley & Sons, Inc., New York, NY.
[9] Nadja Damij and Talib Damij. 2014. Process Management. Springer-Verlag, Berlin Heidelberg, Germany.
[10] Marko Bohanec. 2012. Odločanje in modeli. DMFA – založništvo, Ljubljana, Slovenija.
[11] Salvatore Greco, Matthias Ehrgott, and José Rui Figueira. 2016. Multiple Criteria Decision Analysis: State of the Art Surveys. Springer. doi: 10.1007/978-1-4939-3094-4.
[12] Maria Rashidi, Maryam Ghodrat, Bijan Samali, and Masoud Mohammadi. 2018. Decision support systems. In Management of Information Systems. IntechOpen, London, UK, 19–38. doi: 10.5772/intechopen.79390.
[13] Daniel Joseph Power. 2013. Decision Support, Analytics, and Business Intelligence. Business Expert Press, New York, NY. doi: 10.4128/9781606496190.
[14] Sofiat Abioye, Lukumon Oyedele, Lukman Akanbi, Anuoluwapo Ajayi, Juan Manuel Davila Delgado, Muhammad Bilal, Olugbenga Akinade, and Ashraf Ahmed. 2021. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. Journal of Building Engineering, 44. doi: 10.1016/j.jobe.2021.103299.
[15] Erwin Berghuis. 2018. Measuring Systems Engineering and Project Success. Master's Thesis, University of Twente. https://purl.utwente.nl/essays/75088.
[16] Ali Beiki Ashkezari, Mahsa Zokaee, Amir Aghsami, Fariborz Jolai, and Maziar Yazdani. 2022. Selecting an appropriate configuration in a construction project using a hybrid multiple attribute decision making and failure analysis methods. Buildings, 12, 643. doi: 10.3390/buildings9050112.
[17] Urban Pinter and Igor Pšunder. 2013. Evaluating construction project success with use of the M-TOPSIS method. Journal of Civil Engineering and Management, 19, 1, 16–23. doi: 10.3846/13923730.2012.734849.
[18] Interquartile range. Retrieved May 15, 2024 from https://en.wikipedia.org/wiki/Interquartile_range.
[19] SimpleImputer. Retrieved May 15, 2024 from https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html.
[20] Charu C. Aggarwal. 2015. Data Mining: The Textbook. Springer, New York, USA.
Minimizing Costs and Risks in Demand Response Optimization: Insights from Initial Experiments

Mila Nedić
Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia
mn38120@student.uni-lj.si

Tea Tušar
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
tea.tusar@ijs.si

DOI: https://doi.org/10.70314/is.2024.scai.8587

Abstract
This paper presents a method for changing the energy use of consumers participating in Demand Response (DR) programs, focusing on peak balancing to improve grid stability. Multiple objectives including costs and risks are considered, and a weighted sum is used to transform them into a single objective. This results in an optimization problem that can be optimally solved. To calculate the costs, the load consumption baseline needs to be established. Since this is challenging and can be exploited, we conduct initial experiments to test whether our method to adjust the baseline can be easily manipulated. We explore an original scenario and three of its variants to examine the effects of various parameters on the optimization outcome. Our results indicate that 1) an excessive emphasis on risk results in no energy change, 2) enforcing a net zero energy change minimizes energy use while still securing the rebate, and 3) without an adjustment period, the consumer is less inclined to increase the load just before the demand period. In future work, we will reformulate some objectives to avoid exploitation and better reflect the real-world needs of DR.

Keywords
multiobjective optimization, mixed-integer linear programming, demand response, baseline consumption, electrical grid

1 Introduction
Peaks in energy demand can strain the electrical grid, leading to inefficiencies and potential failures. A widely used strategy for balancing these peaks is Demand Response (DR), in which the Distribution System Operator (DSO) forecasts future peaks and requests from consumers to adjust their energy use to reduce them. In the peak time rebate DR program [2], consumers receive a rebate if they reduce their load in the demand period. On the other hand, if they commit to respond to the demand, but fail to do so, they can be penalized. It is therefore of utmost importance to accurately assess whether and how much a consumer reduced their load to meet the demand.

The load reduction of a consumer is computed as the difference between its baseline (the amount of energy the customer would have consumed without a demand request) and its actual use [2]. The importance of establishing a baseline and the various ways of calculating it are presented in [5]. Common methods for calculating baselines include simple historical data averages, exponential moving averages and short-term load forecasting techniques. However, baselines can be exploited, e.g., when consumers artificially increase consumption before an event to inflate their baseline and maximize the awarded rebate.

Through the SEEDS project (https://project-seeds.eu/), we are developing a methodology for providing energy flexibility services to prosumers – participants in energy markets capable of both producing and consuming energy – in order to enhance grid stability. Machine learning is used to predict the baseline energy usage of prosumers and their flexibility, while mixed-integer linear programming (MILP) is used to optimize the operation of prosumers within their flexibility. Our approach will be tested in the Slovenian pilot, in collaboration with Petrol d.d. and Elektro Celje d.d.

Our work integrates prosumer flexibility into DR optimization, focusing on minimizing costs and risks while limiting energy fluctuations. While the goal is to eventually use this approach on real-world data from the pilot, this paper reports on some initial experiments verifying whether the current problem formulation results in solutions with desired properties. In particular, we wish to test if our adjusted consumer baseline approach can be easily exploited.

Research on prosumer flexibility, optimization techniques, and demand response optimization includes a wide range of approaches [8]. In [3], Balázs et al. quantify residential prosumer flexibility using engineering models and real-world data. Their work provides valuable insight into prosumer behavior and energy management. Capone et al. [4] optimize district energy systems by balancing costs and carbon emissions with genetic algorithms and linear programming, showing significant emission reductions at a modest cost increase. Magalhães and Antunes [7] compare thermal load models in demand response strategies using MILP, finding that discrete control formulations improve computational efficiency. Thus, our methodology is in line with related work, while the actual optimization problem (its variables, objectives and constraints) differs from existing ones as it is adapted to our specific use case.

This paper is further organized as follows. In Section 2, we provide a brief overview of the optimization problem, followed by its detailed definition in terms of its variables, constraints and objectives. The optimization approach is explained in Section 3, where we discuss the scalarization technique used to transform our multi-objective problem into a single-objective MILP form and the method used to solve it. The experiments and their results are given in Section 4. Finally, conclusions and further work ideas are described in Section 5.

2 Optimization Problem
The problem formulation in this work assumes a peak time rebate DR program in which the DSO and the consumer have a contract stipulating the following conditions: 1) the consumer can choose whether to respond to a demand request, 2) if the consumer participates in DR, it receives a rebate proportional to the reduced load, 3) if the consumer participates in DR but does not reduce the load by at least 75 % of the required amount, it is penalized, 4) the load reduction is estimated using an adjusted consumer baseline, which takes into account the forecast consumer energy usage as well as its actual consumption before the demand period.

The optimization task is to set the energy consumption of all loads of a consumer participating in DR, taking into account their flexibility, so that consumer costs, risks and energy fluctuations are minimized. This ensures efficient grid operation while maintaining economic feasibility for the consumer.

To formally define our optimization problem, we first introduce its variables, followed by the constraints and the objective functions we aim to optimize. Finally, we provide an overview of the weighted sum approach, which serves as the scalarization technique to transform all objective values into a single one.
  1 ∑︁  F F A 𝐸 − 𝐸 − 𝐸 , if 𝑛 > 0; The optimization task is to set the energy consumption of 𝑗 A   𝑡 A 𝑗 𝐸 = 𝑛 𝑡 S A all loads of a consumer participating in DR taking into account 𝑗 =𝑡 −𝑛    F their flexibility so that consumer costs, risks and energy fluctua- 𝐸 , otherwise 𝑡  tions are minimized. This ensures efficient grid operation while S E for all intervals 𝑡 ∈ {𝑡 , . . . , 𝑡 } in the demand period. Then, maintaining economic feasibility for the consumer. R the recognized load reduction 𝐸 at demand time interval 𝑡 ∈ To formally define our optimization problem, we first intro- 𝑡 S E duce its variables, followed by the constraints and the objective {𝑡 , . . . , 𝑡 } is determined as functions we aim to optimize. Finally, we provide an overview R A 𝐸 = 𝐸 − 𝐸 , 𝑡 of the weighted sum approach, which serves as the scalarization 𝑡 𝑡 technique to transform all objective values into a single one. R while the total recognized load reduction 𝐸 is computed as 2.1 Variables E 𝑡 ∑︁ R R 𝐸 = 𝐸 . A solution is specified by the energy amounts 𝐸 ∈ 𝑐,𝑖 R for each 𝑡 S consumer load 𝑐 ∈ C and time interval 𝑖 ∈ {1, . . . , 𝑛}. They 𝑡 =𝑡 correspond to the change of consumption from the forecast one. R A rebate is awarded if 𝐸 is negative (the consumption has These are the only variables of this optimization problem. been reduced). If the total recognized load reduction exceeds the From these energy amounts and the forecast timetable of en- T total demanded energy reduction 𝐸 , the rebate is capped, i.e., ergy usage, the resulting energy consumption 𝐸 in time interval 𝑖 𝑖 ∈ {1, . . . , 𝑛 } is computed as ( B R T R 𝑝 min 𝐸 , 𝐸 , if 𝐸 < 0 R ∑︁ 𝑓 = . F 𝐸 = 𝐸 + 𝐸 . 𝑖 𝑖 𝑐,𝑖 0, otherwise 𝑐 ∈ C Finally, a penalty is added to the total costs if the demand 2.2 Constraints has not been met, that is, the ratio between the recognized and D The energy amounts of a solution need to adhere to two kinds of demanded energy reduction, 𝐸 , in any of the demand time S E constraints. The first type are the interval energy constraints: intervals 𝑡 ∈ {𝑡 , . . . , 𝑡 } is lower than 75 %, min max 𝐸 ≤ 𝐸 ≤ 𝐸 , R 𝐸 𝑐,𝑖 𝑐,𝑖 𝑐,𝑖   P | T | 𝑡 S E } P  𝑝 𝐸 , if < 75 % for one or more 𝑡 ∈ {𝑡 , . . . , 𝑡 𝑓 = D . for each consumer load 𝑐 ∈ C and time interval 𝑖 ∈ {1, . . . , 𝑛}. 𝐸  0, otherwise The second are the total energy constraints:  𝑛 The second optimization objective represents risks. In order ∑︁ 𝑓2 𝑇 ,min 𝑇 ,max 𝐸 ≤ 𝐸 ≤ 𝐸 , 𝑐 𝑐,𝑖 𝑐 to penalize any changes to the timetable when the risks are high, 𝑖 =1 the objective function is defined as for each consumer load 𝑐 ∈ C. 𝑛 ∑︁ ∑︁ 𝑓 = 𝑟 2.3 Objective Functions 2 𝑖 𝐸𝑐,𝑖 , 𝑖 =1 𝑐 ∈ C The three objectives to be minimized in this scenario are the where 𝑟 represents the risk at time interval 𝑖 . costs, risks and energy fluctuations. 𝑖 To penalize unnecessary energy fluctuations, the third objec- The first optimization objective 𝑓 consists of all costs associ- 1 tive 𝑓 averages the consecutive changes in energy amounts for ated with the solution and equals 3 all consumer loads, i.e., E R P 𝑓 = − + 1 𝑓 𝑓 𝑓 , 𝑛 1 ∑︁ ∑︁ E R where 𝑓 represents the energy costs, 𝑓 is the rebate for the 𝑓 = − 3 𝐸 𝐸 𝑐,𝑖 𝑐,𝑖 −1 . (𝑛 − 1) |C | P recognized load reduction and 𝑓 is the penalty that is charged 𝑖 =2 𝑐 ∈ C in case the recognized load reduction does not meet the require- 2.4 Weighted Sum Approach ments. E The energy costs 𝑓 equal the sum of energy costs over all Since the optimal solutions to this problem appear to reside in time intervals 𝑖 ∈ {1, . . . 
2.4 Weighted Sum Approach
Since the optimal solutions to this problem appear to reside in the convex region of the objective space, we use a weighted sum approach to transform all objective values into a single one. The single objective function to be minimized thus equals

\[ f = w_1 f_1 + w_2 f_2 + w_3 f_3 \]

under the condition $w_1 + w_2 = 1$. The weight $w_3$ can be set independently of $w_1$ and $w_2$ and serves as a measure of limiting the energy fluctuations.

3 Optimization Approach
3.1 Setting Weights in the Weighted Sum
To obtain diverse solutions with the weighted sum approach, a good strategy for setting the weights is needed. While we plan to use a more sophisticated approach for this purpose in future work, these initial experiments were made by choosing equidistant values of $w_1$ from the interval $[0.1, 1]$ and defining $w_2$ as $1 - w_1$. In order to limit energy fluctuations, we set $w_3$ to $10^{-3}$. Smaller weights proved insufficient in limiting the fluctuations, while larger weights interfered with the first two objectives, which are more important than the third.

3.2 Linearization
Since all of the objective functions specified in Section 2.3 are either non-linear or contain non-linear parts, specific techniques are required to linearize these objectives and ensure the problem fits the MILP form. In particular, it is necessary to linearize the absolute value of a real variable, the product of a binary variable and a real variable, and the minimum of two variables, along with other non-linear function conditions. We use standard approaches to achieve linearization for all these cases [9]. For example, a minimized term $|x|$ can be replaced by an auxiliary variable $a$ with the constraints $a \geq x$ and $a \geq -x$, which is exact whenever the objective pushes $a$ down.

3.3 Tool and Solver
We use the OR-Tools Python library (https://developers.google.com/optimization) to implement and solve the single-objective MILP problem. The library is a comprehensive tool for solving optimization problems, including linear programming, integer programming, and combinatorial optimization. Specifically, we use the SCIP (Solving Constraint Integer Programs) solver [1] integrated within OR-Tools (see https://github.com/google/or-tools/blob/stable/ortools/linear_solver/samples/mip_var_array.py) for solving MILP problem instances.

To solve a MILP problem using OR-Tools and the integrated SCIP solver, the following steps are performed: import the linear solver wrapper, declare the SCIP solver, define the variables with their respective bounds, set the constraints and the objective function, and lastly, solve the problem and analyze and display the solution. A sketch of these steps is given below.
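The following sketch instantiates these steps on a deliberately stripped-down instance: a single load with the interval bounds of the basic scenario, a zero total energy change constraint, and only the energy-cost and risk terms in the objective, with $|E_i|$ linearized through auxiliary variables. The rebate and penalty terms of the full formulation are omitted, so this is an illustration of the tooling, not the authors' complete model.

```python
from ortools.linear_solver import pywraplp

# Import the linear solver wrapper and declare the SCIP solver.
solver = pywraplp.Solver.CreateSolver("SCIP")

n = 28  # 15-minute intervals, as in the basic scenario

# Variables with their respective bounds: load change per interval.
E = [solver.NumVar(-3.0, 3.0, f"E_{i}") for i in range(n)]

# Constraint: zero total energy change (as in scenario variants #2 and #3).
solver.Add(solver.Sum(E) == 0.0)

# Linearize |E_i| with auxiliary variables a_i >= E_i and a_i >= -E_i,
# which is exact because the a_i are minimized in the objective.
a = [solver.NumVar(0.0, solver.infinity(), f"a_{i}") for i in range(n)]
for i in range(n):
    solver.Add(a[i] >= E[i])
    solver.Add(a[i] >= -E[i])

# Toy weighted-sum objective: energy costs plus risk-weighted |E_i|.
price, risk, w1, w2 = 0.25, 0.50, 0.6, 0.4
solver.Minimize(solver.Sum([w1 * price * E[i] + w2 * risk * a[i]
                            for i in range(n)]))

# Solve, then analyze and display the solution.
if solver.Solve() == pywraplp.Solver.OPTIMAL:
    print("objective:", solver.Objective().Value())
    print("changes:  ", [round(x.solution_value(), 2) for x in E])
```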
4 Experiments
We first conduct experiments using a basic scenario with a single consumer load. Then, we vary some parameters of this scenario to see how they affect the resulting solutions.

4.1 Experimental Setup
The basic scenario has the following parameters:
• Time is represented as 28 15-minute intervals.
• The demand period starts at $i = 13$ and ends at $i = 16$.
• The total required reduction $E^{\mathrm{T}}$ equals $-8$ kWh and the required reduction $E^{\mathrm{D}}$ at each interval equals $-2$ kWh.
• The adjustment period has a duration of four intervals.
• The load change needs to be within $[-3\ \mathrm{kWh}, 3\ \mathrm{kWh}]$ for each interval $i = 5, 6, \dots, 24$ and is fixed to 0 kWh for the remaining intervals.
• The forecast timetable energy $E^{\mathrm{F}}_i$ is constant and equals 12 kWh for all time intervals.
• The total energy constraint is unbounded.
• The risk equals 0.50 for all time intervals.
• All prices are constant: $p_i = 0.25$ EUR, $p^{\mathrm{B}} = 0.50$ EUR and $p^{\mathrm{P}} = 1.00$ EUR.

The three scenario variants differ from the basic one as follows. The first scenario variant has no demand. In the second and third scenario variants, the total energy change is set to 0 kWh, ensuring the reduction in energy consumption in some intervals is matched with its increase in others. Additionally, the third scenario variant has no adjustment period, i.e., $n^{\mathrm{A}} = 0$.

4.2 Results and Discussion
We discuss here the results of our original scenario and its three variants. They are also depicted in the plots in Figures 1 to 4, which show with a black line how the consumer load changes from its planned timetable. Consumer load flexibility at each time interval is shown in gray (there is no flexibility in the first four and last four intervals). The demand period is denoted in red and the adjustment period in blue. In most cases (unless the risk has a large weight), the consumer reduces the load in the demand period enough to meet the required demand and earn the entire available rebate while not incurring any penalty. The amount of this reduction and the energy change outside of this period differ for the various scenario variants.

4.2.1 Original Scenario. When the risk has a large weight, the load does not change outside of the demand period (see the top plot in Figure 1). However, when the impact of risk is minimal (bottom plot in Figure 1), the load is reduced everywhere except during the adjustment period. This strategy artificially increases the perceived load reduction to maximize the rebate, as dictated by the rebate calculation formula.

4.2.2 Scenario Variant #1: No Demand. If the optimization is called without a demand, the result depends on the weighting of the first two objectives. As long as the impact of risk is significant (top plot in Figure 2), the load does not change. Otherwise, the load is reduced to the maximum extent in each interval (bottom plot in Figure 2). This approach minimizes the function $f^{\mathrm{E}}$, therefore reducing costs. This means that the consumer behavior can change when optimized even if no demand is present.

4.2.3 Scenario Variant #2: Zero Total Energy Change. Due to the zero energy constraint, the consumer makes adjustments solely within the demand and adjustment periods (see Figure 3). During the adjustment period, the user offsets the consumption from the demand period, thereby achieving a maximal rebate. To adhere to the requirement of minimizing risks and fluctuations in other intervals, no additional changes are made, as such actions would increase the objective value.

4.2.4 Scenario Variant #3: Zero Total Energy Change and No Adjustment Period. When the baseline is not adjusted, the load is increased in intervals outside of the demand period, regardless of whether they occur before or after it. The specific intervals when this happens depend on the solver and are random, as they lead to the same objective function value. An example of such a case is depicted in Figure 4. The last two variants additionally confirm that the usage of the adjustment period enables exploitation – the entire rebate can be gained with a smaller load reduction in the demand period if the load is increased in the adjustment period.
5 Conclusions
This paper focuses on demand response optimization and the growing role of prosumers in energy systems. A standard MILP framework is used to set the consumer load energies within their flexibility so that the costs, risks and energy fluctuations are all minimized. Since the objectives are scalarized with the weighted sum approach, correctly setting their weights is crucial for generating a set of diverse solutions representing various trade-offs between costs and risks.

By creating three scenario variants, we were able to explore the effect of some parameters on the optimization outcome. We observe that:
• Regardless of the variant, the optimal load schedule does not deviate from the forecast one if the importance of risk is too high, i.e., if the weight $w_2$ is too large. This critical value of $w_2$ depends on the scenario variant.
• If the consumer is obliged to a zero sum in load increase and reduction, the optimal solution uses the minimal necessary resources to earn a rebate while avoiding excessive energy changes.
• When the adjustment period is unspecified, the prosumer is less likely to increase the load just before the demand period.

Moving forward, we need to refine the objectives. The current method to assess the baseline consumption is susceptible to exploitation and should be amended. We could calculate the consumer baseline from similar consumers that do not participate in DR, as suggested in [6]. We will also need to revise the penalty calculation to account for the imminent change of tariffs in the Slovenian energy market. We additionally plan to improve the calculation of risks to ensure more robust optimization and real-world applicability. Finally, we intend to develop a better strategy for setting the weights, targeting values with the most significant impact rather than evenly distributing them.

Figure 1: Results for the original scenario with $w_1 = 0.6$ and $w_2 = 0.4$ (top) and $w_1 = 0.8$ and $w_2 = 0.2$ (bottom).

Figure 2: Results for the variant without demand with $w_1 = 0.5$ and $w_2 = 0.5$ (top) and $w_1 = 0.7$ and $w_2 = 0.3$ (bottom).

Figure 3: Results for the variant with zero total energy change with $w_1 = 0.6$ and $w_2 = 0.4$.

Figure 4: Results for the variant with zero total energy change and no adjustment period with $w_1 = 0.6$ and $w_2 = 0.4$.

Acknowledgements
The SEEDS project is co-funded by the European Union's Horizon Europe innovation actions programme under the Grant Agreement n°101138211. The authors acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P2-0209). The authors wish to thank Bernard Ženko, Martin Žnidaršič and Aljaž Osojnik for helpful discussions when shaping this work.

References
[1] Tobias Achterberg. 2009. SCIP: Solving constraint integer programs. Mathematical Programming Computation, 1, 1–41. doi: 10.1007/s12532-008-0001-1.
[2] AEIC Load Research Committee. 2009. Demand response measurement & verification: Applications for load research. Tech. rep. AEIC Load Research Committee.
[3] István Balázs, Attila Fodor, and Attila Magyar. 2021. Quantification of the flexibility of residential prosumers. Energies, 14, 4860. doi: 10.3390/en14164860.
[4] Martina Capone, Elisa Guelpa, and Vittorio Verda. 2021. Multi-objective optimization of district energy systems with demand response. Energy, 227, 120472. doi: 10.1016/j.energy.2021.120472.
[5] Antonio Gabaldón, Ana García-Garre, María Carmen Ruiz-Abellón, Antonio Guillamón, Carlos Álvarez-Bel, and Luis Alfredo Fernandez-Jimenez. 2021. Improvement of customer baselines for the evaluation of demand response through the use of physically-based load models. Utilities Policy, 70, 101213. doi: 10.1016/j.jup.2021.101213.
[6] Joe Glass, Stephen Suffian, Adam Scheer, and Carmen Best. 2022. Demand response advanced measurement methodology: Analysis of open-source baseline and comparison group methods to enable CAISO demand response resource performance evaluation. Tech. rep. California Independent System Operator (CAISO).
[7] Pedro L. Magalhães and Carlos Henggeler Antunes. 2020. Comparison of thermal load models for MILP-based demand response planning. In Sustainable Energy for Smart Cities. Springer International Publishing, Cham, 110–124.
[8] Javier Parra-Domínguez, Esteban Sánchez, and Ángel Ordóñez. 2023. The prosumer: A systematic review of the new paradigm in energy and sustainable development. Sustainability, 15, 13. doi: 10.3390/su151310552.
[9] Nace Sever. 2022. Časovno razporejanje terenskih nalog z mešanim celoštevilskim linearnim programiranjem. Bachelor's Thesis. University of Ljubljana, Faculty of Mathematics and Physics. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=140427.

Predicting Hydrogen Adsorption Energies on Platinum Nanoparticles and Surfaces with Machine Learning

Lea Gašparič (lea.gasparic@ijs.si), Anton Kokalj (tone.kokalj@ijs.si), Sašo Džeroski (saso.dzeroski@ijs.si)
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

DOI: https://doi.org/10.70314/is.2024.scai.8689

Abstract
The growing interest in hydrogen gas as a fuel drives research into environmentally friendly hydrogen production methods. One viable approach of obtaining hydrogen is the electrocatalysis of water, which includes the hydrogen evolution reaction (HER) as one of the half-reactions. In the search of highly active catalysts for the HER, machine learning can be effectively utilized to develop models for calculating hydrogen adsorption energy, a key descriptor of catalytic activity.
In this study, we learned models for predicting hydrogen adsorption energy on platinum. We used various machine-learning (ML) techniques on two datasets, one for extended surfaces and the other for nanoparticles. The respective results reveal that ML models for extended surfaces are more accurate than those for nanoparticles, and that the features describing the local environment are the most significant for the predictions. For surfaces, the coordination number is the most relevant feature, while the d-band center is the most important for nanoparticles. The ML models developed in this study lack sufficient accuracy to provide reliable results, highlighting the need for further investigation with additional features or larger datasets.

Keywords
platinum, hydrogen, DFT calculations, decision trees, feature ranking

1 Introduction
A lot of scientific and societal interest is devoted to hydrogen fuel, which can generate electrical power by producing water as a byproduct. One environmentally friendly method of producing hydrogen is through the electrocatalysis of water, where hydrogen and oxygen gases are formed. This process involves two reactions: the oxygen and hydrogen evolution reactions. Considerable effort is being directed towards improving catalysts for both reactions and understanding the fundamental processes involved [21, 13]. In this contribution, we will focus on the hydrogen evolution reaction (HER), for which platinum is known to be a highly active catalyst due to its near-optimal hydrogen adsorption free energy [15, 21]. However, the high cost of platinum motivates ongoing research of alternative materials.

The mechanism of HER includes an adsorbed hydrogen atom (H*) as an intermediate. Consequently, the adsorption energy of hydrogen is often used as a descriptor of the catalytic activity of the material [15, 21]. The most straightforward approach to obtain the adsorption energies is with density-functional theory (DFT) calculations. However, as the size of the system and the number of different adsorption sites increase, a full DFT analysis becomes computationally unfeasible. To address this challenge, machine-learning methods can be employed to predict hydrogen adsorption energies based on DFT results, enabling the investigation of more complex systems [10]. For example, bimetallic nanoparticles were investigated by Jäger et al. [8], and Zhang et al. investigated amorphous systems [20].

This contribution focuses on the use of machine learning for predicting hydrogen adsorption energies on platinum using electronic and geometric descriptors. Two separate datasets were constructed, one for surfaces and the other for nanoparticles. By employing supervised learning and attribute ranking, we built ML models, assessed their accuracy and analyzed whether the two datasets exhibit similar correlations. The idea of the contribution is illustrated in Figure 1.

Figure 1: Supervised machine learning and feature ranking was performed for hydrogen adsorption energy on platinum catalysts modeled as surfaces and nanoparticles.

2 Materials and Methods
2.1 DFT Calculations and Datasets
We utilized DFT calculations to calculate hydrogen adsorption energies (the target variable for ML) and electronic descriptors for ML. We also utilized geometric descriptors. Two datasets were constructed, one for platinum nanoparticles and the other for platinum surfaces.

DFT calculations were performed with the Perdew-Burke-Ernzerhof (PBE) approximation [17], a plane-wave basis set, and PAW pseudopotentials [3].
Energy cutoffs were set to 50 and 575 Ry for wavefunctions and electron density, respectively. Methfessel-Paxton smearing [12] of 0.02 eV was employed. Pt(111), Pt(100), and Pt(110) surface slab models were constructed with the calculated lattice parameter of bulk Pt (3.97 Å). The models of the Pt(111) and Pt(100) surfaces consist of 4 atomic layers, with the bottom layer fixed to bulk positions, while Pt(110) has 6 atomic layers with the bottom two layers fixed. To achieve a greater variety of adsorption sites, Pt(111) and Pt(100) were also modeled with a missing-row defect. All surface models are shown in Figure 2. Calculations accounted for the dipole correction, and periodic images of slabs were separated by at least 15 Å of vacuum. Different sizes of surface supercells were used, and the k-point grids for (1×1) surface unit cells of Pt(111), Pt(100), and Pt(110) were 12×12×1, 11×11×1, and 11×8×1, respectively. For larger supercells, the number of k-points was adapted accordingly.

Figure 2: Models of extended surfaces used to calculate hydrogen adsorption energies.

Calculations with nanoparticles were performed with the gamma k-point and the Martyna-Tuckerman correction for isolated systems [11]. Nanoparticles were modeled with different shapes and sizes, consisting of 3 and up to 116 atoms. Their periodic images were separated by at least 15 Å of vacuum. All calculations were performed with the Quantum ESPRESSO package [5].

The hydrogen adsorption energy was calculated as:

\[ E_{\mathrm{ads}} = E_{\mathrm{H}^*} - E_{*} - \tfrac{1}{2} E_{\mathrm{H}_2} \quad (1) \]

where $E_{\mathrm{H}^*}$ is the calculated energy of the optimized adsorption system, $E_{*}$ is the energy of the standalone platinum system, and $E_{\mathrm{H}_2}$ is the energy of the hydrogen molecule. All performed calculations included only one adsorbed H atom per supercell or nanoparticle.
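As a toy numerical example of Equation (1), the following snippet combines three made-up DFT total energies into an adsorption energy; the values are hypothetical and serve only to illustrate the bookkeeping.

```python
# Hypothetical DFT total energies in eV for one adsorption configuration.
E_H_star = -31052.48   # platinum system with one adsorbed H atom
E_star   = -31035.72   # the clean (standalone) platinum system
E_H2     = -31.68      # an isolated H2 molecule

# Equation (1): E_ads = E_H* - E_* - (1/2) E_H2
E_ads = E_H_star - E_star - 0.5 * E_H2
print(f"E_ads = {E_ads:.2f} eV")   # -0.92 eV; negative means favourable adsorption
```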
As an electronic descriptor, we used the d-band center, which is considered to be a good indicator of metal reactivity [6]. It was obtained through DFT calculations using the following equation:

\[ \varepsilon_{\mathrm{d}} = \frac{\int_{-\infty}^{\infty} n_{\mathrm{d}}(E)\, E\, \mathrm{d}E}{\int_{-\infty}^{\infty} n_{\mathrm{d}}(E)\, \mathrm{d}E} \quad (2) \]

where $E$ is the energy and $n_{\mathrm{d}}$ is the projected density of states on the d-orbitals of the atoms forming the adsorption site.

For the geometric descriptors, we determined the average coordination number of the Pt atoms forming the adsorption site, as well as the generalized coordination number (GCN) of the adsorption site [2], calculated as:

\[ \mathrm{GCN}(i) = \sum_{j=1}^{N_i} \frac{\mathrm{CN}(j)}{\mathrm{CN}_{\max}} \quad (3) \]

where $i$ denotes an atom or a group of atoms forming the adsorption site, $N_i$ is the number of first nearest neighbors of $i$, which are denoted with $j$, $\mathrm{CN}(j)$ is the coordination number of atom $j$, and $\mathrm{CN}_{\max}$ is the maximal coordination of a given site found in the bulk material.

In addition, the type of adsorption site was used as a descriptor. For extended surfaces, the coverage of H atoms, the surface area per H atom and the surface type were also used for learning. For nanoparticles, some descriptors relevant to the size of the nanoparticles were also utilized, in particular: the number of all atoms in the nanoparticle ($N_{\mathrm{all}}$), the number of surface atoms ($N_{\mathrm{surf}}$), the maximal ($r_{\max}$) and minimal ($r_{\min}$) distances from the center of the nanoparticle to the surface atoms, and the distance from the center of the nanoparticle to the adsorption site ($r_{\mathrm{ads}}$). The datasets for surfaces and nanoparticles contained 46 and 85 data points, respectively.
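A short sketch of how the two key descriptors can be computed is given below, assuming a d-projected density of states on a uniform energy grid (Equation 2) and a precomputed first-nearest-neighbor list (Equation 3). The input data are toy stand-ins, and $\mathrm{CN}_{\max}$ must be chosen per site type (e.g., 12 for a top site on an fcc metal).

```python
import numpy as np

# Toy d-projected density of states n_d(E) on a uniform energy grid (eV).
energies = np.linspace(-10.0, 5.0, 1501)
pdos_d = np.exp(-0.5 * ((energies + 2.5) / 1.5) ** 2)

# Equation (2): d-band center as the first moment of n_d(E).
eps_d = np.trapz(pdos_d * energies, energies) / np.trapz(pdos_d, energies)

def generalized_cn(site_atoms, neighbors, cn_max=12):
    """Equation (3): generalized coordination number of an adsorption site.

    site_atoms: indices of the atoms forming the site
    neighbors:  dict mapping atom index -> list of first nearest neighbors
    """
    first_shell = {j for i in site_atoms for j in neighbors[i]
                   if j not in site_atoms}
    return sum(len(neighbors[j]) for j in first_shell) / cn_max

# Toy cluster described only by its neighbor list; the site is atom 0 (top site).
nbrs = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(f"d-band center: {eps_d:.2f} eV, GCN: {generalized_cn([0], nbrs):.2f}")
```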
2.2 Machine-Learning Methods
The prepared datasets were analyzed using the Weka software package [4]. The target value in both datasets is the hydrogen adsorption energy, making this a regression task. Supervised machine learning was employed to develop models for predicting the target value, which were evaluated by 10-fold cross-validation.

One of the used methods is linear regression, which computes the linear relationship between the target value and the descriptors. The relevant descriptors included in the equation were selected according to the M5 method [18]. This method iteratively removes the descriptors with the smallest effect on the model until the error of the model no longer decreases.

We also used the random forest method [7, 1] with 100 trees of unlimited depth. With this method, multiple decision trees were constructed by selecting relevant features from a random subset of $\mathrm{int}(\log_2(m) + 1)$ features, where $m$ is the total number of features. The final values are the averages of the predictions from the individual trees.

To obtain an explainable ML model, we also built regression trees using the M5' method [18, 19]. In this method, trees are built by splitting the training sets according to the attributes that maximize the standard deviation reduction. After the trees are constructed, they are pruned to avoid overfitting and smoothed to address discontinuities between the leaves. For our datasets, we used unpruned trees to prevent the formation of trees that are too small and give poor predictions. We also restricted tree branching to a minimum of 6 instances per leaf node for surfaces and 20 for nanoparticles to avoid overfitting the data and to ensure trees of sufficient size.

We also performed variable importance estimation and ranking for our selected descriptors, with all data points used as a test set. To evaluate the importance of the descriptors with respect to the hydrogen adsorption energy, we employed two methods: ReliefF [9] and correlation [16]. The ReliefF method is more sensitive to feature interactions and works by calculating the distances between training instances and identifying the 'nearest hit' and 'nearest miss'. It then adjusts the weights of the descriptors that differ between the target and nearest instances. The correlation method evaluates the Pearson correlation coefficient [16] between the features and the target variable, without accounting for interactions between features. It gives scores ranging from −1 to 1, with 1 being the highest correlation score; a score of −1 indicates anti-correlation, and 0 indicates no correlation.
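The study itself used Weka; purely as an analogous sketch, the same regression-with-10-fold-cross-validation workflow can be expressed with scikit-learn as below, with a synthetic array standing in for the descriptor table (the real datasets have 46 and 85 rows).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the surface dataset: 46 sites, 8 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(46, 8))
y = rng.normal(loc=-0.4, scale=0.2, size=46)       # "adsorption energies" (eV)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
models = [
    ("linear regression", LinearRegression()),
    # max_features="log2" loosely mirrors Weka's int(log2(m) + 1) subset size.
    ("random forest", RandomForestRegressor(n_estimators=100,
                                            max_features="log2",
                                            random_state=0)),
]
for name, model in models:
    rmse = -cross_val_score(model, X, y, cv=cv,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {rmse.mean():.3f} eV")
```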
3 Results and Discussion
3.1 Machine-Learning Models
Supervised machine learning was performed using linear regression, random forest, and the M5' regression tree. The obtained Pearson's correlation coefficients and root mean squared errors (RMSE) between true and predicted values are shown in Table 1.

Table 1: Pearson's correlation coefficients (CC) and root mean squared errors (RMSE) in eV units for all three used ML methods. For comparison, the RMSE of the default predictor is also given.

                     Surfaces         Nanoparticles
                     CC     RMSE      CC     RMSE
linear regression    0.71   0.13      0.38   0.22
random forest        0.69   0.13      0.34   0.22
M5' decision tree    0.49   0.19      0.34   0.22
default predictor    /      0.18      /      0.23

We can observe that not all ML models provide better RMSE values compared to those calculated with a simple arithmetic average, referred to as the default predictor. For surfaces, linear regression and random forest perform the best and yield similar results. The regression tree model performs the worst and has a higher RMSE compared to the default predictor. For nanoparticles, all methods yield errors close to those of the default predictor and correlation coefficients below 0.5.

The obtained results indicate that with the selected descriptors, the hydrogen adsorption energies are more accurately predicted on surfaces, which are simpler compared to nanoparticles. Surfaces have high symmetry and only a handful of different adsorption sites, while nanoparticles have different shapes and sizes, consist of different facets, and each nanoparticle has numerous different adsorption sites. This gives a huge variety of adsorption sites that can make the prediction of adsorption energies harder.

Considering the best models, the obtained adsorption energies have an error of ±0.13 eV for surfaces and ±0.22 eV for nanoparticles. Due to the exponential dependence of the reaction rate on the adsorption energy, even a small error in adsorption energy hugely affects the reaction rate. Hence, the models, particularly for nanoparticles, do not provide sufficiently accurate results for any practical use.

The selected ML models also provide insights into the relations between the considered features and the target variable. The linear regression model for nanoparticles includes only the d-band center and a factor for the hollow adsorption site, whereas the equation for surfaces is more complex. It includes the adsorption site, surface type, and both coordination numbers. This indicates that for nanoparticles, the d-band center is the most relevant factor, while for surfaces, geometric factors exhibit greater predictive value. The regression-tree models shown in Figure 3 have lower accuracy and, consequently, are less reliable.

Figure 3: Schematic representation of the obtained regression-tree models for ideal surfaces and nanoparticles. Nodes are denoted with orange, and the resulting classes are represented with turquoise circles and include the number of data points in the class.

The ML models could be improved by expanding the dataset or by calculating additional descriptors. For surfaces, more data can be obtained through calculations on a wider variety of surface types and by accounting for different surface defects. However, expanding the dataset for nanoparticles is limited by their size, since DFT calculations for larger particles are computationally too demanding. Therefore, a larger number of different smaller particles can be tested instead. Using more sophisticated descriptors such as atom-centered symmetry functions, smooth overlap of atomic positions and the many-body tensor representation could also improve the results, but would require different sampling of adsorption structures. The use of transfer learning from pre-trained models based on chemical structures could also lead to significant improvements.

3.2 Feature Ranking
Feature ranking was performed for both surfaces and nanoparticles, with the results presented in Figure 4. The ReliefF and correlation importance criteria provide different rankings of features. For surfaces, the coordination number is identified as the most relevant descriptor, followed by the generalized coordination number. In contrast, for nanoparticles, the d-band center is the most important descriptor. Features describing the size of different nanoparticles show lower relevance for predictions. The most relevant features in both data sets describe the local environment of the adsorption site, indicating the local nature of adsorption.

Figure 4: Variable importance scores calculated by the ReliefF and correlation criteria. Importance scores for the correlation criteria are given as absolute values.

The importance of the d-band center is already well-documented in the literature [14], as it correlates with the reactivity of metals. As seen from the graphs, the d-band center is not so strongly correlated with the hydrogen binding energy on surfaces. This can be attributed to the fact that on a perfectly flat surface, all surface atoms have the same d-band center. In contrast, on nanoparticles, the d-band center varies for each adsorption site because the atoms are not equivalent. Therefore, the d-band center is expected to be more relevant for nanoparticles.
For the ranking based on correlation, the calculated factors for the d-band center are negative. This indicates that a lower d-band center corresponds to a higher adsorption energy and consequently a less reactive site, which is physically intuitive.

It is also interesting to note that the surface type descriptor is not very relevant according to correlation, yet it becomes the second most important feature when other descriptors are considered. This can be attributed to the fact that this descriptor has the same value for all adsorption sites on the same surface. However, when combined with other descriptors, it can give additional information, as similar adsorption sites on different surfaces can yield considerably different adsorption energies.

4 Conclusion
We applied different ML techniques to predict the adsorption energy of hydrogen on platinum surfaces and nanoparticles using simple geometric and electronic descriptors. Models for predicting the adsorption energy on surfaces performed better, with the linear regression and random forest methods showing the highest correlation coefficient and accuracy. In contrast, predictions for nanoparticles yielded lower correlation coefficients and accuracy similar to the one calculated by a default predictor. Therefore, the models presented in this contribution do not provide an accurate estimation of hydrogen adsorption energies. Utilizing more sophisticated descriptors and larger training data sets could enhance the performance of these models.

Differences between the datasets are also evident in the feature ranking. For surfaces, coordination numbers are the most relevant descriptors, while for nanoparticles, the d-band center shows the highest relevance. All these relevant descriptors are related to the local environment of the adsorption site, indicating that adsorption is a local phenomenon.

References
[1] Leo Breiman. 2001. Random forests. Machine Learning, 45, 1, 5–32. doi: 10.1023/A:1010933404324.
[2] Federico Calle-Vallejo, José I. Martínez, Juan M. García-Lastra, Philippe Sautet, and David Loffreda. 2014. Fast prediction of adsorption properties for platinum nanocatalysts with generalized coordination numbers. Angew. Chem. Int. Ed., 53, 32, 8316–8319. doi: 10.1002/anie.201402958.
[3] Andrea Dal Corso. 2014. Pseudopotentials periodic table: From H to Pu. Comput. Mater. Sci., 95, 337–350. (Files: H.pbe-kjpaw_psl.1.0.0.UPF, Pt.pbe-n-kjpaw_psl.1.0.0.UPF.) doi: 10.1016/j.commatsci.2014.07.043.
[4] Eibe Frank, Mark A. Hall, and Ian H. Witten. 2016. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Fourth Edition. Morgan Kaufmann. https://ml.cms.waikato.ac.nz/weka/Witten_et_al_2016_appendix.pdf.
[5] Paolo Giannozzi et al. 2009. Quantum ESPRESSO: a modular and open-source software project for quantum simulations of materials. J. Phys.: Condens. Matter, 21, 39, 395502. Code available from http://www.quantum-espresso.org/. doi: 10.1088/0953-8984/21/39/395502.
[6] Bjørk Hammer and Jens K. Nørskov. 1995. Electronic factors determining the reactivity of metal surfaces. Surf. Sci., 343, 3, 211–220. doi: 10.1016/0039-6028(96)80007-0.
[7] Tin Kam Ho. 1995. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Vol. 1. IEEE, 278–282.
[8] Marc O. J. Jäger, Yashasvi S. Ranawat, Filippo Federici Canova, Eiaki V. Morooka, and Adam S. Foster. 2020. Efficient machine-learning-aided screening of hydrogen adsorption on bimetallic nanoclusters. ACS Comb. Sci., 22, 12, 768–781. doi: 10.1021/acscombsci.0c00102.
[9] Igor Kononenko, Edvard Šimec, and Marko Robnik-Šikonja. 1997. Overcoming the myopia of inductive learning algorithms with RELIEFF. Applied Intelligence, 7, 1, 39–55. doi: 10.1023/A:1008280620621.
[10] Jin Li et al. 2023. Machine learning-assisted low-dimensional electrocatalysts design for hydrogen evolution reaction. Nano-Micro Lett., 15, 1, 227. doi: 10.1007/s40820-023-01192-5.
[11] Glenn J. Martyna and Mark E. Tuckerman. 1999. A reciprocal space based method for treating long range interactions in ab initio and force-field-based calculations in clusters. J. Chem. Phys., 110, 6, 2810–2821. doi: 10.1063/1.477923.
[12] Michael Methfessel and Anthony Thomas Paxton. 1989. High-precision sampling for Brillouin-zone integration in metals. Phys. Rev. B, 40, 6, 3616–3621. doi: 10.1103/PhysRevB.40.3616.
[13] Bishnupad Mohanty, Piyali Bhanja, and Bikash Kumar Jena. 2022. An overview on advances in design and development of materials for electrochemical generation of hydrogen and oxygen. Mater. Today Energy, 23, 100902. doi: 10.1016/j.mtener.2021.100902.
[14] Anders Nilsson, Lars G. M. Pettersson, Bjørk Hammer, Thomas Bligaard, Claus Hviid Christensen, and Jens K. Nørskov. 2005. The electronic structure effect in heterogeneous catalysis. Catal. Lett., 100, 3, 111–114. doi: 10.1007/s10562-004-3434-9.
[15] Jens Kehlet Nørskov, Thomas Bligaard, Ashildur Logadottir, John R. Kitchin, Jingguang G. Chen, Stanislav Pandelov, and Ulrich Stimming. 2005. Trends in the exchange current for hydrogen evolution. J. Electrochem. Soc., 152, 3, J23. doi: 10.1149/1.1856988.
[16] Karl Pearson. 1895. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.
[17] John P. Perdew, Kieron Burke, and Matthias Ernzerhof. 1996. Generalized gradient approximation made simple. Phys. Rev. Lett., 77, 18, 3865–3868. doi: 10.1103/PhysRevLett.77.3865.
[18] John R. Quinlan et al. 1992. Learning with continuous classes. In 5th Australian Joint Conference on Artificial Intelligence, Vol. 92. World Scientific, 343–348.
[19] Yong Wang and Ian H. Witten. 1997. Inducing model trees for continuous classes. In Proceedings of the Ninth European Conference on Machine Learning, Vol. 9. 128–137.
[20] Jiawei Zhang, Peijun Hu, and Haifeng Wang. 2020. Amorphous catalysis: machine learning driven high-throughput screening of superior active site for hydrogen evolution reaction. J. Phys. Chem. C, 124, 19, 10483–10494. doi: 10.1021/acs.jpcc.0c00406.
[21] Jing Zhu, Liangsheng Hu, Pengxiang Zhao, Lawrence Yoon Suk Lee, and Kwok-Yin Wong. 2020. Recent advances in electrocatalytic hydrogen evolution using nanoparticles. Chem. Rev., 120, 2, 851–918. doi: 10.1021/acs.chemrev.9b00248.
SmartCHANGE Risk Prediction Tool: Demonstrating Risk Assessment for Children and Youth

Marko Jordan, Nina Reščič, Sebastjan Kramar, Marcel Založnik, Mitja Luštrek
Jožef Stefan Institute, Department of Intelligent Systems, Ljubljana, Slovenia (Nina Reščič and Mitja Luštrek also: Jožef Stefan International Postgraduate School, Ljubljana, Slovenia)
marko.jordan@ijs.si, nina.rescic@ijs.si, sebastjan.kramar@ijs.si, marcel.zaloznik@ijs.si, mitja.lustrek@ijs.si

DOI: https://doi.org/10.70314/is.2024.scai.8844

Abstract
Non-communicable diseases (NCDs) have become a significant public health challenge in developed countries, driven by common risk factors such as obesity, low physical activity, and unhealthy lifestyle choices. Early childhood and adolescence are crucial for establishing healthy behaviours, and early intervention can play a crucial role in preventing or delaying the onset of NCDs later in life. However, current tools for identifying high-risk individuals are primarily designed for adults, which results in missed early detection opportunities in younger populations. The SmartCHANGE project (https://smart-change.eu/) seeks to bridge this gap by developing reliable AI tools that assess risk factors in children and adolescents as accurately as possible while promoting optimized risk reduction strategies.

In developing the risk assessment tool, we addressed the challenge of merging diverse datasets, predicting missing data to create longitudinal datasets, implementing existing validated models for diabetes (QxMD) and cardiovascular disease (SCORE2), and ultimately creating a simple online application to demonstrate the functionality of the developed risk tool.

Keywords
risk tool, dataset merge, neural networks, online application

1 Introduction
In developed countries, non-communicable chronic diseases (NCDs) have emerged as the foremost public health challenge over recent decades. According to the World Health Organization (WHO), NCDs account for more than 70% of mortality in the European region [18]. Common risk factors for NCDs include obesity, poor physical fitness, and unhealthy lifestyle habits such as inadequate physical activity, sedentary behaviour, poor nutrition, insufficient sleep, smoking, and excessive alcohol consumption. Embracing a healthy lifestyle can improve physical, social, and mental well-being, especially among youth, while mitigating the risks of NCD-related morbidity and mortality [15], [14], [5].

Traditionally, clinical prevention strategies for NCDs have been directed at adults, as the risk factors typically become evident in adulthood. However, recent evidence suggests that focusing interventions on children and adolescents can be a more effective strategy for reducing NCD risk through behaviour modification [13]. While NCDs may not appear in childhood or adolescence, early signs can already be present. Tackling risk factors and promoting healthy habits during these stages can prevent or delay NCDs later in life [12]. Childhood and youth are also crucial periods for establishing healthy lifestyle habits. Since risk factors for NCDs often persist from childhood into adulthood [9], early risk assessment and reduction of risk factors can potentially lower the incidence of NCDs.
Lastly, NCDs in youth are a significant global health challenge, with nearly one in five adolescents worldwide being overweight or obese [1].

Identifying high-risk individuals for future health problems is essential for targeted preventive interventions. Existing tools focus mainly on adults [6], for instance predicting the 10-year risk of developing cardiovascular disease [17] or diabetes [8], missing the opportunity to identify high-risk individuals during childhood and adolescence, a critical period for forming lifestyle habits.

However, recognition of health risks is not a trivial task. For instance, only 35% of doctors in the UK are aware of the recommendations for physical activity, and only 13% can specify the recommended weekly duration. Moreover, more than 80% of parents of inactive children incorrectly believe that their children are sufficiently active [4]. Developing risk prediction tools for children and youth would significantly improve NCD prevention and promote cost-effective strategies.

This paper presents the development of an initial demo application of a risk assessment tool designed for children and adolescents in the SmartCHANGE project [3]: merging datasets, predicting missing data to build longitudinal datasets, implementing existing validated models for diabetes (QxMD) and cardiovascular disease (SCORE2), and finally, the application development.

Table 1: Overview of Selected Datasets

Dataset Name          SLOfit      LGS      YFS      AFINA-TE
Country of Origin     SI          BE       FI       PT
Age Range             5–20        5–25     0–60     5–25
Longitudinal Study    Yes         Yes      Yes      No
# of Participants     280,165     17,991   3,596    1,632
# of Measurements     3,121,399   31,127   32,364   1,632
# of Variables        13          80       24       59
% of Missing Values   2.55%       16.25%   39.49%   33.53%

2 Methodology
2.1 Datasets
To estimate the risk of non-communicable diseases in children, one would ideally need a dataset that tracks risk factors from a young age (when the prediction is made) to an older age (when these diseases typically emerge). Such comprehensive longitudinal datasets would allow for accurate predictions of an individual's likelihood of developing a disease later in life based on their early risk factors. However, such datasets are currently unavailable, so we must rely on a collection of partial and often heterogeneous datasets.

In our study, we have chosen 16 types of variables that are used by the risk models SCORE2 [17] and QxMD [8]. The datasets we were using are described in Table 1. The SLOfit program is a school fitness monitoring initiative in Slovenia [11]. The Leuven Growth Study (LGS) [2, 16] is a longitudinal study initiated in 1969 that evaluates physical fitness. The Cardiovascular Risk in Young Finns Study (YFS) [10], started in the late 1970s, focuses on early cardiovascular disease risk factors. The AFINA-TE dataset [7] is part of an intervention program in Portugal designed to enhance physical fitness, activity, and nutritional knowledge among children and adolescents.

2.2 Data Imputation Through Datasets
The first step involved imputing missing values within each dataset (see Figure 1 for a representation). To guide this process, we calculated the coverage for each variable. Initially, we used only fully observed variables, such as height, weight, and sex, as features in models to impute missing values for other variables. The variables were imputed in order of their coverage using machine learning on existing features. After this initial imputation sweep, we had a complete, though potentially imperfect, dataset. In the second sweep, we treated all columns as complete, incorporating the newly imputed values from the first sweep. This allowed us to train models with a more comprehensive dataset, improving the accuracy of the imputation.

Figure 1: (a) Example of the datasets pre-imputation; (b) example of the datasets post-imputation. The YFS dataset (blue) covers a broad range of variables across a wide age span but includes a relatively small number of participants. In contrast, the SLOfit dataset (green) has many participants but includes fewer variables over a shorter age span. In the first step, we imputed the missing variables across the datasets (grey).
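A minimal sketch of this two-sweep procedure is given below, assuming a purely numeric pandas DataFrame and using a random forest as the (interchangeable) imputation model; all names are hypothetical, and the real pipeline handles categorical variables and per-dataset specifics that are omitted here.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_two_sweeps(df: pd.DataFrame, complete_cols: list) -> pd.DataFrame:
    """Two-sweep imputation sketch for one (numeric) dataset."""
    df = df.copy()
    orig_missing = {c: df[c].isna() for c in df.columns}
    # Impute in decreasing order of coverage (share of observed values).
    targets = sorted((c for c in df.columns if c not in complete_cols),
                     key=lambda c: df[c].notna().mean(), reverse=True)

    # Sweep 1: predict each incomplete column from the columns that are
    # already complete (initially only the fully observed ones).
    features = list(complete_cols)
    for col in targets:
        mask = orig_missing[col]
        if mask.any():
            model = RandomForestRegressor(n_estimators=50, random_state=0)
            model.fit(df.loc[~mask, features], df.loc[~mask, col])
            df.loc[mask, col] = model.predict(df.loc[mask, features])
        features.append(col)   # the column is now complete and usable

    # Sweep 2: re-impute the originally missing cells, now treating all
    # other columns (including first-sweep imputations) as features.
    for col in targets:
        mask = orig_missing[col]
        if not mask.any():
            continue
        feats = [f for f in df.columns if f != col]
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(df.loc[~mask, feats], df.loc[~mask, col])
        df.loc[mask, col] = model.predict(df.loc[mask, feats])
    return df
```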
2.3 Longitudinal Data Imputation
In the second step, we employed a similar approach but focused on merging the datasets to fill the new merged dataset longitudinally (see Figure 2 for a representation). To maximize their overlap, we treated certain variables as equivalent, such as the vertical jump from the LGS dataset and the standing long jump from the SLOfit dataset.

Figure 2: Longitudinal filling of the datasets.

Since the raw values of these variables differ, we standardized them by converting them to z-scores, calculated as follows:

\[ z\_score = \frac{variable - mean}{standard\_deviation}. \]

For instance, a vertical jump one standard deviation above the mean in the LGS dataset was considered equivalent to a standing long jump one standard deviation above the mean in the SLOfit dataset. After matching and standardizing the columns across datasets, we merged the individual datasets into a single, comprehensive dataset and repeated the imputation process.

With a merged dataset free of missing values, we built models to predict attribute values at age 55 (the oldest age supported by our data) using values from age 14. Due to the lack of data covering the entire age range from 14 to 55, we approached this in two stages: predicting from age 14 to 18 and then from 18 to 55. The models used were simple neural networks with a single hidden layer.

This individual forecasting approach required available data for the same person from the start to the end age. However, since we had more data available for different people of various ages, we also explored a population-based approach to forecast the typical evolution of each variable. While this method is less personalized, it is also less prone to anomalies caused by atypical individuals. In the population-based approach, we again used z-scores, assuming that each person's z-score remains constant. For example, if someone's blood pressure is one standard deviation below the mean at age 14, it is assumed to stay one standard deviation below the mean at age 55 (see Figure 3).

Figure 3: Population-based approach using z-scores.
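A minimal sketch of this constant-z-score forecast is shown below; the per-age means and standard deviations are hypothetical stand-ins for the statistics of the merged dataset.

```python
def population_forecast(value_at_14, stats):
    """Keep the person's z-score fixed and map it through the cohort
    mean/std at each later age (stats: age -> (mean, std))."""
    mean14, std14 = stats[14]
    z = (value_at_14 - mean14) / std14
    return {age: mean + z * std for age, (mean, std) in stats.items()}

# Hypothetical systolic blood pressure statistics (mmHg) at three ages.
sbp_stats = {14: (115.0, 10.0), 18: (120.0, 11.0), 55: (130.0, 15.0)}
print(population_forecast(105.0, sbp_stats))
# One standard deviation below the mean at 14 stays one below at 18 and 55.
```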
2.4 Risk Models

The SCORE2 and QxMD models were used in the application to assess cardiovascular disease and type 2 diabetes risk. These models were chosen for their validity, robustness and effectiveness in predicting these chronic conditions. By incorporating both, healthcare practitioners can comprehensively evaluate cardiometabolic risk factors, aiding in well-informed patient management and intervention decisions.

The SCORE2 model, developed by the European Society of Cardiology, estimates the risk of cardiovascular events over ten years. It calculates the risk score by incorporating variables such as age, sex, smoking status, blood pressure, and lipid profile. Additionally, SCORE2 considers regional variations in risk factors, providing more accurate predictions tailored to specific populations [17].

The QxMD Diabetes Risk Calculator, a comprehensive clinical decision support tool, is employed to evaluate the risk of developing type 2 diabetes mellitus. This model integrates risk factors, including age, BMI, family history, physical activity level, and dietary habits, to estimate an individual's diabetes risk [8].

3 Evaluation

Table 2 presents the cross-validated evaluation results of our forecasting models. As anticipated, the errors in the first stage of individual forecasting are low due to the relatively short period. The errors in the second stage are higher but still considered acceptable, with the notable exceptions of weight and smoking. We hypothesize that the high variability during puberty, which many adolescents experience around age 14, complicates accurate weight forecasting. In population forecasting, the errors are generally larger, which aligns with the less personalized nature of this method. However, weight is forecasted with greater accuracy in this approach. In the future, we may explore combining both methods or selecting the more accurate one depending on the variable.

Variable | Ind. 18 | Ind. 55 | Pop.
Height [cm] | 3.11 | 3.47 | 1.62
Weight [kg] | 4.79 | 13.60 | 10.58
SBP [mmHg] | 1.46 | 2.39 | 10.91
Total cholesterol [mmol/L] | 0.05 | 0.10 | 0.64
HDL [mmol/L] | 0.02 | 0.08 | 0.21
LDL [mmol/L] | 0.05 | 0.17 | 0.51
Smoking [1-9] | 1.01 | 1.72 | 2.26

Table 2: Mean absolute error for individual forecasting to ages 18 and 55, and for population forecasting.

4 Demo Application

To show the general idea of the project, we constructed a demo application implemented in Python with the Dash framework. In the app, a user can specify the inputs to the models (some inputs, such as steroid use, were fixed to keep the app concise), which in turn yield two plots showing how the cardiovascular and diabetes risks evolve from the currently selected age up to age 55. In a separate plot, we also show how a chosen risk factor changes over time.

4.1 Risk Prediction Using the Demo Application

The developed demo application interface offers a dynamic tool for visualizing health risks based on various user-input parameters used in the risk models (Figure 4). By allowing users to adjust these parameters, the dashboard generates real-time projections of two key risk metrics: a 10-year cardiovascular risk score and a 10-year risk of developing diabetes. These risks are shown in two line graphs, illustrating how the probability of these conditions evolves with age. Additionally, the dashboard includes a feature that tracks the progression of a selected health parameter (BMI, systolic blood pressure, total cholesterol, HDL) over time, providing insight into how this factor might change as the individual ages. The developed tool intuitively explains how lifestyle and physiological factors contribute to long-term health risks, offering valuable insights for clinical decision-making and personal health management.

Figure 4: Dashboard interface that allows users to input various health-related parameters and observe the evolution of associated risks over time.
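To illustrate the kind of Dash setup described above, the following is a minimal single-slider sketch; predict_risk is a hypothetical stand-in for the forecasting models combined with the risk scores, and the layout is far simpler than the actual dashboard in Figure 4.

```python
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go

def predict_risk(age: int, sbp: float) -> float:
    # Hypothetical stand-in for the age-forecasting models plus SCORE2.
    return min(1.0, 0.002 * age * (sbp / 120))

app = Dash(__name__)
app.layout = html.Div([
    html.Label("Systolic blood pressure [mmHg]"),
    dcc.Slider(id="sbp", min=90, max=160, step=1, value=110),
    dcc.Graph(id="risk-plot"),
])

@app.callback(Output("risk-plot", "figure"), Input("sbp", "value"))
def update_plot(sbp):
    ages = list(range(14, 56))
    fig = go.Figure(go.Scatter(x=ages, y=[predict_risk(a, sbp) for a in ages]))
    fig.update_layout(xaxis_title="Age", yaxis_title="10-year cardiovascular risk")
    return fig

if __name__ == "__main__":
    app.run(debug=True)
```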
4.2 Further Development of the Application

The current version of the demo application is based on the data and models currently available. However, there remains an open question regarding the specific needs and preferences of the medical experts who will ultimately use the final application. To address this, we plan to present the current version to these experts and, based on their feedback, refine and enhance the application in subsequent iterations.

5 Conclusion

The SmartCHANGE project represents a significant step toward improving the early detection and prevention of non-communicable diseases (NCDs) in children and youth. While the tool presented in this paper is a demo version demonstrating some basic functionalities, our future work will focus on developing a more comprehensive web application for medical professionals and a mobile application for families. We also plan to enhance the tool by replacing the current SCORE2 and QxMD risk models with more advanced models—Test2Prevent for diabetes and Healthy Heart Score for cardiovascular disease—incorporating features related to diet and physical activity. Additionally, the application will be updated to meet medical experts' needs based on their feedback.

Acknowledgements

This work was carried out as a part of the SmartCHANGE project, which received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101080965. The SLOfit dataset was provided by the University of Ljubljana (courtesy of Gregor Jurak et al.), the LGS dataset was provided by KU Leuven (courtesy of Martine Thomis), the AFINA-TE dataset was provided by the University of Porto (courtesy of José Ribeiro), and the YFS dataset was provided by the University of Turku. We are grateful for their support.

References

[1] P. S. Azzopardi, S. J. C. Hearps, K. L. Francis, E. C. Kennedy, A. H. Mokdad, N. J. Kassebaum, S. Lim, et al. 2019. Progress in adolescent health and wellbeing: tracking 12 headline indicators for 195 countries and territories, 1990-2016. Lancet, 393, 10190, (Mar. 2019), 1101–1120.
[2] Gaston P. Beunen, Robert M. Malina, Marc A. Van't Hof, Jan Simons, Michel Ostyn, Roland Renson, and Dirk Van Gerven. 1988. Adolescent Growth and Motor Performance: A Longitudinal Study of Belgian Boys. Human Kinetics Publishers.
[3] SmartCHANGE Consortium. 2024. SmartCHANGE – Horizon Europe project. Accessed: 2024-09-02. https://www.smart-change.eu/.
[4] K. Corder, E. M. van Sluijs, I. Goodyer, C. L. Ridgway, R. M. Steele, D. Bamber, V. Dunn, S. J. Griffin, and U. Ekelund. 2011. Physical activity awareness of British adolescents. Archives of Pediatrics & Adolescent Medicine, 165, 3, 281–289.
[5] A. García-Hermoso, R. Ramírez-Campillo, and M. Izquierdo. 2019. Is muscular fitness associated with future health benefits in children and adolescents? A systematic review and meta-analysis of longitudinal studies. Sports Medicine, 49, 7, (July 2019), 975–989.
[6] D. C. Goff Jr., D. M. Lloyd-Jones, G. Bennett, S. Coady, R. B. D'Agostino, and R. Gibbons. 2014. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation, 129, 25, (June 2014), S49–S73.
[7] Noelia González-Gálvez, Jose Carlos Ribeiro, and Jorge Mota. 2022. Cardiorespiratory fitness, obesity and physical activity in schoolchildren: the effect of mediation. International Journal of Environmental Research and Public Health, 19, 23, 16262–16270. doi: 10.3390/ijerph192316262.
[8] S. J. Griffin, P. S. Little, C. N. Hales, A. L. Kinmonth, and N. J. Wareham. 2000. Diabetes risk score: towards earlier detection of type 2 diabetes in general practice. Diabetes/Metabolism Research and Reviews, 16, 3, 164–171.
[9] D. R. Jacobs, J. G. Woo, A. R. Sinaiko, S. R. Daniels, J. Ikonen, and M. Juonala. 2022. Childhood cardiovascular risk factors and adult cardiovascular events. New England Journal of Medicine, 386, 19, (May 2022), 1765–1777.
[10] Markus Juonala et al. 2008. Cohort profile: the Cardiovascular Risk in Young Finns Study. International Journal of Epidemiology, 37, 6, 1220–1226.
[11] Gregor Jurak et al. 2020. SLOfit surveillance system of somatic and motor development of children and adolescents: upgrading the Slovenian sports educational chart. Acta Universitatis Carolinae. Kinanthropologica, 56, 1, 28–40. doi: 10.14712/23366052.2020.4.
[12] H. C. McGill Jr., C. A. McMahan, E. E. Herderick, G. T. Malcom, R. E. Tracy, and J. P. Strong. 2000. Origin of atherosclerosis in childhood and adolescence. American Journal of Clinical Nutrition, 72, 5, (Nov. 2000), 1307S–1315S.
[13] K. Pahkala, H. Hietalampi, T. T. Laitinen, J. S. Viikari, T. Rönnemaa, H. Niinikoski, et al. 2013. Ideal cardiovascular health in adolescence: effect of lifestyle intervention and association with vascular intima-media thickness and elasticity (the Special Turku Coronary Risk Factor Intervention Project for Children [STRIP] study). Circulation, 127, 18, (May 2013), 2088–2096.
[14] J. R. Ruiz, I. Cavero-Redondo, F. B. Ortega, G. J. Welk, L. B. Andersen, and V. Martinez-Vizcaino. 2016. Cardiorespiratory fitness cut points to avoid cardiovascular disease risk in children and adolescents; what level of fitness should raise a red flag? A systematic review and meta-analysis. British Journal of Sports Medicine, 50, 13, 773–779.
[15] T. J. Saunders, C. E. Gray, V. J. Poitras, J. P. Chaput, I. Janssen, P. T. Katzmarzyk, et al. 2016. Combinations of physical activity, sedentary behaviour and sleep: relationships with health indicators in school-aged children and youth. Applied Physiology, Nutrition, and Metabolism, 41, 6, (June 2016), 486–505.
[16] Johan Simons, Gaston Beunen, Roland Renson, Albrecht L. M. Claessens, Bernard Vanreusel, and Jos A. V. Lefevre. 1990. Growth and Fitness of Flemish Girls: The Leuven Growth Study. Human Kinetics, Champaign, IL.
[17] SCORE2 working group and ESC Cardiovascular risk collaboration. 2021. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. European Heart Journal, 42, 25, (June 2021), 2439–2454.
[18] World Health Organization. 2018. Global Health Estimates 2016: Deaths by Cause, Age, Sex, by Country and by Region, 2000-2016. World Health Organization.

Predicting Mental States During VR Sessions Using Sensor Data and Machine Learning

Emilija Kizhevska (emilija.kizhevska@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School (IPS), Ljubljana, Slovenia
Mitja Luštrek (mitja.lustrek@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School (IPS), Ljubljana, Slovenia

Abstract

Empathy is a multifaceted concept with both cognitive and emotional components that plays a crucial role in social interactions, prosocial behavior, and mental health. In our study, empathy and general arousal were induced via VR, with physiological signals measured and ground truth collected through questionnaires. Data from over 100 participants were collected and analyzed using multiple machine learning models and classification algorithms to predict empathy based on physiological responses. We explored different data balancing techniques and labeled the data in multiple ways to enhance model performance. Our results show that the models are effective in detecting general arousal and empathy and in differentiating between non-empathic and empathic arousal, but they encountered difficulties with precise emotion detection. The dataset extracted at 5-second intervals and models using Random Forest and Extreme Gradient Boosting showed the best performance. Future work will focus on refining emotion detection through advanced modeling techniques and investigating gender differences in empathy.

Keywords: VR, mental states, machine learning, sensor data
1 Introduction

Empathy is a multifaceted concept explored across various fields, including psychology, neuroscience, and sociology. Though no universal definition exists, empathy is generally understood to include both cognitive (understanding another's perspective) and emotional (experiencing another's feelings) components [8]. Our research defines empathy as the ability to model others' emotional states and respond sensitively while recognizing the self-other distinction [14].

There is no "golden standard" for measuring empathy [10], with methods varying from self-report questionnaires to psychophysiological measures like heart rate and skin conductance. Each method has its pros and cons, often leading to a combination of approaches for a comprehensive assessment. Psychophysiological measures offer objective data but face challenges due to individual variability and non-empathetic factors. Our study addresses these issues by using machine learning to directly measure empathy from physiological signals, offering a novel approach.

VR creates an immersive environment that enhances empathy by allowing users to experience different perspectives and engage emotionally. VR is effective for empathy training and is referred to as 'the ultimate empathy machine' [1, 11] for various reasons: 1) Immersive Experience: provides a strong sense of presence, helping users adopt new viewpoints [15]. 2) Perspective-Taking and Emotional Engagement: simulates realistic scenarios to provoke emotional responses and understanding [19]. 3) Empathy Training: effective in healthcare, education, and diversity training by challenging preconceptions and deepening emotional insights [16]. 4) Ethical Considerations: ensures respectful use of VR, balancing immersive experiences with participants' well-being [2].

The objective of this study was to examine how participants' empathy correlates with changes in their physiological metrics, measured using sensors such as an inertial measurement unit (IMU), photoplethysmograph (PPG), and electromyography (EMG). Participants were immersed in 360° VR videos featuring actors displaying various emotions (sadness, happiness, anger, and anxiety) and reported their empathetic experiences via brief questionnaires. Using data from these sensors and questionnaires, machine learning models were developed to predict empathy scores based on physiological responses during the VR sessions [9].

2 Materials and Data Collection Process

2.1 Materials and Setup for Empathy Elicitation in VR

To elicit empathy, we immersed participants in a 360° and 3D virtual environment, as VR has proven more effective than methods like 2D videos, workshops, or text-based exercises [8, 13, 17, 20]. We used videos featuring actors expressing four emotions—happiness, sadness, anger, and anxiousness—without additional content to avoid confounding factors [2]. Recognizing the impact of understanding emotional context, an audio narrative version was also created, followed by a corresponding video (50–120 seconds). To ensure gender balance, we recorded videos with two male and two female actors. Five versions were developed: four with narratives (two male, two female) and one non-narrative, where all emotions are portrayed by all actors without accompanying narratives. The non-narrative version allows gradual transitions between emotions, making it suitable for participants of all linguistic backgrounds.

Additionally, a 2-minute forest video ("The Amsterdam Forest in Springtime") was included at the start to establish a relaxed baseline, and a roller coaster video ("Official 360 POV - Yukon Striker - Canada's Wonderland") at the end to control for non-empathic arousal. Both videos were sourced from YouTube.

Participants completed trait empathy questionnaires (QCAE) [14] and, after each emotion-specific video, provided feedback on their empathic state (State Empathy Scale) [18], arousal and valence levels (SAM) [3], and personal distress (IRI) [5]. Each VR session lasted around 20 minutes to minimize VR sickness, with participants viewing one of the five versions.

Sensor data were collected using the emteqPRO system attached to a Pico Neo 3 Pro Eye VR headset, including EMG for facial muscle activation, PPG for heart rate, and IMU for head motion tracking. The device also uses an internal clock [12].
2.2 Dataset Description

In this research, we used convenience sampling to recruit participants from the general public without a specific selection pattern. Participants were invited from various sources, including Jožef Stefan Institute employees, university students, and the general public. Invitations were sent verbally or in writing. Data collection concluded with 105 participants, averaging 22.43 ± 5.31 years of age (range 19–45), with 75.24% identifying as female. Participants had diverse educational and professional backgrounds. Ethical clearance for this study was obtained from the Research Ethics Committee at the Faculty of Arts, University of Maribor, Slovenia (No. 038-11-146/2023/13FF UM), and written informed consent was obtained from the actors prior to recording.

The emteqPRO system not only provides raw sensor data but also generates derived variables through the Emteq Emotion AI Engine, which uses data fusion and machine learning to analyze multimodal sensor data and assess the user's emotional state. The system provides a file with 29 derived features, called affective insights, for each recording: 7 features for heart-rate variability (HRV) and 3 for breathing rate; 2 features for facial expressions; 4 features for arousal and 4 for valence; 1 feature for facial activation; and 1 feature for facial valence. Additionally, head activity is tracked, reflecting the percentage of the session with head movement. Dry EMG electrodes on facial muscles such as the zygomatic, corrugator, frontalis, and orbicularis provide four more features, each representing muscle activation as a percentage of the maximum activation observed during calibration. The data also include the time elapsed since the start of the recording and the row index.

3 Methodology

3.1 Preprocessing

All the features or insights are numeric except "Expression/Type," which has three values—smile, frown, and neutral—so we applied one-hot encoding, a preprocessing technique in which categorical (non-numeric) variables are transformed into a numerical format: each unique value of the original feature becomes a separate binary (0 or 1) feature. Next, because missing values represent less than 1% of the total data for each participant, they were filled in with the average of each feature's values. Scaling the values of the descriptive features between 0 and 1 was the final preprocessing step.
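A minimal sketch of these three preprocessing steps with pandas and scikit-learn; the DataFrame layout is assumed, and only the "Expression/Type" column name and its values come from the description above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot encode the only categorical insight (smile / frown / neutral).
    df = pd.get_dummies(df, columns=["Expression/Type"])
    # Missing values are <1% per participant: fill with each feature's mean.
    df = df.fillna(df.mean(numeric_only=True))
    # Scale all descriptive features to the [0, 1] range.
    df[df.columns] = MinMaxScaler().fit_transform(df)
    return df
```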
3.2 Feature Engineering

Since features were provided at intervals ranging from 1 second to 500 milliseconds, we divided the data into two window sizes: one of 5 seconds and one of 500 milliseconds. For each window, we computed features from the 22 insights across the seven modules, as well as from the features for head activity and the facial muscle electrodes, deriving a total of 108 new features, including the minimum, maximum, average, and standard deviation of each original feature or insight. Additionally, the features for head activity and the facial muscle electrodes were used to define "Expression/Type," and the time and row index were used as provided. However, the row index was disregarded further in the study.

We labeled the dataset in six different ways: 1) as a binary classification aiming to detect empathic arousal, comparing the empathic parts with the forest part of the video while excluding the non-empathic content of the roller coaster video; 2) as a binary classification using the forest and roller coaster, aiming to detect non-empathic arousal; 3) again as a binary classification, but including only the empathic parts and the roller coaster, aiming to distinguish between empathic and non-empathic arousal and examining the differences in physiological responses between empathic content and non-empathic arousal-inducing content such as the roller coaster video; 4) aiming to detect arousal in general, regardless of whether it is empathic or non-empathic, by splitting the entire dataset into two classes: the forest and everything else, including the empathic parts and the roller coaster; 5) into three classes, treating the chunks of the roller coaster and the forest as separate classes and grouping all the empathic parts into one class without differentiating between the different emotions, the goal being to distinguish among no arousal, empathic arousal, and non-empathic arousal; and 6) with the average of participants' answers to the state empathy questions for each part of the video, with each part of the empathic content considered a separate chunk plus two additional classes for the forest and the roller coaster, the aim being to detect the level of empathy participants experience during the session. We also included each participant's ID, intending to later use it for model evaluation with the leave-one-subject-out technique.
4 Experiments and Results

4.1 Experimental Setup

To build models for predicting a participant's state empathy during the VR session, we used six different classification algorithms: Gaussian Naive Bayes, Stochastic Gradient Descent Classifier, K-Nearest Neighbors Classifier, Decision Tree Classifier, Random Forest Classifier, and Extreme Gradient Boosting Classifier. The balanced accuracy score was used as the evaluation metric for the classification models: it evaluates the overall balanced accuracy of a model by calculating the average of the recall obtained on each class. Additionally, we used confusion matrices to evaluate the performance of the classification models by comparing the actual and predicted labels.

For model evaluation, we used a leave-one-subject-out (LOSO) cross-validation setup, where each subject is a unique participant identified by their ID.

Because labeling schemes 2, 3, 5, and 6 are not balanced (with 80% of the data in the majority class), we conducted four experiments for each developed model: 1) applying the Synthetic Minority Over-sampling Technique (SMOTE) to create synthetic samples of the minority class to balance the dataset; 2) using the RandomUnderSampler (RUnderS) method to randomly select samples from the majority class, thereby reducing their count and balancing the dataset; 3) using SMOTETomek, a combination of SMOTE for oversampling and Tomek links for undersampling, which targets both the minority and majority classes; and 4) using the dataset as it is, without any undersampling or oversampling.
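The following is a minimal sketch of the LOSO evaluation with balanced accuracy, with SMOTE applied only to the training folds to avoid leakage; the Random Forest classifier stands in for any of the six algorithms, and the function is illustrative rather than the study's actual code.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_balanced_accuracy(X, y, participant_ids):
    """X, y, participant_ids: NumPy arrays; each fold holds out one participant."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=participant_ids):
        # Resample only the training folds so no synthetic samples leak into the test.
        X_tr, y_tr = SMOTE().fit_resample(X[train_idx], y[train_idx])
        model = RandomForestClassifier().fit(X_tr, y_tr)
        scores.append(balanced_accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores)
```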
4.2 Results

Including models developed with six different classification algorithms on two distinct datasets (with two different window sizes), four different data balancing techniques (undersampling, oversampling, a combination technique, and the dataset in its original form), and six different labeling schemes, we obtained 288 unique confusion matrices and corresponding accuracies, one for each combination.

We computed a correlation matrix, which revealed that the highest correlations with the state empathy feature were found for the derived maximum and minimum values of the mean heart rate, the derived maximum and minimum values of the arousal class feature, and the average of the arousal class—the insight, which can be -1 (low), 0 (medium), or 1 (high). The derived standard deviation, maximum, and minimum values of the activation—expressed as a percentage of the maximum activation of particular muscles from the calibration session, especially the zygomaticus and orbicularis muscles—were also highly correlated.

Regarding the labeling schemes, we can conclude the following: 1) we can detect empathic arousal, with confusion matrices showing a relatively good distribution of correct predictions across both classes and high accuracies for most of the developed models; 2) we can detect non-empathic arousal, with almost every developed model achieving a balanced accuracy higher than 60%, reaching up to 78%, and a reasonable balance between classes, indicating satisfactory classification performance; 3) we can even distinguish between empathic and non-empathic arousal, with a balanced accuracy of 79%; 4) we can detect arousal in general, again with high accuracies and balanced classes; 5) we can distinguish to some extent among no arousal, empathic arousal, and non-empathic arousal; 6) however, it is currently very challenging to detect the precise level of empathy participants are feeling during the session using these methods, and to determine whether they are empathizing by mirroring emotions or experiencing something different while observing specific emotions. The best we can detect in this regard is up to 28% balanced accuracy, with confusion matrices showing a relatively balanced performance across multiple classes and a good number of correct classifications, particularly in the more frequent classes.

Regarding the two window sizes, both sets of models showed similar class balance and balanced accuracy scores. However, the dataset extracted at 5-second intervals performed slightly better: false positives and false negatives were reduced more effectively, which led to more reliable classification performance, especially in terms of precision and recall, despite the smaller scale. Thus, the models developed using the 5-second interval dataset generally performed better, showing more effective classification and fewer errors. The simpler confusion matrices and potentially better handling of fewer classes suggest that it performs better in practical terms (Figure 1, Figure 2).

Regarding the data balancing techniques, the undersampling technique never produced the best results. For the dataset extracted at 500 ms intervals, the SMOTE oversampling technique and SMOTETomek yielded the best results. For the dataset extracted at 5-second intervals, using the entire dataset yielded the best results, although models developed using SMOTETomek yielded only slightly lower results for each combination of labeling schemes.

Regarding the classification algorithms, Gaussian Naive Bayes performed the worst in terms of balanced confusion matrices, while the Random Forest Classifier and Extreme Gradient Boosting performed the best across all combinations, with the Random Forest Classifier showing slightly better results for most combinations (Figure 1, Figure 2).

Figure 1: The best accuracies for each group of models, developed using datasets extracted at two different frequencies and various data balancing techniques, presented for all the labeling schemes.

Figure 2: The best confusion matrices for each group of models, developed using the dataset extracted at a 5 s window size and various data balancing techniques, shown for all labeling schemes.
4.3 Conclusion

In this study, we defined the entire plan for developing the materials, methods, and environments to evoke and measure the level of empathy. We started by defining the videos and the session, creating or selecting questionnaires for later use as ground truth, writing the narratives, recording the VR videos, and then editing and preparing them for use. Additionally, we collected a dataset from over 100 participants, which we filtered, preprocessed, and prepared for feature engineering and analysis.

We conducted and analysed four groups of experiments, totaling 288 combinations, in which we developed models using two different window sizes, six classification algorithms, and three resampling techniques (plus the original, unresampled dataset), with six different labeling schemes aimed at detecting various aspects of the dataset chunks: four empathetic parts, the forest, and the roller coaster. The main conclusion is that we can detect arousal in general, non-empathic arousal, and empathy, and differentiate between non-empathic and empathic arousal, as well as between relaxed states and arousal. However, we face difficulties in detecting and distinguishing between the precise levels of empathy during VR sessions using these methods and approaches.

Our next steps involve refining the detection of empathy levels during VR sessions by applying detailed data filtering and transforming the data into a stationary format. Furthermore, we will develop models such as autoregressive, moving average, and extended recurrent moving average models, and use clustering techniques like DBSCAN and HDBSCAN. Additionally, we will extract more features from the raw data or use end-to-end neural networks. We plan to analyze gender differences in empathy with a t-test [7], and explore the impact of narrative context and emotions on empathic responses using ANOVA and MANOVA [4, 6].

Acknowledgements

The work of Emilija Kizhevska was supported by the Slovenian Research and Innovation Agency (ARIS) as part of the young researcher PhD program, grant PR-12879. The technical aspects of the videos, the recording and the video editing were skillfully conducted by Igor Djilas and Luka Komar. The actors featured in the videos were Sara Janašković, Kristýna Šajtošová, Domen Puš, and Jure Žavbi. The questionnaires were selected and created, the narratives were written, and the psychological aspects of the video creation were considered by Kristina Šparemblek.

References

[1] D. Banakou, P. D. Hanumanthu, and M. Slater. 2016. Virtual embodiment of white people in a black virtual body leads to a sustained reduction in their implicit racial bias. Frontiers in Human Neuroscience, 10, 226766.
[2] P. Bertrand, J. Guegan, L. Robieux, C. A. McCall, and F. Zenasni. 2018. Learning empathy through virtual reality: multiple strategies for training empathy-related abilities using body ownership illusions in embodied virtual reality. Frontiers in Robotics and AI, 5, 326671.
[3] M. M. Bradley and P. J. Lang. 1994. Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25, 1, 49–59.
[4] A. Cuevas, M. Febrero, and R. Fraiman. 2004. An ANOVA test for functional data. Computational Statistics and Data Analysis, 47, 1, 111–122.
[5] M. H. Davis. 1980. A multidimensional approach to individual differences in empathy. JSAS Catalog of Selected Documents in Psychology, 85.
[6] A. French, M. Macedo, J. Poulsen, T. Waterson, and A. Yu. 2008. Multivariate analysis of variance (MANOVA). San Francisco State University.
[7] T. K. Kim. 2015. T test as a parametric statistic. Korean Journal of Anesthesiology, 68, 6, 540–546.
[8] E. Kizhevska, F. Ferreira-Brito, T. Guerreiro, and M. Luštrek. 2022. Using virtual reality to elicit empathy: a narrative review. VR4Health@MUM, 19–22.
[9] E. Kizhevska, K. Šparemblek, and M. Luštrek. 2024. Protocol of the study for predicting empathy during VR sessions using sensor data and machine learning. PloS One, 19, 7, e0307385.
[10] F. F. D. Lima and F. D. L. Osório. 2021. Empathy: assessment instruments and psychometric quality – a systematic literature review with a meta-analysis of the past ten years. Frontiers in Psychology, 12, 781346.
[11] M. Mado, F. Herrera, K. Nowak, and J. Bailenson. 2021. Effect of virtual reality perspective-taking on related and unrelated contexts. Cyberpsychology, Behavior, and Social Networking, 24, 12, 839–845.
[12] M. J. Magnée, B. De Gelder, H. Van Engeland, and C. Kemner. 2007. Facial electromyographic responses to emotional information from faces and voices in individuals with pervasive developmental disorder. Journal of Child Psychology and Psychiatry, 48, 11, 1122–1130.
[13] K. M. Nelson, E. Anggraini, and A. Schlüter. 2020. Virtual reality as a tool for environmental conservation and fundraising. PloS One, 15, 4, e0223631.
[14] R. L. E. P. Reniers, R. Corcoran, R. Drake, N. M. Shryane, and B. A. Völlm. 2011. The QCAE: a questionnaire of cognitive and affective empathy. Journal of Personality Assessment, 93, 1, 84–95. doi: 10.1080/00223891.2010.528484.
[15] G. Riva, J. A. Waterworth, and E. L. Waterworth. 2004. The layers of presence: a bio-cultural approach to understanding presence in natural and mediated environments. CyberPsychology & Behavior, 7, 4, 402–416.
[16] R. O. Roswell, C. D. Cogburn, J. Tocco, J. Martinez, C. Bangeranye, J. N. Bailenson, and L. Smith. 2020. Cultivating empathy through virtual reality: advancing conversations about racism, inequity, and climate in medicine. Academic Medicine, 95, 12, 1882–1886.
[17] N. S. Schutte and E. J. Stilinović. 2017. Facilitating empathy through virtual reality. Motivation and Emotion, 41, 708–712.
[18] L. Shen. 2010. On a scale of state empathy during message processing. Western Journal of Communication, 74, 5, 504–524.
[19] M. Slater, A. Antley, A. Davison, D. Swapp, C. Guger, C. Barker, and M. V. Sanchez-Vives. 2006. A virtual reprise of the Stanley Milgram obedience experiments. PloS One, 1, 1, e39.
[20] J. Stargatt, S. Bhar, T. Petrovich, J. Bhowmik, D. Sykes, and K. Burns. 2021. The effects of virtual reality-based education on empathy and understanding of the physical environment for dementia care workers in Australia: a controlled study. Journal of Alzheimer's Disease, 84, 3, 1247–1257.

Biomarker Prediction in Colorectal Cancer Using Multiple Instance Learning

Miljana Shulajkovska (miljana.sulajkovska@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Matej Jelenc (jelenc11matej@gmail.com), Jožef Stefan Institute, Ljubljana, Slovenia
Jitendra Jonnagaddala (jitendra.jonnagaddala@unsw.edu.au), School of Population Health, Faculty of Medicine and Health, Sydney, Australia
Anton Gradišek (anton.gradisek@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Microsatellite instability (MSI) is a crucial biomarker in colorectal cancer, guiding personalised treatment strategies. The focus of our paper is on evaluating how different state-of-the-art pretrained artificial intelligence models perform in extracting features on the Molecular and Cellular Oncology (MCO) study dataset to predict biomarkers. In this study, we present an advanced approach for MSI prediction using multiple instance learning on whole slide images (WSIs). Our process begins with comprehensive preprocessing of WSIs, followed by tessellation, which breaks down large images into manageable tiles. State-of-the-art feature extraction techniques are utilised on these selected tiles, employing pretrained models to capture rich, discriminative features. Various aggregation methods are applied to combine these features, leading to the prediction of MSI status across the entire slide. We assess the performance of different pretrained models within this framework, demonstrating their effectiveness in accurately predicting MSI, with results showing an AUROC of 0.91 on the MCO dataset.
Our findings underscore the potential of multiple instance learning-based approaches in enhancing biomarker prediction in colorectal cancer, contributing to more targeted and effective treatment strategies.

Keywords: multiple instance learning, whole slide images, colorectal cancer, biomarker prediction

1 Introduction

MSI is a crucial biomarker in colorectal cancer (CRC) that indicates defects in the DNA mismatch repair system, leading to a high mutation rate within tumor cells. MSI status has significant clinical implications, influencing treatment decisions, particularly the use of immunotherapy, and providing prognostic information. Traditionally, MSI is determined through laboratory tests such as PCR-based assays or immunohistochemistry (IHC) on tumor tissue samples, which require invasive biopsy procedures. However, these methods can be time-consuming, costly, and dependent on the availability of sufficient tissue samples.

Deep learning methods have emerged as a promising non-invasive alternative for MSI prediction by analysing whole slide images (WSIs) of histopathological samples. These models can detect patterns linked to MSI, eliminating the need for genetic testing. WSIs provide a comprehensive view of tumor histology, offering a faster, less invasive, and more accessible means of diagnosis. Integrating deep learning into clinical practice can improve early MSI detection, personalise treatment, and reduce invasive procedures. WSI-based methods streamline diagnostics and enhance cancer care with accessible predictive analytics.

Due to the vast size of WSIs, computational resources can easily be overwhelmed, so WSIs are often divided into smaller regions or patches. A common method for addressing these issues is Multiple Instance Learning (MIL) [3, 8]. MIL is a machine learning technique that operates on sets or "bags" of instances, where the label is assigned to the entire bag rather than to individual instances. This is particularly advantageous in WSI analysis, where labels such as MSI status apply to the entire slide, which is composed of numerous smaller regions or patches.

In this context, [4] demonstrates state-of-the-art (SOTA) results in predicting MSI in colorectal cancer. Their workflow utilizes the Swin-T model on small datasets to predict MSI. First, a pretrained tissue classification model is employed to filter out non-tissue patches, followed by fine-tuning a pretrained model to classify the remaining patches. Both intra-cohort and external validation are performed. When trained on the MCO dataset (N=1065), the model achieved a mean AUROC of 0.92 ± 0.05 for MSI prediction. Similarly, [11] employs a transformer-based approach for a large-scale multi-cohort evaluation, involving over 13,000 patients for biomarker prediction and achieving a negative predictive value of over 0.99 for MSI prediction. When trained and tested only on a single cohort (MCO), the model achieved an AUROC of 0.85. While [4] achieved promising results on the MCO dataset using an additional tissue classifier, we obtained comparable performance without the need for tissue classification. On the other hand, [11] used a multicentric cohort, which demands additional computational resources. In comparison to their results on the MCO dataset, we achieved a 6% improvement using a smaller dataset.

In this study, we leverage MIL to process WSIs for the prediction of MSI in CRC. By testing SOTA models on the MCO dataset, we aim to assess their performance in MSI prediction using MIL. This approach not only highlights the potential of MIL in processing complex, unannotated WSIs but also contributes to the broader goal of improving biomarker prediction in CRC, ultimately supporting more personalized and effective treatment strategies.

The paper is organised as follows: Section 2 outlines the methods used in the pipeline, Section 3 provides a description of the data, Section 4 presents the results, and Section 5 discusses the findings and potential directions for future work.
2 Methods

This section outlines the pipeline for MSI prediction, as illustrated in Figure 1. The process begins with the preprocessing of WSIs, including tessellation into smaller patches. Next, SOTA pretrained models are employed to extract features from these patches. These models, trained on large and diverse datasets, capture rich and discriminative features crucial for accurate MSI prediction. Finally, aggregation techniques are applied to combine the information from the patches, enabling precise MSI status prediction for the entire slide. Each subsection provides a concise explanation of these individual processes.

Figure 1: General architecture: multiple-instance learning approach.

2.1 Preprocessing

WSIs are first tessellated into smaller, more manageable patches to facilitate further processing. This step involves dividing the large images into smaller regions using the tiatoolbox presented in [9]. Non-informative tissue patches are removed to ensure the analysis focuses solely on relevant tissue areas. Specifically, patches that are out of bounds—where only a portion contains actual image data and the remainder consists of padding—are discarded. Patches that consist entirely of tissue are retained for subsequent analysis. This preprocessing step ensures that only informative and relevant patches are used for feature extraction and MSI prediction.

2.2 Feature Extraction Methods

Since only WSI-level annotations are available, several pretrained feature extraction models – UNI [1], ProvGigaPath [13], Phikon [2] and CTransPath [12] – are applied to the patches, removing the need for detailed patch-level labeling. These SOTA models, trained on large datasets, can capture complex and discriminative features essential for accurate biomarker prediction. The extracted feature embeddings are then used as input for the aggregation and classification stages, laying the foundation for precise MSI status prediction. For technical details about these models, see Table 1.

2.3 Aggregation Methods

After feature extraction, we apply aggregation techniques to combine patch-level features into a slide-level representation. Traditional pooling methods like max-pooling and mean-pooling provide straightforward approaches. However, these methods are limited by their lack of trainability. In recent years, attention-based pooling, or ABMIL, has become a popular technique that addresses this issue [6]. ABMIL assigns a weight $\alpha_i$ to each patch's feature vector $f_i$, reflecting its importance:

$F = \sum_{i \in P} \alpha_i f_i$

The attention scores $\alpha_i$ are computed as:

$\alpha_i = \frac{\exp(w^\top \tanh(V f_i))}{\sum_{k \in P} \exp(w^\top \tanh(V f_k))}$

where $w$ and $V$ are trainable parameters. This approach allows the model to dynamically focus on the most relevant patches, leading to more accurate MSI predictions.
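As an illustration of the ABMIL aggregation just described, the following is a minimal PyTorch sketch of attention pooling followed by a sigmoid slide-level classifier; the dimensions (e.g. 768-dimensional Phikon-style embeddings) and module layout are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    """Attention-based MIL pooling [6]: alpha_i = softmax_i(w^T tanh(V f_i)),
    F = sum_i alpha_i f_i, followed by a sigmoid slide-level classifier."""
    def __init__(self, feat_dim: int = 768, attn_dim: int = 128):
        super().__init__()
        self.V = nn.Linear(feat_dim, attn_dim, bias=False)
        self.w = nn.Linear(attn_dim, 1, bias=False)
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (num_patches, feat_dim) feature embeddings of one WSI
        scores = self.w(torch.tanh(self.V(bag)))              # (num_patches, 1)
        alpha = torch.softmax(scores, dim=0)                  # attention weights
        slide_feature = (alpha * bag).sum(dim=0)              # F = sum alpha_i f_i
        return torch.sigmoid(self.classifier(slide_feature))  # P(slide is MSI)

# Example: a bag of 3,000 hypothetical 768-dimensional patch embeddings.
prob_msi = ABMIL(feat_dim=768)(torch.randn(3000, 768))
```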
Another technique similar to attention is DSMIL [7], a dual-stream aggregator consisting of two branches, employing both an instance classifier and a bag classifier. Let $h_i \in \mathbb{R}^{L \times 1}$ be a feature embedding, and $B = \{h_0, \dots, h_{n-1}\}$ a bag of embeddings. The first stream uses an instance classifier, followed by a max-pooling operation, to obtain a score $c_m(B)$ and the critical embedding $h_m$. The second stream aggregates the embeddings into a single bag embedding, which is then passed through a bag classifier:

$c_b(B) = \sum_{i=0}^{n-1} W_b \, U(h_i, h_m) \, v_i$

where $W_b$ is a weight vector for classification, $v_i$ is an information vector, and $U$ is a distance measurement between an arbitrary embedding and the critical embedding:

$U(h_i, h_m) = \frac{\exp(\langle q_i, q_m \rangle)}{\sum_{k=0}^{n-1} \exp(\langle q_k, q_m \rangle)}$

where $q_i$ is a query vector. Both $q_i$ and $v_i$ are calculated by:

$q_i = W_q h_i, \quad v_i = W_v h_i, \quad i = 0, \dots, n-1$

where $W_q$ and $W_v$ are weight matrices. The final prediction is given by:

$c(B) = \frac{1}{2}\left(c_m(B) + c_b(B)\right)$

The last approach for feature aggregation reviewed in this paper is TransMIL, as proposed in [10]: a Transformer-based aggregation method which, unlike the aforementioned methods, also takes spatial information into account. By treating a bag of embeddings as a sequence of tokens, TransMIL uses a novel TPT module made up of two Transformer layers and a position encoding layer, where the Transformer layers are designed for aggregating morphological information and the Pyramid Position Encoding Generator (PPEG) encodes spatial information, followed by a multi-layer perceptron (MLP) which classifies the bag.

2.4 MSI Classification

The aggregation step produces a single feature vector F, which encapsulates the most informative characteristics of the entire slide. This aggregated feature vector F is then passed through one or more fully connected (dense) layers. These layers apply learned weights and biases to transform the features into a form that is more suitable for classification. The output of the fully connected layer is often passed through an activation function, such as a sigmoid or softmax, depending on whether the classification task is binary (microsatellite instability, MSI, vs. microsatellite stability, MSS) or multi-class. For MSI prediction, a sigmoid function is typically used, outputting a probability value between 0 and 1. The final output of the model is a single probability value indicating the likelihood of the slide being MSI. A threshold (e.g., 0.5) is applied to this probability to make a binary decision.
3 Data

For this paper, the MCO study [5] was used for training and testing. The MCO study collection contains 1,500 digitized whole slide images (WSIs) of colorectal cancer tissues. Conducted by the Molecular and Cellular Oncology (MCO) Study group from 1994 to 2010, this study systematically gathered tissue samples and clinical data from over 1,500 patients who underwent colorectal cancer surgery. Each slide, representing a typical tumor section, is stained with hematoxylin and eosin and scanned at a 40x objective, achieving a resolution of 0.25 mpp, comparable to an optical microscope (~100,000 dpi). The total data size is approximately 3 terabytes, and the collection is available on the Intersect Australia RDSI Node.

feature extractor | architecture | dataset | embedding size
UNI [1] | ViT-large, DINOv2, 16 heads | Mass-100k: in-house histopathology slides from MGH and BWH, and external slides from the GTEx consortium, containing >100M images derived from >100,000 WSIs across 20 major tissue types | 1024
ProvGigaPath [13] | ViT-large, DINOv2, 24 heads | Prov-Path: dataset from Providence, a large US health network comprising 28 cancer centres, consisting of 1.3B images from 171,189 WSIs | 1536
Phikon [2] | ViT-large, iBOT combining MIM and CL | PanCancer40M: dataset from TCGA, covering 13 anatomic sites and 16 cancer subtypes, consisting of 43.4M images from 6,093 WSIs | 768
CTransPath [12] | CNN with multi-scale Swin Transformer | dataset from TCGA and PAIP, consisting of 15M images from 32,220 WSIs | 768

Table 1: Technical details about the pretrained feature extraction models.

4 Results

The dataset used in this study comprised 996 whole slide images (WSIs), with 242 labeled as MSI and 754 as MSS. To evaluate the performance of the various aggregation methods, models were trained using 5-fold cross-validation, which ensured robust training and validation. To create a balanced testing set of 96 samples, 20% of the positive (MSI) samples and an equal number of negative (MSS) samples were randomly excluded. The remaining data was split into five equally balanced parts for cross-validation, with each fold consisting of 180 samples in the validation set and 720 samples in the training set.

WSIs were then preprocessed into bags, each containing approximately 2,000 to 4,000 patches.
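A sketch of the slide-level split described above, under the assumption that slides are indexed by integer IDs; the label array mirrors the 242/754 MSI/MSS counts, and the exact sampling details of the original study may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical labels: 1 = MSI, 0 = MSS, mirroring the 242/754 split above.
y = np.array([1] * 242 + [0] * 754)
slides = np.arange(len(y))

# Hold out a balanced test set: 20% of MSI slides plus as many MSS slides.
msi_train, msi_test = train_test_split(slides[y == 1], test_size=0.2, random_state=0)
mss_train, mss_test = train_test_split(slides[y == 0], test_size=len(msi_test), random_state=0)
test_idx = np.concatenate([msi_test, mss_test])

# Remaining slides -> 5 folds with the class ratio preserved in each fold.
rest = np.concatenate([msi_train, mss_train])
for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(rest, y[rest]):
    pass  # train on rest[tr], validate on rest[va]
```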
Each patch was then converted into feature embeddings using four different feature extraction methods: Phikon, CTransPath, ProvGigaPath, and UNI. Specifically, CTransPath and Phikon produced embeddings with 768 features, UNI with 1024 features, and ProvGigaPath with 1536 features.

Three feature aggregation methods—ABMIL, DSMIL, and TransMIL—were applied to the extracted features to generate a single representative feature for each WSI. Following aggregation, a simple neural network with a sigmoid activation function and a threshold of 0.5 was used to classify MSI and MSS.

Each aggregation model was then trained for each feature extraction method on each fold, with training conducted over 50 epochs using the AdamW optimiser and the 1-cycle learning rate scheduler to adjust the learning rate as the models approached convergence. Binary cross-entropy (BCE) was used as the loss function. After each epoch, model performance was evaluated on the validation set using the AUROC metric to select the best checkpoint, as most models tended to overfit toward the end of training. The selected checkpoints were then tested to calculate the mean AUROC across all folds.
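The training recipe described above (AdamW, 1-cycle learning-rate schedule, binary cross-entropy over 50 epochs) can be sketched as follows; the bags and the mean-pool-plus-linear classifier are hypothetical stand-ins for the aggregation models, kept deliberately simple.

```python
import torch

# Hypothetical bags: (patch embeddings, slide label) pairs, one per WSI.
train_bags = [(torch.randn(2000, 1536), torch.tensor([1.0])) for _ in range(4)]

model = torch.nn.Linear(1536, 1)  # stand-in for ABMIL/DSMIL/TransMIL + head
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-4, epochs=50, steps_per_epoch=len(train_bags))
loss_fn = torch.nn.BCEWithLogitsLoss()

for epoch in range(50):
    for bag, label in train_bags:
        opt.zero_grad()
        logit = model(bag.mean(dim=0))  # mean-pooled bag -> slide logit
        loss = loss_fn(logit, label)
        loss.backward()
        opt.step()
        sched.step()  # 1-cycle schedule advances once per batch
```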
Results are presented in Figure 2a. The best performance was achieved using the DSMIL aggregation method with the ProvGigaPath feature extractor, yielding an AUROC of 0.91 ± 0.01. The ABMIL method performed best with the Phikon and UNI extractors, achieving AUROCs of 0.91 ± 0.02. Finally, the TransMIL method combined with ProvGigaPath resulted in an AUROC of 0.90 ± 0.01. Additionally, statistical analysis was performed, specifically the Wilcoxon signed-rank test, which yielded an average p-value of 0.446, showing a relatively insignificant difference in the performance of the different feature extraction methods, as expected.

Figure 2: Predictive performance of 5-fold cross-validation of different feature extractors and aggregation methods for (a) ABMIL, (b) DSMIL, and (c) TransMIL. AUROC plots for prediction of MSI/MSS status. The true positive rate represents sensitivity and the false positive rate represents 1-specificity. The shaded areas represent the standard deviation (SD). The value in the lower right of each plot represents the mean AUROC ± SD.

5 Discussion and Conclusion

In this study, we explored the potential of MIL combined with SOTA pretrained models for predicting MSI in colorectal cancer. Our results indicate that the approach is highly effective, achieving an AUROC of 0.913 on the MCO dataset. This is a notable achievement, particularly when compared to previous studies, such as [4] and [11], which reported AUROCs of 0.92 and 0.85, respectively, on the same dataset. Our results not only validate the effectiveness of our approach but also suggest that the careful selection and combination of feature extraction and aggregation methods can yield improvements in predictive accuracy.

The positive and negative rates observed in our results reflect the model's ability to correctly classify MSI and MSS cases. A high true positive rate (sensitivity) indicates the model's proficiency in identifying MSI-positive cases, which is crucial for ensuring that patients who could benefit from MSI-targeted therapies are accurately identified. Conversely, a high true negative rate (specificity) shows the model's effectiveness in correctly classifying MSS cases, thereby minimising false positives.

To further enhance the accuracy and reliability of MSI prediction, several avenues for future work are planned.

Utilisation of the entire dataset: We plan to leverage the full dataset to improve the robustness of our model. Training on a larger dataset may help in capturing more nuanced patterns and variations, leading to even more accurate predictions.

Fine-tuning of pretrained models: While we used pretrained models without fine-tuning in this study, fine-tuning these models specifically for the task of MSI prediction could further improve their performance. Tailoring the models to our specific data distribution and task requirements may yield significant gains in accuracy.

Incorporation of a tissue classifier: Since MSI is typically found in tumor tissue, we plan to integrate a tissue classifier to automatically remove non-tumor tissue from the analysis. This step should enhance the model's focus on relevant tissue regions, potentially improving MSI prediction accuracy and speeding up the whole process.

Development of advanced aggregation methods: We also plan to explore more sophisticated aggregation techniques that can better capture the complex relationships between patches within a WSI. Advanced methods may help refine the prediction process, leading to further improvements in model performance.

Overall, our study demonstrates the potential of MIL-based approaches in enhancing biomarker prediction in colorectal cancer, paving the way for more personalized and effective treatment strategies.

References

[1] Richard J. Chen et al. 2024. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30, 3, 850–862.
[2] Alexandre Filiot, Ridouane Ghermi, Antoine Olivier, Paul Jacob, Lucas Fidon, Alice Mac Kain, Charlie Saillard, and Jean-Baptiste Schiratti. 2023. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv, 2023–07.
[3] Michael Gadermayr and Maximilian Tschuchnig. 2024. Multiple instance learning for digital pathology: a review of the state-of-the-art, limitations & future potential. Computerized Medical Imaging and Graphics, 102337.
[4] Bangwei Guo, Xingyu Li, Jitendra Jonnagaddala, Hong Zhang, and Xu Steven Xu. 2022. Predicting microsatellite instability and key biomarkers in colorectal cancer from H&E-stained images: achieving SOTA predictive performance with fewer data using Swin Transformer. arXiv preprint arXiv:2208.10495.
[5] Nick Hawkins. 2015. MCO study whole slide image collection.
[6] Maximilian Ilse, Jakub Tomczak, and Max Welling. 2018. Attention-based deep multiple instance learning. In International Conference on Machine Learning. PMLR, 2127–2136.
[7] Bin Li, Yin Li, and Kevin W. Eliceiri. 2021. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14318–14328.
[8] Oded Maron and Tomás Lozano-Pérez. 1997. A framework for multiple-instance learning. Advances in Neural Information Processing Systems, 10.
[9] Johnathan Pocock et al. 2022. TIAToolbox as an end-to-end library for advanced tissue image analytics. Communications Medicine, 2, 1, 120.
[10] Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, et al. 2021. TransMIL: transformer based correlated multiple instance learning for whole slide image classification. Advances in Neural Information Processing Systems, 34, 2136–2147.
[11] Sophia J. Wagner et al. 2023. Transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study. Cancer Cell, 41, 9, 1650–1661.
[12] Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Wei Yang, Junzhou Huang, and Xiao Han. 2022. Transformer-based unsupervised contrastive learning for histopathological image classification. Medical Image Analysis, 81, 102559.
[13] Hanwen Xu et al. 2024. A whole-slide foundation model for digital pathology from real-world data. Nature, 1–8.

Feature-Based Emotion Classification Using Eye-Tracking Data

Tomi Božak (tb85088@student.uni-lj.si), Jožef Stefan Institute, Ljubljana, Slovenia
Mitja Luštrek (mitja.lustrek@ijs.si), Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Gašper Slapničar (gasper.slapnicar@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

The field of emotion recognition from eye-tracking data is well-established and offers near-real-time insights into human affective states.
It is less obtrusive than some other modalities, such as electroencephalogram (EEG), electrocardiogram (ECG) and galvanic skin response (GSR), which are often used in emotion recognition tasks. This study examined the practical feasibility of emotion recognition using an eye-tracker with a lower frequency than that typically employed in similar research. Using ocular features, we explored the efficacy of classical machine learning (ML) models in classifying four emotions (anger, disgust, sadness, and tenderness) as well as neutral and "undefined" emotions. The features included gaze direction, pupil size, saccadic movements, fixations, and blink data. The data from the "emotional State Estimation based on Eye-tracking database" was preprocessed and segmented into various time windows, with 22 features extracted for model training. Feature importance analysis revealed that pupil size and fixation duration were most important for emotion classification. The efficacy of different window lengths (1 to 10 seconds) was evaluated using leave-one-subject-out (LOSO) and 10-fold cross-validation (CV). The results demonstrated that accuracies of up to 0.76 could be achieved with 10-fold CV when differentiating between positive, negative, and neutral emotions. The analysis of model performance across different window lengths revealed that longer time windows generally resulted in improved model performance. When the data was split using a marginally personalised 10-fold CV within video, the Random Forest classifier (RF) achieved an accuracy of 0.60 in differentiating between the six aforementioned emotions. Some challenges remain, particularly with regard to data granularity, model generalization across subjects, and the impact of downsampling on feature dynamics.

Keywords: eye-tracking, emotion recognition, machine learning

1 Introduction

Emotion recognition is a vibrant area of research, leveraging diverse data sources such as images [11], audio [16], and also ocular features like pupil dilation, gaze direction, blinks, and saccadic movements [3, 8, 12]. Such eye-related features provide valuable insights into emotional states, offering a less-invasive and real-time approach to understanding human affective responses. Most studies that tried to predict emotions from these eye-related features relied not only on eye-tracking data but also on EEG [8, 12]. We hypothesized that eye-tracking data is a valuable modality for multi-modal emotion recognition on its own, with potential applications in real-world scenarios like office work, driving, and psychological assessments, as well as in estimating well-being. Our motivation was to explore eye-tracker-based predictive models as an essential component in such practical applications.

The primary objective of our study was to validate existing findings on the performance of classical ML models for emotion classification from eye-tracking data, using models – Support Vector Machine (SVM) and k-Nearest Neighbors (KNN) – and features already explored in the literature [9, 15], as well as exploring classifiers not so frequently used in this field, such as RF and XGBoost (XGB). Additionally, we aimed to explore the potential of emotion recognition at the lower sampling frequencies available in most non-professional eye trackers. For this early feasibility study, we used an existing dataset collected with a wearable eye-tracker, but the findings could possibly be extended to high-quality unobtrusive contact-free trackers. Our research also focused on understanding the impact of individual features and window lengths on model performance.

2 Related Work

In the literature, various physiological signals have been employed for emotion recognition, with a particular focus on modalities such as EEG, GSR, and eye-tracking systems [1, 6, 9]. Researchers have explored both uni- and multi-modal approaches, finding that the integration of multiple modalities can significantly enhance emotion recognition accuracy. Lu et al. achieved 0.78 accuracy with eye-related features recorded with eye-tracking glasses, which are not contact-free but record at relatively low frequencies of 60 Hz or 120 Hz. They predicted positive, negative and neutral classes with SVM.
3 Methodology

3.1 Data

In our research, we used the "emotional State Estimation based on Eye-tracking database" (eSEEd) [13]. The eSEEd comprises data from 48 participants, each of whom watched 10 carefully selected videos intended to evoke specific emotional responses. After viewing each video, participants ranked their emotions – anger, disgust, sadness, and tenderness – on a scale from 0 to 10. Tenderness is not regarded as one of the basic emotions, but it has been widely utilized in emotion research in recent years [13]. Since the participants ranked all four emotions for every video, a labelling problem emerged when multiple emotions shared the highest score, leading in our case to "undefined" labels. In our study, emotions were mapped by applying a set of extraction rules in the following order: if the highest-ranked emotion is below four, the response is labelled as neutral; if multiple emotions share the highest rank, the label is undefined; otherwise, the emotion with the highest rank is chosen. The boundary of four was chosen because the original study on eSEEd constructed this rule and we adapted it from there [13]. Although the initial study design aimed for an even distribution of emotions, neutral responses dominate, representing about one-fourth of the labels (depending on window length).

3.1.1 Data Preprocessing. We preprocessed the data to make it more suitable for our research and to reduce its size. We wanted to study performance at a relatively low sampling frequency of 60 Hz, which is used by relatively affordable mid-tier eye-trackers like the Tobii Pro Spark. Firstly, features that were uninformative or could be misleading (e.g. the raw tracker signal and timestamps) were removed, and the following set of features was preserved: 2D screen coordinates of gaze points (for the standard deviation (std) of screen gaze coordinates), 3D coordinates of gaze points (exclusively for saccade calculations), pupil sizes (the a and b axes of the pupil ellipse), and eye IDs (each eye has its own pupil size features). Secondly, rows containing any NaN values were removed, as there were no large consecutive blocks of such rows and downsampling of the data was planned. Finally, we downsampled the data to 60 Hz, matching the sampling frequency of a mid-tier eye-tracker. However, we acknowledge that downsampling might lead to the loss of high-frequency information, which could be important for capturing subtle dynamics in gaze behaviour and pupil responses. This is particularly relevant considering that recent studies, such as those by Collins et al. [3] and the SEED project [4, 17], have utilized data collected at much higher frequencies to preserve these subtle dynamics. Therefore, while downsampling makes the data more relevant to our research question and more computationally manageable, it is important to keep the reduced temporal resolution in mind when discussing the results.
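The label-extraction rules described above can be stated compactly in code. The following is a minimal sketch, assuming the ratings of one participant for one video arrive as a dictionary of 0–10 self-reports; the data layout and the map_label helper are our assumptions, not the authors' actual pipeline.

```python
from typing import Dict

EMOTIONS = ["anger", "disgust", "sadness", "tenderness"]

def map_label(ratings: Dict[str, int]) -> str:
    """Apply the extraction rules from Section 3.1 in order:
    1) if the highest rating is below 4, the response is 'neutral';
    2) if several emotions tie for the highest rating, it is 'undefined';
    3) otherwise the top-rated emotion is the label."""
    top = max(ratings.values())
    if top < 4:
        return "neutral"
    winners = [e for e, r in ratings.items() if r == top]
    if len(winners) > 1:
        return "undefined"
    return winners[0]

# A tie between two negative emotions yields 'undefined':
print(map_label({"anger": 7, "disgust": 7, "sadness": 2, "tenderness": 0}))
```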
Following the preprocessing, window segmentation was applied to the data. This step is essential for analyzing temporal patterns, as it allows the capture of trends and behaviours over specific time intervals. By segmenting the data into windows, we can improve the robustness of feature extraction and model training, enabling the detection of meaningful patterns that might be obscured in raw, unsegmented data. Additionally, window segmentation increases the number of training instances, which is commonly better for learning more robust ML models and for rigorous evaluation. Hence, multiple window lengths were examined, namely 1, 3, 5 and 10 s, with a 50% sliding window overlap. From each window, we computed 22 features belonging to the following groups (a windowing sketch is given at the end of this subsection):

(1) gaze coordinates on screen: std of x and y coordinates
(2) pupil ellipse sizes a and b for each eye: mean, std
(3) blinks: number; mean and std of duration (all 0 if no blinks)
(4) saccades: number; mean speed; mean, std and total duration
(5) fixations: number; mean, std and total duration

Saccade and, implicitly, fixation calculations were done using existing code based on the algorithm proposed by Engbert et al. [5, 10]. The algorithm calculates the velocity and acceleration of eye movements and uses a velocity-threshold identification method to detect saccades from continuous 3D gaze data. In our study we define a fixation (interval) as the absence of a saccade (interval); thus one fixation is declared between every two saccades (and before the first and after the last one).

As mentioned previously, our data was imbalanced in terms of class distribution: the proportions for anger, disgust, sadness, neutral, tenderness and undefined were 8.7%, 13.6%, 17.5%, 25.7%, 15.8% and 18.7%, respectively. Notably, for the 1 s window length the number of windows was 67,181, whilst for the 10 s window length the number of instances decreased to 6,507.
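To make the windowing step concrete, here is a small sketch of 50%-overlap segmentation and a handful of the per-window statistics. It assumes a hypothetical per-recording DataFrame with columns such as gaze_x, gaze_y, pupil_a and pupil_b; the actual eSEEd column names and the full 22-feature computation (including blink, saccade and fixation statistics) are not shown.

```python
import pandas as pd

FS = 60  # Hz, the post-downsampling rate used in the paper

def sliding_windows(rec: pd.DataFrame, win_s: float, overlap: float = 0.5):
    """Yield fixed-length windows of win_s seconds with the given overlap."""
    size = int(win_s * FS)
    step = max(1, int(size * (1 - overlap)))
    for start in range(0, len(rec) - size + 1, step):
        yield rec.iloc[start:start + size]

def window_features(win: pd.DataFrame) -> dict:
    """A few of the 22 per-window features; event-based statistics
    (blinks, saccades, fixations) would be added analogously."""
    return {
        "gaze_x_std": win["gaze_x"].std(),
        "gaze_y_std": win["gaze_y"].std(),
        "pupil_a_mean": win["pupil_a"].mean(),
        "pupil_a_std": win["pupil_a"].std(),
        "pupil_b_mean": win["pupil_b"].mean(),
        "pupil_b_std": win["pupil_b"].std(),
    }

# features = pd.DataFrame(window_features(w) for w in sliding_windows(rec, 10))
```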
3.2 Experiments

We initially examined feature correlation matrices to identify potential correlations between features, as well as between features and the class. Then, we compared the following classifiers: Random Forest (RF), Support Vector Machine (SVM) and k-Nearest Neighbors (kNN) from the Scikit-learn library, XGBoost (XGB) from the XGBoost library, and a majority-vote ensemble of the aforementioned classifiers. We compared all results against a baseline majority classifier. Each model was trained and tested using its default hyperparameters. To evaluate the models' performance, we implemented multiple CV techniques.

The first CV technique was Leave-One-Subject-Out (LOSO). Secondly, we implemented a marginally personalised 10-fold CV "within video". In this approach, a standard 10-fold CV was performed where 90% of temporally sequential windows were used for training and 10% for testing; the splits were done separately for each video within every subject. All the training data from every video was combined to train a single model, and all the test data was combined to evaluate the model, ensuring that the model was exposed to data from all subjects and videos. We named the experiment "marginally" personalised because most training data does not come from any single subject and is thus not very personalised. Finally, we explored a completely personalised 10-fold CV "within subject", where training and testing were done only on the data of one subject. In all three CV methods, the instances were never shuffled, to preserve the temporal and subject ordering and to minimize overfitting.

We also merged certain classes, grouping the negative emotions – anger, disgust, and sadness – under the category "negative", while labelling tenderness as "positive". The neutral label remained unchanged, while the undefined label was changed to "negative" because it always resulted from multiple negative emotions scoring equally. Lastly, feature importances were analysed for different combinations of data splits and models in order to identify potentially consistently important features.
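The validation schemes can be sketched as follows. LOSO corresponds directly to Scikit-learn's LeaveOneGroupOut with subject IDs as groups; the within-video split below is one plausible reading of the description above (unshuffled, temporally contiguous folds per recording, later combined across recordings), not the authors' exact code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# LOSO: each fold holds out every window of one subject.
# X, y are window features/labels; subjects holds one subject ID per window.
logo = LeaveOneGroupOut()
# for train_idx, test_idx in logo.split(X, y, groups=subjects): ...

def within_video_folds(n_windows: int, n_folds: int = 10):
    """Unshuffled, temporally contiguous folds over the windows of a single
    (subject, video) recording: fold i tests on its contiguous 10%."""
    bounds = np.linspace(0, n_windows, n_folds + 1, dtype=int)
    for i in range(n_folds):
        test = np.arange(bounds[i], bounds[i + 1])
        train = np.setdiff1d(np.arange(n_windows), test)
        yield train, test

# Class merging used in the grouped experiments:
MERGE = {"anger": "negative", "disgust": "negative", "sadness": "negative",
         "undefined": "negative", "tenderness": "positive", "neutral": "neutral"}
```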
4 Results

The results described in the following subsections are summarised in Table 1.

Table 1: Best-performing models and their corresponding results, along with the results of the Majority Class Classifier for the same parameters. Window lengths are 10 s.

Settings                                                  Model Acc     Model F1      Majority Class Acc   Majority Class F1
LOSO, RF                                                  0.28 ± 0.13   0.28 ± 0.16   0.25 ± 0.25          0.15 ± 0.26
LOSO, SVM, negative emotions grouped                      0.59 ± 0.19   0.46 ± 0.18   0.59 ± 0.19          0.46 ± 0.18
10-fold within video, RF                                  0.60 ± 0.07   0.60 ± 0.08   0.21 ± 0.01          0.07 ± 0.01
10-fold within video, XGB, negative emotions grouped      0.76 ± 0.04   0.73 ± 0.04   0.66 ± 0.02          0.52 ± 0.02
10-fold within subject, RF                                0.38 ± 0.20   0.42 ± 0.19   0.33 ± 0.26          0.29 ± 0.26
10-fold within subject, SVM, negative emotions grouped    0.64 ± 0.13   0.63 ± 0.12   0.67 ± 0.16          0.61 ± 0.16

4.1 Feature Correlations

The first important observation from the correlation matrices was that no output class is closely correlated with any single feature. Secondly, we noticed some strong correlations between features; for example, a 1.0 correlation between the number of fixations and the number of saccades, because one simply equals the other increased by one. More importantly, we noticed little-to-no correlation between the features that proved most important in the best-performing models, meaning each of these features brought some novel information to the model. The only exception among important correlated features are those representing the mean pupil size, i.e. the ellipse a and b axes, which are expected to be correlated; their correlation exceeded 0.8. However, we decided not to remove any features, because we assessed the feature count of 22 to be well balanced in relation to the number of instances.
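A sketch of the kind of correlation screening described above, assuming the per-window feature table from the earlier sketch; the 0.8 threshold mirrors the pupil a/b observation, and the helper name is ours.

```python
import pandas as pd

def correlated_pairs(features: pd.DataFrame, threshold: float = 0.8):
    """Report feature pairs whose absolute Pearson correlation exceeds
    the threshold, e.g. the pupil-ellipse a/b means from Section 4.1."""
    corr = features.corr().abs()
    cols = list(corr.columns)
    return [(cols[i], cols[j], round(float(corr.iloc[i, j]), 2))
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]
```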
4.2 Leave-One-Subject-Out

With the goal of training a robust general model for our dataset, we first applied the LOSO CV technique. The best performance was achieved by RF on 10 s windows, yielding an accuracy of 0.28 ± 0.13 and an F1-score of 0.28 ± 0.16; it outperformed the majority classifier by 0.03 in accuracy and 0.13 in F1-score. In a subsequent experiment, the negative emotions were grouped. This adjustment led to an overall increase in performance; however, with such grouping the majority classifier score also increased, to 0.59 accuracy, which is the same as the best-performing model. Further analysis revealed that high accuracy mainly implied that the subject predominantly reported "neutral" feelings, and low accuracy implied little-to-no "neutral" labels. However, not every subject with a high "neutral" count achieved outstanding results, and not every subject with a wide range of emotions yielded poor results. We also compared the number of windows of the left-out subject with their performance and found no correlation. The 10 s window length performed better than the shorter windows of 1–5 s. We also tested longer (60 s) windows, and the resulting accuracies were higher than those from 10 s windows, but we judged the number of instances insufficient for the results to be representative.

4.3 Marginally Personalised 10-fold Cross-Validation Within Video

Given that LOSO yielded relatively poor results, the next step was to explore 10-fold CV. Experiments showed an average accuracy of 0.60 ± 0.07 and an F1-score of 0.60 ± 0.08, produced by RF on 10 s windows, the best-performing model. This should be compared to the results of the majority classifier: an average accuracy of 0.21 ± 0.01 and an F1-score of 0.07 ± 0.01. With negative emotions grouped, the accuracy and F1-score rose to 0.76 ± 0.04 and 0.73 ± 0.04, respectively, for the best-performing XGB on 10 s windows; the majority class classifier yielded an accuracy of 0.66 ± 0.02 and an F1-score of 0.52 ± 0.02.

4.4 Personalised 10-fold Cross-Validation

Even though the 10-fold CV within video resulted in much better performance compared to LOSO, we wanted to see the performance of completely personalised models. All the models performed similarly well, with the absolute best being RF on 10 s windows, which outperformed the majority classifier by 0.05 and 0.13 for accuracy and F1-score, respectively. When grouping the negative emotions, we observed an absolute improvement in the models' performance, but a relative decline against the majority classifier benchmark. The best model in this case did not surpass the majority classifier in terms of accuracy: the majority classifier achieved 0.67 ± 0.16 accuracy and 0.61 ± 0.16 F1-score, while SVM, the best-performing model, scored an accuracy of 0.64 ± 0.13 and an F1-score of 0.63 ± 0.12.

4.5 Feature Importances

Following model training, we analyzed the feature importances of the best-performing models. For RF, importance was calculated based on the Mean Decrease in Impurity, summing the impurity reduction each feature contributes across all trees; for XGB, feature importances were calculated using the "weight" metric, which counts the number of times each feature is used to split the data across all trees. For SVM we did not calculate feature importances. In the completely personalised 10-fold experiments, feature importances varied significantly across different subjects and even between different runs within the same subject, specifically with RF, as the random state was not fixed. In contrast, feature importances were notably consistent in experiments where models were trained on data from multiple subjects, such as LOSO and the 10-fold within video, even with a variable random state of the RF model. The most important features of the best-performing models were those related to average pupil sizes, followed by fixation duration. These results partially align with those of Collins et al., who found features relating to pupil diameter and saccades statistically significant [3].
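For reference, the two importance metrics named above can be read out of trained models as follows. This sketch uses random stand-in data rather than the actual window features, purely to show where the numbers come from.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 22))      # stand-in for the 22 window features
y = rng.integers(0, 6, size=200)    # stand-in for the six emotion labels

rf = RandomForestClassifier(random_state=0).fit(X, y)
# Mean Decrease in Impurity: impurity reduction summed over all trees.
mdi = rf.feature_importances_

xgb = XGBClassifier().fit(X, y)
# "weight" importance: how often each feature is used to split, across trees.
weight = xgb.get_booster().get_score(importance_type="weight")
```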
5 Conclusion

Our research explored emotion classification from eye-tracking data with classical ML models and hand-crafted features. The data was downsampled to a lower-than-standard frequency, i.e. 60 Hz, which is more realistic for consumer contact-free eye-tracker data. This made the problem harder and not directly comparable with other studies working on eSEEd, but valuable from a practical perspective.

Window segmentation significantly impacted model performance, with the best results consistently obtained using the largest window length. This suggests that longer observation periods capture more comprehensive information, making smaller windows less effective for emotion classification. We hypothesize that this does not transfer to realistic scenarios, as users might experience emotions in short bursts while being neutral for the majority of the time; it is more expected in specifically designed cases where an emotion is consistently induced for longer periods of time (like our dataset).

The LOSO validation strategy, which tests model generalization across different subjects, yielded poor results. The variability in performance across subjects indicates the challenge of capturing general relationships between eye features and emotions. While both 10-fold CV approaches showed an increase in performance, their generalizability is limited. The completely personalised 10-fold CV showed worse results than the marginally personalised one, presumably because of the low number of videos per emotion within an individual subject.

An important issue with the eSEEd data is that all participants watched the same 10 emotion-evoking videos in the exact same order. This uniformity raises concerns that, given the small number of videos (two intended per emotion)¹, the models might learn to associate features unrelated to emotions, such as video dynamics or illumination. We circumvented the problem with video dynamics by dropping the mean gaze coordinate features and not using them in our experiments.

Despite these challenges, our experiments offer valuable insights into the feasibility of emotion recognition from low-frequency eye-tracker data, providing a foundation for future work. We initially opted for classical models due to their explainability, lower computational complexity, and efficiency, which are in our opinion essential for understanding the data before transitioning to more complex deep learning models.

In future work, several enhancements could be explored to improve the robustness and accuracy of emotion classification models using eye-tracking data. One approach could involve analyzing distinct fixation areas as an additional feature, potentially offering deeper insights into visual attention patterns. Moreover, considering that each emotion is (in some cases) represented by two videos, a valuable experiment would be to train models on one video and test on the other; this could help assess the model's ability to generalize across different stimuli within the same emotional category.

Further analysis could focus on demographic factors by examining the LOSO results for potential correlations between model predictions and participant characteristics such as gender, age, and education; this might reveal underlying biases or trends that affect model performance. Additionally, rather than downsampling and removing rows with missing data, future work could explore retaining or imputing these rows. Furthermore, training neural networks on raw, non-downsampled data from multiple modalities is another promising direction, as other studies have already observed promising results with such approaches. Finally, we should address the issue of overlapping emotions, which could involve developing a multi-label output model, reflecting the real-world scenario where multiple emotions can be present simultaneously; this approach could also help reduce the number of undefined labels, increasing the amount of useful data.
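As an illustration of the proposed multi-label direction (our sketch, not part of the study), each emotion could become an independent binary target, so that a window may be flagged with several emotions at once; the data here is random stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 22))           # stand-in window features
# One binary target per emotion; a window may be positive for several at once.
Y = rng.integers(0, 2, size=(300, 4))    # anger, disgust, sadness, tenderness

clf = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(X, Y)
print(clf.predict(X[:3]))                # shape (3, 4): one flag per emotion
```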
Acknowledgements

This work was supported by the bilateral Weave project, funded by the Slovenian Research and Innovation Agency (ARIS) under grant agreement N1-0319 and by the Swiss National Science Foundation (SNSF) under grant agreement 214991. The authors acknowledge the use of OpenAI's ChatGPT for generating text suggestions during the preparation of this paper. All generated content has been reviewed and edited by the authors to ensure accuracy and relevance to the research.

¹ The average percentage of the videos for which the participants reported the target emotion (also known as the "hit rate") was 71.8% [13].

References
[1] Zeeshan Ahmad and Naimul Khan. 2022. A survey on physiological signal-based emotion recognition. Bioengineering, 9, 11, 688. https://www.mdpi.com/2306-5354/9/11/688.
[2] Aracena Claudio, Basterrech Sebastián, Snáel Václav, and Velásquez Juan. 2015. Neural networks for emotion recognition based on eye tracking data. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, 2632–2637. doi: 10.1109/smc.2015.460.
[3] Mackenzie L. Collins and T. Claire Davies. 2023. Emotion differentiation through features of eye-tracking and pupil diameter for monitoring well-being. In 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). doi: 10.1109/embc40787.2023.10340178.
[4] Ruo-Nan Duan, Jia-Yi Zhu, and Bao-Liang Lu. 2013. Differential entropy feature for EEG-based emotion classification. In 6th International IEEE/EMBS Conference on Neural Engineering (NER). IEEE, 81–84.
[5] Ralf Engbert, Lars Rothkegel, Daniel Backhaus, and Hans A. Trukenbrod. 2016. Evaluation of velocity-based saccade detection in the SMI-ETG 2W system. Technical report, Allgemeine und Biologische Psychologie, Universität Potsdam.
[6] Atefeh Goshvarpour, Ataollah Abbasi, and Ateke Goshvarpour. 2017. An accurate emotion recognition system using ECG and GSR signals and matching pursuit method. Biomedical Journal. doi: 10.1016/j.bj.2017.11.001.
[7] Jiang-Jian Guo, Rong Zhou, Li-Ming Zhao, and Bao-Liang Lu. 2019. Multi-modal emotion recognition from eye image, eye movement and EEG using deep neural networks. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 3071–3074. doi: 10.1109/embc.2019.8856563.
[8] Robert Jenke, Angelika Peer, and Martin Buss. 2014. Feature extraction and selection for emotion recognition from EEG. IEEE Transactions on Affective Computing, 5, 3, 327–339. doi: 10.1109/taffc.2014.2339834.
[9] Lim Jia Zheng, Mountstephens James, and Jason Teo. 2020. Emotion recognition using eye-tracking: taxonomy, review and current challenges. Sensors, 20, 8, 2384. https://doi.org/10.3390/s20082384.
[10] Fjorda Kazazi. 2022. Detect saccades and saccade mean velocity in Python from data collected in Pupil Labs eye tracker. Accessed: 25. 7. 2024. https://www.fjordakazazi.com/detect_saccades.
[11] Yousif Khaireddin and Zhuofa Chen. 2021. Facial emotion recognition: state of the art performance on FER2013. arXiv preprint arXiv:2105.03588. doi: 10.48550/arXiv.2105.03588.
[12] Yifei Lu, Wei-Long Zheng, Binbin Li, and Bao-Liang Lu. 2015. Combining eye movements and EEG to enhance emotion recognition. In IJCAI. Vol. 15. Buenos Aires, 1170–1176.
[13] Vasileios Skaramagkas and Emmanouil Ktistakis. 2023. eSEE-d: emotional state estimation based on eye-tracking dataset. Brain Sciences, 13, 4. doi: 10.3390/brainsci13040589.
[14] Mohammad Soleymani, Maja Pantic, and Thierry Pun. 2012. Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing, 3, 2, 211–223. doi: 10.1109/t-affc.2011.37.
[15] Paweł Tarnowski, Marcin Kołodziej, Andrzej Majkowski, and Remigiusz Jan Rak. 2020. Eye-tracking analysis for emotion recognition. Computational Intelligence and Neuroscience. https://onlinelibrary.wiley.com/doi/10.1155/2020/2909267.
[16] Shiqing Zhang, Shiliang Zhang, Tiejun Huang, Wen Gao, and Qi Tian. 2018. Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology, 28, 10, 3030–3043. doi: 10.1109/tcsvt.2017.2719043.
[17] Wei-Long Zheng and Bao-Liang Lu. 2015. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development, 7, 3, 162–175. doi: 10.1109/TAMD.2015.2431497.
Indeks avtorjev / Author index

Andova Andrejaana 51
Anžur Zoja 31
Avdić Elma 15
Bengeri Katja 11
Bohanec Marko 59
Božak Tomi 83
Cigoj Primož 7
Cork Jordan 51
Đoković Lazar 19
Džeroski Sašo 67
Filipič Bogdan 51
Gams Matjaž 27
Gašparič Lea 67
Gjoreski Hristijan 35
Gjoreski Martin 35
Gradišek Anton 23, 79
Hafner Miha 59
Halbwachs Helena 23
Jelenc Matej 79
Jonnagaddala Jitendra 79
Jordan Marko 71
Kalin Jan 27
Kizhevska Emilija 75
Kokalj Anton 67
Kolar Žiga 27
Konečnik Martin 27
Kramar Sebastjan 35, 71
Krstevska Ana 35
Kukar Matjaž 39
Kulauzović Bajko 27
Kuzman Taja 7
Lukan Junoš 11, 35
Luštrek Mitja 11, 31, 35, 71, 75, 83
Mehanović Dželila 15
Nedić Mila 63
Pavleska Tanja 7
Pejanovič Nosaka Tomo 27
Piciga Aleksander 39
Poljak Lukek Saša 47
Prestor Domen 27
Ratajec Mariša 23
Reščič Nina 71
Robnik-Šikonja Marko 19
Rupnik Urban 7
Sadikov Aleksander 43
Shulajkovska Miljana 79
Skobir Matjaž 27
Slapničar Gašper 31, 35, 83
Smerkol Maj 23
Šoln Kristjan 55
Susič David 27
Susič Rok 23
Trojer Sebastijan 31, 35
Tušar Tea 51, 63
Vladić Ervin 15
Založnik Marcel 55, 71
Zirkelbach Maj 43

Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence
Uredniki / Editors: Mitja Luštrek, Matjaž Gams, Rok Piltaver