Zbornik 27. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2024, Zvezek C
Proceedings of the 27th International Multiconference INFORMATION SOCIETY – IS 2024, Volume C

Odkrivanje znanja in podatkovna skladišča - SiKDD
Data Mining and Data Warehouses - SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si
7. oktober 2024 / 7 October 2024
Ljubljana, Slovenia

Urednika:
Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2024

Informacijska družba, ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 214428163
ISBN 978-961-264-301-0 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2024

Leto 2024 je hkrati udarno in tradicionalno. Že sedaj, še bolj pa v prihodnosti bosta računalništvo, informatika (RI) in umetna inteligenca (UI) igrali ključno vlogo pri oblikovanju napredne in trajnostne družbe. Smo na pragu nove dobe, v kateri generativna umetna inteligenca, kot je ChatGPT, in drugi inovativni pristopi utirajo pot k superinteligenci in singularnosti, ključnim elementom, ki bodo definirali razcvet človeške civilizacije. Naša konferenca je zato hkrati tradicionalna znanstvena, pa tudi povsem akademsko odprta za nove pogumne ideje, inkubator novih pogledov in idej.

Letošnja konferenca ne le da analizira področja RI, temveč prinaša tudi osrednje razprave o perečih temah današnjega časa – ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za skoraj vse izzive, s katerimi se soočamo, kar poudarja pomen sodelovanja med strokovnjaki, raziskovalci in odločevalci, da bi skupaj oblikovali strategije za prihodnost. Zavedamo se, da živimo v času velikih sprememb, kjer je ključno, da s poglobljenim znanjem in inovativnimi pristopi oblikujemo informacijsko družbo, ki bo varna, vključujoča in trajnostna.

Letos smo ponosni, da smo v okviru multikonference združili dvanajst izjemnih konferenc, ki odražajo širino in globino informacijskih ved: CHATMED v zdravstvu, Demografske in družinske analize, Digitalna preobrazba zdravstvene nege, Digitalna vključenost v informacijski družbi – DIGIN 2024, Kognitivna znanost, Konferenca o zdravi dolgoživosti, Legende računalništva in informatike, Mednarodna konferenca o prenosu tehnologij, Miti in resnice o varovanju okolja, Odkrivanje znanja in podatkovna skladišča – SIKDD 2024, Slovenska konferenca o umetni inteligenci, Vzgoja in izobraževanje v RI. Poleg referatov bodo razprave na okroglih mizah in delavnicah omogočile poglobljeno izmenjavo mnenj, ki bo oblikovala prihodnjo informacijsko družbo. »Legende računalništva in informatike« predstavljajo slovenski »Hall of Fame« za odlične posameznike s tega področja. Razširjeni referati, objavljeni v reviji Informatica z 48-letno tradicijo odličnosti, in sodelovanje s številnimi akademskimi institucijami in združenji, kot so ACM Slovenija, SLAIS in Inženirska akademija Slovenije, bodo še naprej spodbujali razvoj informacijske družbe. Skupaj bomo gradili temelje za prihodnost, ki bo oblikovana s tehnologijami, osredotočena na človeka in njegove potrebe.
S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna RI stroka vsakoletno opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Borut Žalik. Priznanje za dosežek leta pripada prof. dr. Sašu Džeroskemu za izjemne raziskovalne dosežke. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela nabava in razdeljevanje osebnih računalnikov ministrstva, »informacijsko jagodo« kot najboljšo potezo pa so prejeli organizatorji tekmovanja ACM Slovenija. Čestitke nagrajencem!

Naša vizija je jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki bo koristila vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek k tej viziji in se veselimo prihodnjih dosežkov, ki jih bo oblikovala ta konferenca.

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

PREFACE TO THE MULTICONFERENCE INFORMATION SOCIETY 2024

The year 2024 is both ground-breaking and traditional. Now, and even more so in the future, computer science, informatics (CS/I), and artificial intelligence (AI) will play a crucial role in shaping an advanced and sustainable society. We are on the brink of a new era where generative artificial intelligence, such as ChatGPT, and other innovative approaches are paving the way for superintelligence and singularity—key elements that will define the flourishing of human civilization. Our conference is therefore both a traditional scientific gathering and an academically open incubator for bold new ideas and perspectives.

This year's conference analyzes key CS/I areas and brings forward central discussions on pressing contemporary issues—environmental preservation, demographic challenges, healthcare, and the transformation of social structures. AI development offers solutions to nearly all challenges we face, emphasizing the importance of collaboration between experts, researchers, and policymakers to shape future strategies collectively. We recognize that we live in times of significant change, where it is crucial to build an information society that is safe, inclusive, and sustainable, through deep knowledge and innovative approaches.

This year, we are proud to have brought together twelve exceptional conferences within the multiconference framework, reflecting the breadth and depth of information sciences:
• CHATMED in Healthcare
• Demographic and Family Analyses
• Digital Transformation of Healthcare Nursing
• Digital Inclusion in the Information Society – DIGIN 2024
• Cognitive Science
• Conference on Healthy Longevity
• Legends of Computer Science and Informatics
• International Conference on Technology Transfer
• Myths and Facts on Environmental Protection
• Data Mining and Data Warehouses – SIKDD 2024
• Slovenian Conference on Artificial Intelligence
• Education and Training in CS/IS

In addition to papers, roundtable discussions and workshops will facilitate in-depth exchanges that will help shape the future information society. The “Legends of Computer Science and Informatics” represents Slovenia’s “Hall of Fame” for outstanding individuals in this field.
At the same time, extended papers published in the Informatica journal, with over 48 years of excellence, and collaboration with numerous academic institutions and associations, such as ACM Slovenia, SLAIS, and the Slovenian Academy of Engineering, will continue to foster the development of the information society. Together, we will build the foundation for a future shaped by technology, yet focused on human needs.

The autonomous CS/IS community annually recognizes the most outstanding achievements through the awards ceremony. The Michie-Turing Award for an exceptional lifetime contribution to the development and promotion of the information society was awarded to Prof. Dr. Borut Žalik. The Achievement of the Year Award goes to Prof. Dr. Sašo Džeroski. The "Information Lemon" for the least appropriate information topic was given to the ministry's procurement and distribution of personal computers. At the same time, the "Information Strawberry" for the best initiative was awarded to the organizers of the ACM Slovenia competition. Congratulations to all the award winners!

Our vision is clear: to recognize, seize, and shape the opportunities brought by digital transformation and create an information society that benefits all its members. We thank all participants for their contributions and look forward to this conference's future achievements.

Mojca Ciglarič, Chair of the Program Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee: Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič

Programme Committee: Mojca Ciglarič, chair; Marjan Heričko; Baldomir Zajc; Bojan Orel; Borka Jerman Blažič Džonova; Blaž Zupan; Franc Solina; Gorazd Kandus; Boris Žemva; Viljan Mahnič; Urban Kordeš; Leon Žlajpah; Cene Bavec; Marjan Krisper; Niko Zimic; Tomaž Kalin; Andrej Kuščer; Rok Piltaver; Jozsef Györkös; Jadran Lenarčič; Toma Strle; Tadej Bajd; Borut Likar; Tine Kolenik; Jaroslav Berce; Janez Malačič; Franci Pivec; Mojca Bernik; Olga Markič; Uroš Rajkovič; Marko Bohanec; Dunja Mladenič; Borut Batagelj; Ivan Bratko; Franc Novak; Tomaž Ogrin; Andrej Brodnik; Vladislav Rajkovič; Aleš Ude; Dušan Caf; Grega Repovš; Bojan Blažica; Saša Divjak; Ivan Rozman; Matjaž Kljun; Tomaž Erjavec; Niko Schlamberger; Robert Blatnik; Bogdan Filipič; Stanko Strmčnik; Erik Dovgan; Andrej Gams; Jurij Šilc; Špela Stres; Matjaž Gams; Jurij Tasič; Anton Gradišek; Mitja Luštrek; Denis Trček; Marko Grobelnik; Andrej Ule; Nikola Guid; Boštjan Vilfan

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD
PREDGOVOR / FOREWORD
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES
Integrating Knowledge Graphs and Large Language Models for Querying in an Industrial Environment / Kenda Klemen, Hočevar Domen
Comparative Analysis of Machine Learning Models for Groundwater Level Forecasting: The Impact of Contextual Data / Klančič Rok, Kenda Klemen
Interactive Tool for Tracking Open-source Artificial Intelligence Progress on Hugging Face / Šinik Bogdan, Vake Domen, Vičić Jernej, Tošić Aleksander
Multilingual Hate Speech Modeling by Leveraging Inter-Annotator Disagreement / Grigor Patricia-Carla, Kralj Novak Petra, Evkoski Bojan
Predicting Pronunciation Types in the Sloleks Morphological Lexicon of Slovene / Čibej Jaka
Higher-order bibliographic services based on bibliographic networks / Batagelj Vladimir, Pisanski Jan, Pisanski Tomaž
Are papers all that counts? A bibliometric analysis of the Slovenian scientific community / Dupuis Aymeric, Džeroski Sašo, Koloski Boshko, Martinc Matej
Empowering Open Education Methodologies with AI-based Strategies for the Customization of Education / Amiel Tel, Mores Neto Antonio J., Pita Costa Joao, Polajnar Anja, Jermol Mitja
Addressing Water Sustainability Challenges in North Africa with Artificial Intelligence / Zaouini Mustafa, Pita Costa Joao, Cherakaoui Manal, Hachimi Hanaa, Abkari M. Wahib, Gourari Kamal, Lachheb Hatim, Tounsi El Azzoiani Jad
Predicting poverty using regression / Urbanč Luka, Grobelnik Marko, Pita Costa Joao
Fact Manipulation in News: LLM-Driven Synthesis and Evaluation of Fake News Annotation / Golob Luka, Sittar Abdul
Borrowing Words: Transfer Learning for Reported Speech Detection in Slovenian News Texts / Fijavž Zoran
Connecting company performance to ESG terms in financial reports / Andrenšek Luka, Sitar Šuštar Katarina, Pollak Senja, Purver Matthew
Classification of Patents Into Knowledge Fields: Using a Proposed Knowledge Mapping Taxonomy (KnowMap) / Motamedi Elham, Novalija Inna, Rei Luis
Enhancing causal graphs with domain knowledge: matching ontology concepts between ontologies and raw text data / Stegnar Jernej, Rožanec Jože M., Leban Gregor, Mladenić Dunja
Measuring and Modeling CO2 Emissions in Machine Learning Processes / Hrib Ivo, Šturm Jan, Topal Oleksandra, Škrjanc Maja
Enhancing Ontology Engineering with LLMs: From Search to Active Learning Extensions / Kholmska Ganna, Kenda Klemen, Rožanec Jože M.
On the Brazilian Observatory for Artificial Intelligence / Meira Silva Rafael, Godoy Oliveira Cristina, Costa Luiz, Candia Vieira Joao Paulo, Pita Costa Joao
Pojavljanje incidentov ob uporabi Umetne Inteligence / Grobelnik Marko, Massri M. Besher, Guček Alenka, Mladenić Dunja
Perception of AI in Slovenia / Sittar Abdul, Guček Alenka, Mladenić Dunja
Naslov / Šker Tesia, Rožanec Jože M., Leban Gregor, Mladenić Dunja
Generating Non-English Synthetic Medical Data Sets / Dolinar Lenart, Calcina Erik, Novak Erik
LLNewsBias: A Multilingual News Dataset for Lifelong Learning / Swati, Mladenić Dunja
Creating Local World Models using LLMs / Longar Mark David, Novak Erik, Grobelnik Marko
Semantic video content search and recommendation / Longar Mark David, Fir Jakob, Pangeršič Bor
Continuous Planning of a Fleet of Shuttle Vans as Support for Dynamic Pricing / Stavrov Filip, Stopar Luka
Knowledge graph Extraction from Textual data using LLM / Gilliani Khasa, Novak Erik, Kenda Klemen, Mladenić Dunja
Solving hard optimization problems of packing, covering, and tiling via clique search / Szabo Sandor, Zavalnij Bogdan

Indeks avtorjev / Author index

Odkrivanje znanja in podatkovna skladišča - SiKDD
Data Mining and Data Warehouses - SiKDD

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami in velikimi količinami podatkov; prišlo je do standardizacije procesov in povpraševalnih jezikov. Ko shranjevanje podatkov ni bilo več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke.
Pri avtomatski analizi podatkov sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (knowledge discovery and data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca SiKDD pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

Dunja Mladenić in Marko Grobelnik

FOREWORD

Data-driven technologies have significantly progressed. The first phases were mainly focused on storing and efficiently accessing the data, which resulted in the development of industry tools for managing large databases, related standards, supporting querying languages, etc. After the initial period, when data storage was no longer a primary problem, the development progressed towards analytical functionalities on how to extract added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data.

In automatic data analysis, the system itself tells what might be interesting for the user - this is brought about by knowledge discovery and data mining techniques, which try to obtain new knowledge from existing data and thus provide the user with a new understanding of the events covered in the data. The Slovenian KDD conference SiKDD covers topics dealing with data analysis and discovering knowledge in data: approaches, tools, problems and solutions.

Dunja Mladenić and Marko Grobelnik

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Jožef Stefan Institute, Ljubljana
Alenka Guček, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Dunja Mladenić, Jožef Stefan Institute, Ljubljana
Erik Novak, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Joao Pita Costa, Quintelligence, Ljubljana
Lui Rei, Event Registry, Ljubljana
Jože Rožanec, Jožef Stefan Institute, Ljubljana
Abdul Sitar, Jožef Stefan Institute, Ljubljana
Luka Stopar, SolvesAll, Ljubljana
Swati Swati, Bundeswehr University Munich, Munich
Jan Šturm, Jožef Stefan Institute, Ljubljana
Oleksandra Topal, Jožef Stefan Institute, Ljubljana

Integrating Knowledge Graphs and Large Language Models for Querying in an Industrial Environment

Domen Hočevar (domenhocevar1@gmail.com), Jožef Stefan Institute, Ljubljana, Slovenia
Klemen Kenda (klemen.kenda@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
DOI: https://doi.org/10.70314/is.2024.sikdd.5

Abstract
Knowledge graphs have traditionally required the use of specific query languages, such as SPARQL, to retrieve relevant data. In this paper, we present a system capable of performing natural language queries on knowledge graphs by leveraging retrieval-augmented generation (RAG) and large language models (LLMs). Our system can ingest large knowledge graphs and answer queries using two approaches: first, by utilizing LLMs to extract information directly from subgraphs; and second, by generating SPARQL queries with LLMs and using the results to inform further inference, such as counting the number of items.

Keywords: knowledge graph, semantic inference, Industry 4.0, LLM, RAG
1 Introduction

In the context of Industry 4.0, knowledge graphs play a crucial role in mapping and describing the entire production vertical, from supply and demand dynamics to intricate details within the production process. This includes the configuration of shop floors, production lines, machines, and data setups, extending even to specific datasets generated during operations. Knowledge graphs can also include relevant information about the tools required for particular processes, as well as details about personnel, including their skills and roles.

A key standard for representing such data within the Industry 4.0 initiative is the Asset Administration Shell (AAS) [3], which provides a logical representation for a factory asset (which can also be a piece of software, etc.). By adopting AAS, industries can ensure interoperability and standardization, enabling more efficient data exchange and integration across various systems, ultimately enhancing the agility and responsiveness of manufacturing processes.

Querying knowledge graphs can be a challenging task for end users, as it often requires expertise in specialized query languages such as SPARQL [8] — a skill that is not widely known among non-experts. Working with SPARQL SELECT queries remains a challenge also for LLMs, with performance varying significantly depending on the specific model and task complexity. While the leading LLMs can reliably address basic syntax errors, generating semantically accurate SPARQL SELECT queries remains difficult in many cases [10]. Similar work has been done on interaction with databases; however, even with SQL query generation the results of GPT-4 are still far behind human ability (approx. 55% execution accuracy) [9].

To overcome these challenges, we propose a system that enables users to interact with knowledge graphs through natural language queries. The system leverages LLMs' capabilities to interpret knowledge graphs while compensating for their limited ability to generate fully syntactically and semantically correct SPARQL queries. The proposed system, depicted in Figure 1, leverages large language models (LLMs) [11] to process natural language inputs and provide responses in natural language. Our approach integrates retrieval-augmented generation (RAG) techniques alongside the automatic generation of SPARQL queries based on natural language input [2].

Figure 1: Intended usage of the system: AAS instances are converted into a knowledge graph, enabling natural language queries by the user.

By doing so, our system not only simplifies the querying process but also ensures that the responses are accurate and contextually relevant, making knowledge graphs more accessible and usable for a broader range of users. Additionally, the use of LLMs in combination with SPARQL querying enables the system to handle complex tasks, including those that require logical reasoning, aggregation, or interpretation of data, thus enhancing its utility in real-world applications. For example, our system is able to answer queries such as: “Give me all machines that are capable of drilling a hole with 2cm perimeter”.

Finally, question answering with the help of knowledge graphs and language models has been tackled before [16]; however, the development of retrieval-augmented generation (RAG) systems has seen significant growth recently. In 2024, several preprints have emerged showcasing the application of the RAG approach to knowledge graphs [12, 13, 14]. This paper contributes to this rapidly evolving field by presenting our own advancements and findings.
2 Data

This study uses a generated dataset representing a hypothetical factory with various machine models, designed to test the capabilities of the developed application. The work is part of the Smart Manufacturing pilot in the EU-funded HumAIne project [7], with the aim of eventually using real-world data from participating factories.

The mock factory includes models of "drillers", "circle cutters", and "circular saws", each with unique names, manufacturers, and descriptions. These models are represented using AASs with relevant submodels for energy consumption, manufacturer details, and operation-specific parameters like hole diameter or depth of cut. We created AASs for 7 drilling machine models, 7 circle cutter models, and 10 circular saw models, along with 1,000 machine instances randomly assigned to these models. Numerical values and availability were populated randomly for testing, reflecting potential real-world variations.

The initial step after acquiring AAS data is to convert it into a knowledge graph. This process involves transforming JSON-serialized AASs into RDF triples, which represent the semantic information of the data. Once the RDF triples are generated, they are stored in a GraphDB repository (https://graphdb.ontotext.com/). To enable semantic data retrieval, we employ a connector that interfaces with the ChatGPT Retrieval Plugin (https://github.com/openai/chatgpt-retrieval-plugin), which operates alongside the server application.

When new triples are added to the GraphDB repository, the connector triggers the plugin to generate vector embeddings of the text representations of the new nodes. These embeddings are created using a language model and are stored in a separate vector database. The ChatGPT Retrieval Plugin supports a selection of different vector databases; in our case, we employed the Milvus vector database. The system is also designed to maintain consistency: if any triples are removed from the GraphDB repository, the corresponding vector embeddings are automatically deleted from the vector database.
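To make the conversion step more concrete, the following is a minimal sketch (not the project's actual converter) of turning a simplified, hypothetical JSON-serialized AAS fragment into RDF triples with the rdflib library; the namespace, property names, and JSON layout are illustrative assumptions rather than the real AAS schema.

```python
import json
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical, heavily simplified AAS fragment; real AAS JSON is far richer.
aas_json = """{
  "idShort": "Driller_M1",
  "submodels": [
    {"idShort": "TechnicalData", "properties": {"holeDiameterMm": 20, "voltage": 4}}
  ]
}"""

EX = Namespace("http://example.org/factory#")  # illustrative namespace
g = Graph()
g.bind("ex", EX)

aas = json.loads(aas_json)
asset = EX[aas["idShort"]]
g.add((asset, RDF.type, EX.Asset))
g.add((asset, RDFS.label, Literal(aas["idShort"])))

for sm in aas.get("submodels", []):
    # One node per submodel, linked to the asset; properties become literals.
    sm_node = EX[f'{aas["idShort"]}_{sm["idShort"]}']
    g.add((asset, EX.hasSubmodel, sm_node))
    for prop, value in sm.get("properties", {}).items():
        g.add((sm_node, EX[prop], Literal(value)))

# Turtle output of this kind can then be loaded into a GraphDB repository.
print(g.serialize(format="turtle"))
```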
3 Methodology

The system architecture is illustrated in Figure 2. The user interacts with the system through a client application, developed using ReactJS, which serves as the graphical user interface (GUI). This client application communicates with the system's middleware, which is built on the Flask framework. Users have the capability to upload AAS data to construct and enhance the knowledge graph, as well as to issue natural language queries.

The middleware acts as the core of the system, facilitating communication between the client application, the knowledge graph stored in a GraphDB database, and OpenAI's GPT models. The AAS data uploaded by the user is first converted into RDF triples and then stored in the GraphDB repository. The Flask-based middleware also integrates with the ChatGPT Retrieval Plugin, which is responsible for generating vector embeddings of the knowledge graph nodes using OpenAI's text-embedding-ada-002 model. These vector embeddings are stored in the Milvus vector database [15]. The ChatGPT Retrieval Plugin allows the system to efficiently retrieve the most relevant embeddings in response to user queries, ensuring that the system can provide accurate and contextually appropriate answers. Additionally, the middleware leverages LlamaIndex (https://www.llamaindex.ai/) to manage sub-graph retrieval and query generation, which are essential for responding to complex queries by the user.

Figure 2: System architecture for retrieval augmented generation with knowledge graphs in Industry 4.0.

In summary, the architecture is designed to streamline the process of building a knowledge graph from AAS data and enables users to query this graph with retrieval-augmented generation (RAG) using natural language, with the system handling the complexities of data storage, retrieval, and natural language processing in the background.

The sequence diagram in Figure 3 illustrates the interaction between system components during query processing. Our system enables two distinct approaches to handle natural language queries, often combining both to generate a comprehensive answer for the user.

Figure 3: Sequence diagram of the different approaches for data extraction. The blue box represents the RAG approach and the red box represents the SPARQL query generation approach. Note that the RAG approach utilizes results from SPARQL queries on the knowledge graph.

The first approach utilizes a Retrieval-Augmented Generation (RAG) method. Upon receiving a query, the system analyzes the query to identify relevant concepts and generates vector embeddings for these concepts [5]. These embeddings are then matched against the knowledge graph stored in GraphDB to find the most relevant nodes. Once the relevant nodes are identified, a naive neighborhood expansion is performed, capturing additional related nodes to ensure a more complete context. The search is parameterized using the following parameters: scope, how many nodes from the graph to retrieve; breadth, from how many relevant nodes to start the neighborhood expansion; and score weight, how many more nodes are visited from the identified relevant nodes that are deemed more relevant using embedding similarity. This subgraph, along with a few examples for context, is then fed into the Large Language Model (LLM) using a few-shot [1] learning technique to generate a response [4]. The LlamaIndex framework provides a general context query for turning triples into natural language. This method is particularly effective for queries requiring contextual understanding and extraction of complex information from the knowledge graph.
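The naive neighborhood expansion can be pictured as a bounded breadth-first traversal. The sketch below is our illustrative reading of the scope and breadth parameters described above, run over an in-memory adjacency list rather than a live GraphDB instance; the function and the toy graph are hypothetical, and the score-weight prioritization is only indicated in a comment.

```python
from collections import deque

def expand_neighborhood(adjacency, seed_nodes, scope=100, breadth=1000):
    """Bounded BFS: start from up to `breadth` seed nodes (assumed to be
    ranked by embedding similarity) and collect at most `scope` nodes in
    total. A score weight, as in the paper, could additionally bias the
    traversal toward neighborhoods of higher-similarity seeds."""
    subgraph = set()
    queue = deque(seed_nodes[:breadth])
    while queue and len(subgraph) < scope:
        node = queue.popleft()
        if node in subgraph:
            continue
        subgraph.add(node)
        queue.extend(adjacency.get(node, []))
    return subgraph

# Toy knowledge graph: node -> neighbors.
adjacency = {
    "Driller_M1": ["TechnicalData_M1", "Manufacturer_A"],
    "TechnicalData_M1": ["holeDiameter_20mm"],
    "Manufacturer_A": [],
    "holeDiameter_20mm": [],
}
print(expand_neighborhood(adjacency, ["Driller_M1"], scope=3))
```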
The second approach involves generating a SPARQL query based on the natural language query and the ontology used within the knowledge graph. The system attempts to execute this SPARQL query in the GraphDB database. If the query runs successfully, the resulting data is passed to the LLM to formulate the final answer. This approach is especially beneficial for tasks that involve counting instances or performing specific data aggregation operations, where LLMs alone might struggle. This approach benefits from the first approach, as it can use it as a backup or to enrich the SPARQL query results with additional context.
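As a compressed sketch of how such a generate-execute-summarize loop could be wired up (not the authors' exact implementation), the snippet below uses the OpenAI Python client together with SPARQLWrapper against a GraphDB endpoint; the prompts, endpoint URL, and error handling are simplified assumptions.

```python
from openai import OpenAI
from SPARQLWrapper import JSON, SPARQLWrapper

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
ENDPOINT = "http://localhost:7200/repositories/factory"  # hypothetical GraphDB repo

def answer_with_query_generation(question: str, ontology_summary: str) -> str:
    # 1. Ask the LLM to translate the question into a SPARQL query.
    #    (A real system would also strip markdown fences and retry on errors.)
    generated = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system",
             "content": f"Write a single SPARQL SELECT query. Ontology:\n{ontology_summary}"},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # 2. Execute the generated query against the triple store.
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(generated)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]

    # 3. Let the LLM phrase the raw bindings as a natural-language answer.
    return client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user",
                   "content": f"Question: {question}\nQuery results: {rows}\nAnswer concisely."}],
    ).choices[0].message.content
```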
4 Results

To thoroughly evaluate the system, we employed three different evaluations: (a) assessing the accuracy of data retrieval based on query parameters (not using query generation), (b) evaluating the system's ability to correctly fetch the number of instances (testing query generation), and (c) conducting a manual assessment of the most relevant user queries.

4.1 Accuracy of Data Retrieval

The first approach involved testing the system's ability to accurately retrieve data that met specific query conditions without employing SPARQL query generation. We focused on queries where the user requested a list of machines of a particular type with a voltage requirement less than or equal to a specified value. An example query would be: “Return all drilling machines that consume at most 4 volts and specify their consumption.” We conducted these tests on three types of machines: "drilling machines", "circle cutters", and "circular saws". The voltage values specified in the queries ranged from 0 to 10 volts, inclusive. The evaluation was designed to measure how accurately the system could identify and return the correct set of machines based on these voltage constraints.

For these tests, the following parameters were used (scope: 100, breadth: 1000, score weight: 100, model: gpt-4-1106-preview, query generation strategy: disabled).

The system's performance was assessed by comparing the retrieved data against the expected results, specifically checking the number of machines that met the voltage criteria and identifying any errors, such as incorrect voltage values or unnecessary machine retrievals. Results are depicted in Figures 4 and 5.

Figure 4: Performance of the system by the type of the machine and query.

In Figure 4, each table contains four columns: "V" (voltage specified in the query), "R" (percentage of correctly retrieved machines), "W" (number of machines with incorrect voltage), and "A" (number of unnecessary machine retrievals). Figure 5 summarizes the results: "Fully Correct Answers" shows the percentage of queries that returned all requested information without errors; "Share of Expected Information Found" indicates the proportion of requested information retrieved; and "Share of Incorrectly Displayed Voltages" represents the percentage of retrieved voltages that were incorrect.

Figure 5: Combined performance.

The results show that sometimes the LLM would incorrectly generate a different voltage requirement for a machine, making it appear to satisfy the query conditions. However, the retrieved machines were always of the correct type. For example, a query like “Name all drilling machines and specify their voltage requirements” correctly retrieves all machines with the right specifications, suggesting the issue may lie with the LLM rather than the knowledge retrieval process.

To address this, users can try adjusting query parameters or rewording the query to verify the information's accuracy. If this type of query is crucial, incorporating voltage-specific queries into the query generation strategy could improve reliability, although the LLM may struggle with large lists due to its context window limitations. As shown in Figure 5, these types of queries often do not reliably provide all requested information in one answer, so users should run multiple queries to increase the likelihood of retrieving all necessary data.

4.2 Instance Fetching Accuracy

In these tests, we tested the query generation strategy. The following parameters were used (scope: 100, breadth: 1000, score weight: 100, model: gpt-4-1106-preview, query generation strategy: enabled).

The queries asked for the number of available instances for selected machine models, such as “Get the number of available [name of the machine 1], [name of the machine 2] machine instances. Specify the number for each machine type separately.”. The query format was picked such that the LLM will benefit from query generation (the availability property is specified in the schema supplied for query generation).

A total of 100 queries were run, with 10 queries for each number of specified machine models (ranging from 1 to 10 models). The share of fully correct answers for each query type was between 80 and 100%. The overall accuracy was 96%. This supports our hypothesis that the query generation strategy provides more accurate answers for slightly more complex queries.

4.3 Manual Evaluation of Example Queries

This evaluation was initially performed to identify several shortcomings in our methodologies, as mentioned in the previous subsections. By manually evaluating specific queries relevant to end users, we were able to partially address these issues and fine-tune parameters to achieve more accurate results. For instance, while the system's initial results were often incomplete (e.g., a query did not return all the machines satisfying certain criteria), increasing the breadth parameter to include a larger subgraph and allowing LLMs to traverse a broader neighborhood improved the results. Additionally, we demonstrated that subgraph retrieval and query generation can complement each other, further enhancing overall performance. All the results are commented on in detail in [6].
5 Conclusions

In this paper, we presented a system that bridges the gap between natural language processing and querying knowledge graphs, specifically within the context of Industry 4.0. By leveraging large language models (LLMs) and retrieval-augmented generation (RAG), our system allows users to interact with complex knowledge graphs using natural language queries, thereby simplifying access to detailed manufacturing data.

Our evaluation demonstrated the usability of our system; however, with the integration of LLMs for natural language understanding, some challenges remain. These include occasional inaccuracies in data retrieval and the LLM's limited ability to handle large datasets or specific queries. By adjusting subgraph retrieval parameters such as breadth and scope, and by combining it with SPARQL query generation, we were able to significantly enhance the system's accuracy and reliability.

This work highlights the potential of combining knowledge graphs with LLMs to create more intuitive and effective query systems in industrial environments. Future improvements could focus on refining query strategies and further optimizing the balance between subgraph retrieval and SPARQL generation to ensure even more robust and comprehensive query handling.

Acknowledgements

This work was supported by the European Commission under the Horizon Europe project HumAIne, Grant Agreement No. 101120218. We would like to express our gratitude to all project partners for their contributions and collaboration.

References

[1] Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[2] Diego Bustamante and Hideaki Takeda. 2024. SPARQL generation with entity pre-trained GPT for KG question answering. arXiv preprint arXiv:2402.00969.
[3] 2022. Details of the Asset Administration Shell. https://www.plattform-i40.de/IP/Redaktion/EN/Downloads/Publikation/Details_of_the_Asset_Administration_Shell_Part1_V3.pdf (visited on 02/22/2024).
[4] Chao Feng, Xinyu Zhang, and Zichu Fei. 2023. Knowledge solver: teaching LLMs to search for domain knowledge from knowledge graphs. ArXiv, abs/2309.03118. https://api.semanticscholar.org/CorpusID:261557137.
[5] Luis Gutiérrez and Brian Keith. 2019. A systematic literature review on word embeddings. In Trends and Applications in Software Engineering: Proceedings of the 7th International Conference on Software Process Improvement (CIMPS 2018). Springer, 132–141.
[6] Domen Hočevar. 2024. Integrating Knowledge Graphs and Large Language Models for Querying in an Industrial Environment. Bachelor's Thesis. University of Ljubljana, Faculty of Computer and Information Science, Faculty of Mathematics and Physics, Ljubljana, Slovenia (Aug. 2024). Interdisciplinary University Study Program, First Cycle, Computer Science and Mathematics.
[7] HumAIne Horizon. 2024. HumAIne Horizon. https://humaine-horizon.eu/. Accessed: 2024-08-26.
[8] Jorge Pérez. 2006. Semantics and complexity of SPARQL. In Proc. 5th Int. Semantic Web Conference (ISWC 2006).
[9] Jinyang Li et al. 2024. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Advances in Neural Information Processing Systems, 36.
[10] Lars-Peter Meyer, Johannes Frey, Felix Brei, and Natanael Arndt. 2024. Assessing SPARQL capabilities of large language models. arXiv: 2409.05925 [cs.DB]. https://arxiv.org/abs/2409.05925.
[11] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.
[12] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: a roadmap. IEEE Transactions on Knowledge and Data Engineering.
[13] Diego Sanmartin. 2024. KG-RAG: bridging the gap between knowledge and creativity. arXiv preprint arXiv:2405.12035.
[14] Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, and Dhagash Mehta. 2024. HybridRAG: integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction. arXiv preprint arXiv:2408.04948.
[15] Jianguo Wang et al. 2021. Milvus: a purpose-built vector data management system. In Proceedings of the 2021 International Conference on Management of Data, 2614–2627.
[16] Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. 2021. QA-GNN: reasoning with language models and knowledge graphs for question answering. arXiv preprint arXiv:2104.06378.
Comparative Analysis of Machine Learning Models for Groundwater Level Forecasting: The Impact of Contextual Data

Rok Klančič (rok.klancic@gmail.com), Jožef Stefan Institute, Ljubljana, Slovenia
Klemen Kenda (klemen.kenda@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
DOI: https://doi.org/10.70314/is.2024.sikdd.6

Abstract
This paper presents a comparative evaluation of three distinct categories of models applied to groundwater level data: traditional batch learning methods, time series deep learning methods, and time series foundation models. By enriching the water level data with weather-related features, we significantly improved the effectiveness of simpler models. The results demonstrate that, despite their state-of-the-art performance on univariate datasets and the corresponding publicity, advanced models without contextual feature support are still surpassed by traditional methods trained on enriched datasets.

Keywords: groundwater level prediction, time series forecasting, deep learning, foundation models, contextual data

1 Introduction

Accurate water level prediction is crucial for mitigating the impacts of climate change on water resources. By forecasting water levels, we can better prepare for potential floods and droughts, and more effectively manage our water supplies. However, predicting water levels presents a significant challenge due to the dynamic nature of the data. As climate change leads to prolonged droughts and increasingly erratic precipitation patterns, the need for reliable forecasting methods becomes even more important [2].

In this paper, we aim to compare the performance of various models in forecasting groundwater levels. Specifically, we focus on the differences between traditional batch learning methods that utilize relevant contextual data and newer univariate time series deep learning and foundation models.

The main contributions of this paper are:
• A comparative analysis of the performance of traditional batch learning methods against state-of-the-art time series deep learning techniques and time series foundation models, particularly in the context of feature vectors enriched with relevant contextual data.
• The application of time series foundation models and deep learning methods to the domain of groundwater level forecasting.

The groundwater dataset used in this study has previously been employed for predictive modeling with traditional batch learning methods [9], where extensive feature engineering was also performed. Our work builds upon and extends this earlier research by incorporating a different set of models.
2 Methods

In our experiments, we employed three categories of methods: traditional batch learning techniques, time series deep learning models, and time series foundation models.

2.1 Traditional Batch Learning Methods

In the context of data-driven modelling of environmental issues, traditional batch learning methods have historically demonstrated significant success [5]. In this study, we employed linear regression alongside two tree-based approaches, random forest and gradient boosting [7], as baselines to evaluate whether the newer, more prominent techniques, which have recently gathered a considerable amount of attention, can perform competitively in this specific setting.

All of the chosen batch learning techniques are regression-based and are valued for their simplicity, speed, and ease of use. However, they often lack the complexity necessary to fully capture intricate patterns in the data. To mitigate this limitation, we incorporated contextual features, such as weather data and forecasts (e.g., precipitation, cloud cover, temperature). While the data fusion problem is solved [8], this approach raises concerns about the availability and relevance of the contextual data.

2.2 Time Series Deep Learning Methods

Time series deep learning models are explicitly designed for forecasting time-dependent data. In our study, we employed N-BEATS [12] and PatchTST [10], both of which have architectures tailored to capture trends and seasonalities inherent in time series data. Despite their advanced capabilities, these models have drawbacks, including longer training and inference times, the necessity for extensive hyperparameter tuning to achieve optimal performance, and limited support for incorporating additional features. Although certain models support multivariate time series, they were not utilized in our experiments.

2.3 Time Series Foundation Models

While deep learning methods require separate training and prediction phases, time series foundation models aim to eliminate the training step. Inspired by large language models, these models are pretrained on extensive time series datasets, enabling zero-shot predictions on new time series without additional training. We used CHRONOS [1], an open-source foundation model. The advantages of this approach include ease of use, with minimal parameter adjustments and no need for training. However, similar to deep learning models, they lack support for multivariate time series.

Several studies have already evaluated the performance of various deep learning and foundation models for time series forecasting [1][13]. However, this research extends the application of these forecasting models to groundwater level data, therefore contributing to a better understanding of their effectiveness in this domain.
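As an illustration of the zero-shot workflow, here is a minimal sketch of producing forecasts with the chronos-t5-large checkpoint via the chronos-forecasting package; the synthetic series stands in for the groundwater data, and the exact predict signature may differ slightly between package versions.

```python
import torch
from chronos import ChronosPipeline

# Load the pretrained foundation model (no task-specific training needed).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Synthetic stand-in for a univariate series of daily water-level changes.
context = torch.randn(2500)

# Sample forecast trajectories for the next 5 days; shape is
# (num_series, num_samples, horizon). The per-step median serves
# as a point forecast.
forecast = pipeline.predict(context, prediction_length=5)
point_forecast = forecast[0].median(dim=0).values
print(point_forecast)
```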
3 Experiment Setting

The experiments were conducted on a dataset of groundwater levels in Slovenia. Due to the cumulative nature of water levels, and to facilitate comparison with the original study [9], predictions were made on daily changes in water levels rather than on absolute values.

3.1 Dataset

The groundwater dataset is a subset of the larger dataset used in the study [9]. It consists of groundwater level measurements taken daily from multiple stations across Slovenia. To apply traditional batch learning methods, we enriched the dataset with weather data, associating each water measurement station with the nearest weather station. Due to the availability of weather data, only data from the years 2010 to 2017 was included in our study. For consistency and ease of comparison with the previous study [9], we focused on data from two water measurement stations located in Ljubljana.

In traditional batch learning within the environmental domain, it is essential to not only use the raw data but also to engineer relevant features. Initially, we removed the pressure and dew point features, as they were either unrelated to the target variable or highly correlated with other features [9]. We then created additional features by shifting the data from 1 to 10 days, making historical values available, and by computing the averages of features over a 2- to 10-day window. This process resulted in approximately 2,000 features. Given the excessive number of features, which could degrade model performance, we employed a feature selection algorithm to identify the most informative subset.

We used a genetic feature selection algorithm from scikit-learn, evaluated on a 365-day part of the training dataset, with the maximum number of features set to 40. The algorithm was executed separately for each model, focusing on one station and a prediction horizon of three days, resulting in distinct feature vectors. Subsequently, weather forecast features with longer offsets were manually added to the selected feature set.
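A condensed sketch of the shift-and-average feature construction described above, using pandas; the column names and the toy frame are illustrative, and the subsequent genetic selection step is omitted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "level_change": rng.normal(0, 0.02, 400),   # daily groundwater level change
    "precipitation": rng.gamma(1.0, 2.0, 400),
    "temperature": rng.normal(10, 8, 400),
}, index=pd.date_range("2010-01-01", periods=400, freq="D"))

features = {}
for col in df.columns:
    for shift in range(1, 11):                  # historical values, 1-10 days back
        features[f"{col}_shift_{shift}"] = df[col].shift(shift)
    for window in range(2, 11):                 # rolling means over 2-10 day windows
        features[f"{col}_avg_{window}"] = df[col].rolling(window).mean()

X = pd.DataFrame(features).dropna()
# 3-day-ahead target, matching the horizon used in the selection run;
# trailing NaN rows would be dropped before training.
y = df["level_change"].shift(-3).loc[X.index]
print(X.shape)  # a few dozen columns here; ~2,000 in the full feature space
```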
3.2 Evaluation Metrics

The dataset was split into a training set (approx. 2,500 days), a validation set (100 days), and a test set (365 days) for model evaluation. Model performance was evaluated using the R² score, averaged across all tested stations. Although alternative metrics such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE) were considered, they, for this dataset, produce results that are closely related to the R². This metric was selected due to its robustness against variations in data offset and amplitude, and for direct comparability with the results in the original study [9]. The R² score is defined as:

$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$,

where $y_i$ is the i-th true value, $\hat{y}_i$ is the i-th predicted value, and $\bar{y}$ is the average of the true values.

3.3 Baseline Methods

The primary objective of our research was to compare the performance of traditional batch learning methods, enriched with relevant contextual features, against that of modern deep learning techniques and foundation models for time series forecasting. Therefore, we selected linear regression, random forest regressor, and gradient boosting regressor as our baseline methods. These models were previously applied to the groundwater dataset [9], necessitating a reproduction of the results as a benchmark.

3.4 Implementation Details

The prediction pipelines varied slightly between the different types of models:
• For CHRONOS, we utilized the dataset without weather features, as it only supports univariate time series. Since no hyperparameter tuning was required, the data was divided into training and test sets, omitting the validation set. The model generated the predictions directly from the water level data. We used the chronos-t5-large model from the chronos library.
• For N-BEATS and PatchTST, the same dataset was used, given the same limitation as mentioned previously. However, a validation set was required for hyperparameter tuning. After selecting appropriate hyperparameters, the models were trained on the training set and evaluated on the test set. Implementations from the NeuralForecast library were used for both models.
• For the linear regression, random forest regressor, and gradient boosting regressor models, we included both water level and weather data. Feature selection was conducted to reduce the number of features, resulting in 42 features for linear regression, 30 for random forest, and 36 for gradient boosting. After feature selection, hyperparameters for the random forest and gradient boosting models were tuned, and the data for linear regression was normalized. The models were then trained on the training set and evaluated on the test set using scikit-learn's implementations.

The hyperparameters used for training are listed in Appendix A, while a description of the selected features is provided in Appendix B.

4 Results

The results for all tested models across various prediction horizons are presented in Table 1. The reported R² scores were calculated based on the differences in water levels; if absolute water levels had been used, the R² scores would have been significantly higher. For example, in the case of CHRONOS with 1-day ahead predictions, the R² score is 0.725 for relative level differences and 0.998 for absolute water levels.

Table 1: R² Scores for Different Prediction Horizons and Models.
Methods | 1 day ahead | 2 days ahead | 3 days ahead | 4 days ahead | 5 days ahead
Chronos-large | 0.725 | 0.365 | 0.175 | 0.04 | -0.09
GradientBoostingRegressor | 0.640 | 0.603 | 0.527 | 0.556 | 0.545
RandomForestRegressor | 0.726 | 0.697 | 0.701 | 0.706 | 0.691
N-BEATS | 0.742 | 0.397 | 0.17 | -0.03 | -0.143
PatchTST | 0.721 | 0.394 | 0.215 | 0.109 | -0.02
LinearRegression | 0.792 | 0.781 | 0.785 | 0.784 | 0.780
The best and second-best results are bolded and underlined respectively.

Among the models, linear regression achieved the highest performance, followed by the random forest. In contrast, the more complex methods, including deep learning models and the foundation model, showed generally lower performance, with the exception of the 1-day prediction horizon, where N-BEATS outperformed the tree-based models. Notably, the R² scores decrease as the prediction horizon lengthens, with a more pronounced decline observed in the deep learning and the foundation models compared to the traditional batch learning methods.

Figures 2 and 3 display the predictions from CHRONOS, PatchTST, and linear regression compared to the true data for the 1-day and 5-day prediction horizons. It is evident that the predictions from CHRONOS and PatchTST begin to exhibit a rightward shift as the horizon extends. Figure 1 visualizes the R² scores for all models across the different prediction horizons.

Figure 1: R² Scores for All of the Methods and Prediction Horizons.
Figure 2: Example Predictions for Three Models for 1-Day Prediction Horizon.
Figure 3: Example Predictions for Three Models for 5-Day Prediction Horizon.

The results indicate that traditional methods, when supplemented with relevant contextual features, outperform more complex models that do not incorporate such data. While the 1-day ahead predictions show comparable performance across all methods, as the prediction horizon extends, the accuracy of CHRONOS, PatchTST, and N-BEATS declines sharply. In contrast, the traditional models, supported by contextual features, maintain their predictive accuracy much more effectively, as shown in Figure 1.

A closer examination of the predictions in Figures 2 and 3 reveals that for 1-day ahead predictions, all models track the true data closely. However, in the 5-day ahead predictions, models lacking contextual data begin to exhibit a rightward shift in their predictions. This likely occurs due to the absence of contextual information, causing these models to lag in capturing the true trajectory of water levels. In contrast, models with access to weather data can predict further ahead by accounting for factors such as the impact of rainfall patterns on water levels.

An unexpected finding is that among the baseline models, linear regression outperforms the more sophisticated methods. For instance, in the article [9], while linear regression produced strong results, it did not surpass the performance of the other two methods.
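The gap between R² on level differences and on absolute levels reported above is easy to reproduce: because absolute levels behave like a random walk, even forecasts with modest skill on the daily changes track the level series almost perfectly. A toy illustration with scikit-learn's r2_score (synthetic numbers, not the study's data):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
diffs_true = rng.normal(0, 0.02, 365)               # daily level changes (m)
diffs_pred = diffs_true + rng.normal(0, 0.01, 365)  # imperfect forecasts

levels_true = 300 + np.cumsum(diffs_true)           # absolute levels around 300 m
prev_levels = np.concatenate(([300.0], levels_true[:-1]))
levels_pred = prev_levels + diffs_pred              # 1-day-ahead level forecasts

print(r2_score(diffs_true, diffs_pred))    # moderate score on differences
print(r2_score(levels_true, levels_pred))  # near 1 on absolute levels
```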
5 Conclusion and Future Work

After evaluating all models on the groundwater level dataset, we observed that traditional methods, when equipped with relevant features, consistently outperformed newer and more sophisticated techniques, particularly as the prediction horizon lengthened. This suggests that the emphasis on developing the most powerful deep learning or foundation models for time series predictions may be overstated. With thoughtful selection of contextual features, even the simplest models can outperform modern approaches, which is a significant finding for fields with sufficient contextual data, such as data-driven environmental modelling.

To enhance the robustness of our evaluation, future work could involve testing additional methods, expanding the analysis to include more measurement stations and surface water level data, and incorporating deep learning models that support multivariate time series, such as N-BEATSx [11] and N-HiTS [3]. Further insights could be gained by exploring foundation models with multivariate support, such as TimesFM [4], as well as some more univariate models, like TimeGPT-1 [6]. Future research could also compare the inference times of various models and assess performance across different time series lengths.

Acknowledgements

This work was supported by the European Commission under the Horizon Europe project Plooto, Grant Agreement No. 101092008. We would like to express our gratitude to all project partners for their contributions and collaboration. Furthermore, we would like to thank Erik Novak for his assistance in completing this research.

A Hyperparameters

Table 2: Hyperparameters Used for Gradient Boosting Regressor and Random Forest Regressor.
Hyperparameter | GradientBoosting | RandomForest
n_estimators | 28 | 164
max_features | 'log2' | 0.5
max_depth | 10 | 20

Table 3: Hyperparameters Used for N-BEATS and PatchTST.
Hyperparameter | N-BEATS | PatchTST
loss | HuberLoss | /
n_harmonics | 5 | /
n_polynomials | 5 | /
scaler_type | 'robust' | /
n_blocks | [3, 3, 1] | /
mlp_units | [[128, 128]] | /
horizon | 5 | 5
input_size | 15 | 71
learning_rate | 0.001 | 0.001
max_steps | 25 | 1323
encoder_layers | / | 12
n_heads | / | 16
hidden_size | / | 64
linear_hidden_size | / | 512
dropout | / | 0.2
fc_dropout | / | 0.1
head_dropout | / | 0.1
attn_dropout | / | 0.2
patch_len | / | 16
stride | / | 8
revin | / | True
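A sketch of how the Appendix A configuration maps onto the NeuralForecast library, with synthetic long-format data; only a subset of the listed hyperparameters is shown, and argument names may differ across library versions.

```python
import numpy as np
import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS, PatchTST

# NeuralForecast expects long-format data: unique_id, ds (timestamp), y (target).
df = pd.DataFrame({
    "unique_id": "station_lj_1",
    "ds": pd.date_range("2010-01-01", periods=500, freq="D"),
    "y": np.random.default_rng(0).normal(0, 0.02, 500),
})

models = [
    NBEATS(h=5, input_size=15, max_steps=25, learning_rate=0.001),
    PatchTST(h=5, input_size=71, max_steps=1323, learning_rate=0.001,
             patch_len=16, stride=8),
]
nf = NeuralForecast(models=models, freq="D")
nf.fit(df=df)
print(nf.predict().head())  # 5-day-ahead forecasts per model
```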
References

[1] Abdul Fatir Ansari et al. 2024. Chronos: learning the language of time series. arXiv preprint arXiv:2403.07815.
[2] ARSO. 2009. Freshwater. Retrieved August 27, 2024 from https://www.arso.gov.si/en/soer/freshwater.html.
[3] Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza Ramirez, Max Mergenthaler Canseco, and Artur Dubrawski. 2023. N-HiTS: neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37, 6, 6989–6997.
[4] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. 2023. A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688.
[5] Fan Feng, Hamzeh Ghorbani, and Ahmed E. Radwan. 2024. Predicting groundwater level using traditional and deep machine learning algorithms. Frontiers in Environmental Science, 12. doi: 10.3389/fenvs.2024.1291327.
[6] Azul Garza and Max Mergenthaler-Canseco. 2023. TimeGPT-1. arXiv preprint arXiv:2310.03589.
[7] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. 2009. The elements of statistical learning: data mining, inference, and prediction. Vol. 2. Springer.
[8] Klemen Kenda, Blaž Kažič, Erik Novak, and Dunja Mladenić. 2019. Streaming data fusion for the internet of things. Sensors, 19, 8. doi: 10.3390/s19081955.
[9] Klemen Kenda, Jože Peternelj, Nikos Mellios, Dimitris Kofinas, Matej Čerin, and Jože Rožanec. 2020. Usage of statistical modeling techniques in surface and groundwater level prediction. Journal of Water Supply: Research and Technology-Aqua, 69, 3, (Apr. 2020), 248–265. doi: 10.2166/aqua.2020.143.
[10] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2022. A time series is worth 64 words: long-term forecasting with transformers. arXiv preprint arXiv:2211.14730.
[11] Kin G. Olivares, Cristian Challu, Grzegorz Marcjasz, Rafał Weron, and Artur Dubrawski. 2023. Neural basis expansion analysis with exogenous variables: forecasting electricity prices with NBEATSx. International Journal of Forecasting, 39, 2, 884–900. doi: 10.1016/j.ijforecast.2022.03.001.
[12] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2019. N-BEATS: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437.
[13] Hongwei Ye et al. 2024. A transformer-based forecasting model for F10.7 index and its application study on the Chinese Langfang dataset. Advances in Space Research. doi: 10.1016/j.asr.2024.08.024.

A Hyperparameters

Table 2: Hyperparameters Used for Gradient Boosting Regressor and Random Forest Regressor.

    Hyperparameter   GradientBoosting   RandomForest
    n_estimators     28                 164
    max_features     'log2'             0.5
    max_depth        10                 20

Table 3: Hyperparameters Used for N-BEATS and PatchTST.

    Hyperparameter       N-BEATS        PatchTST
    loss                 HuberLoss      /
    n_harmonics          5              /
    n_polynomials        5              /
    scaler_type          'robust'       /
    n_blocks             [3, 3, 1]      /
    mlp_units            [[128, 128]]   /
    horizon              5              5
    input_size           15             71
    learning_rate        0.001          0.001
    max_steps            25             1323
    encoder_layers       /              12
    n_heads              /              16
    hidden_size          /              64
    linear_hidden_size   /              512
    dropout              /              0.2
    fc_dropout           /              0.1
    head_dropout         /              0.1
    attn_dropout         /              0.2
    patch_len            /              16
    stride               /              8
    revin                /              True

B Selected Features

Due to the large number of features selected by the feature selection algorithm, we provide a summarized description of the most frequently chosen features. The features that appeared most often include shifts and averages of precipitation, precipitation forecasts, temperature, altitude difference, cloud cover, humidity, and snow accumulation. Notably, the majority of selected features were derived features we generated, with only approximately one original feature being selected per model.

In Table 4, the most common shifts and averages for each individual model are presented. The table indicates that shifts and averages of varying lengths were selected, with a slight preference for shorter ones.

Table 4: Most Frequently Selected Shifts and Averages for Various Methods.

    Method                      Shifts (days)   Averages (days)
    GradientBoostingRegressor   4, 10           2, 6
    RandomForestRegressor       2, 6            3, 9
    LinearRegression            2, 10           2, 7
    Combined                    2, 10           2, 3
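To make the appendices concrete, the following is a minimal sketch of how the derived features of Table 4 (shifts and rolling averages) and the Gradient Boosting hyperparameters of Table 2 could be combined, assuming pandas and scikit-learn; the synthetic DataFrame and column names are hypothetical, and the sketch illustrates the general approach rather than the authors' exact pipeline.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical daily measurements with a target and one contextual feature.
    dates = pd.date_range("2020-01-01", periods=400, freq="D")
    df = pd.DataFrame({"water_level": np.random.rand(400).cumsum(),
                       "precipitation": np.random.rand(400)}, index=dates)

    def add_shift_and_average_features(frame, column, shifts=(4, 10), averages=(2, 6)):
        """Derive lagged ('shift') and rolling-mean ('average') features for one column.
        Defaults match the GradientBoostingRegressor row of Table 4."""
        out = frame.copy()
        for s in shifts:
            out[f"{column}_shift_{s}d"] = out[column].shift(s)
        for a in averages:
            out[f"{column}_avg_{a}d"] = out[column].rolling(window=a).mean()
        return out

    df = add_shift_and_average_features(df, "precipitation")
    df["target_5d"] = df["water_level"].shift(-5)  # 5-day-ahead prediction target
    df = df.dropna()

    features = [c for c in df.columns if c not in ("water_level", "target_5d")]
    model = GradientBoostingRegressor(n_estimators=28, max_features="log2", max_depth=10)
    model.fit(df[features], df["target_5d"])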
Interactive Tool for Tracking Open-source Artificial Intelligence Progress on Hugging Face

Bogdan Šinik (bogdan.sinik@famnit.upr.si, UP FAMNIT, Koper, Slovenia)
Domen Vake (domen.vake@famnit.upr.si, UP FAMNIT, Koper, Slovenia)
Jernej Vičič (jernej.vicic@upr.si, UP FAMNIT, UP IAM, Koper, Slovenia)
Aleksandar Tošić (aleksandar.tosic@upr.si, UP FAMNIT, InnoRenew CoE, Koper, Slovenia)

https://doi.org/10.70314/is.2024.sikdd.1

Abstract

Given its increasing importance in our daily lives, Artificial Intelligence (AI) has become a prominent subject that requires extensive investigation and understanding. This study presents an analysis of the open-source community in the field of AI. Various questions arise whenever AI is introduced, and open-source AI raises additional concerns: should AI be universally accessible, or should it be restricted to private use? Is it worthwhile to offer base models to the broad user population? We collected the most important data from the primary website in the field, Hugging Face, and developed a tool that allows for straightforward monitoring of the progress of various open-source AI models using data obtained from its leaderboard. The tool offers accessible and valuable information about various AI models, including their architectures and the activities of their authors. Even a quick review with our tool makes it evident that the open-source community has grown large and has an undeniable impact on the AI community.

Keywords: LLM, open-source, AI, Hugging Face

1 Introduction

Artificial intelligence, particularly in the form of large language models (LLMs), is an important topic in the computer industry today. Despite the numerous fears and dogmas around it, it is certain that AI has become an integral aspect of our lives. This research concentrates on the development of a tool for monitoring the impact of the open-source community in the area of artificial intelligence. As implied, open-source models are accessible to all individuals, and there is considerable debate on whether this type of technology should be universally accessible. We wanted to investigate whether the open-source community is actively contributing to the development of the field, regardless of one's philosophical convictions.

Due to the substantial computational requirements, it was previously impossible to run large language models on personal computers. As increasingly compact versions with impressive capabilities are being produced, this is changing significantly: it is currently feasible to run your own model, as long as it is of modest enough size, on a home computer's graphics processing unit (GPU), even if the GPU is a few years old [9]. This rise in accessibility also enables a larger community to test and develop new solutions and build on top of existing models. We believe there is a clear lack of tools for monitoring the impact of this movement.

Hugging Face (https://huggingface.co/) has grown into one of the primary platforms for the open-source community. Users are able to download and interact with all significant open-source models. Users also have the option to publish their models on the platform and compare their performance by adding them to the leaderboard, where all the models are benchmarked and ranked. The open-source community relies heavily on the distribution of models by large corporations, as creating a model from scratch is a hard undertaking [9]. The platform facilitates collaboration among open-source contributors, enabling them to generate social media content together, exchange ideas, and even publish concise articles. In addition to models, users can create and upload useful datasets. Hugging Face thus hosts the most advanced and innovative developments in the field of open-source AI and machine learning.

An issue we have observed is the absence of effective visualization tools on Hugging Face that would enable users to easily see patterns and gain a comprehensive understanding of the open-source AI area. To address this issue, we have developed a tool that offers users various viewpoints on the data.
2 Literature review

Large Language Models (LLMs) have proven essential in enhancing software engineering (SE) tasks, demonstrating their effectiveness in code comprehension. As with conventional software engineering tools, open-source cooperation is essential for achieving superior products in this area [8].

The article by Patel and Ahmad [9] emphasizes the significance of the open-source AI community and elucidates its rapid growth in the wake of major industry leaders like Google, Microsoft, and OpenAI. The day when the LLama model was first made available to the open-source community is often emphasized as an important milestone in this area; the community promptly recognized the possibilities and potential involved in this release.

Due to its continuous growth, Hugging Face has emerged as the primary platform for exchanging machine learning (ML) models, resulting in an increasing level of complexity. A relational database called HFCommunity was established to facilitate the analysis and resolution of this issue [1].

As noted above, open-source AI models offer an extensive range of possibilities. The authors of HuggingGPT [12] demonstrated an effective use of Hugging Face: because developing a single model with broad intelligence is very difficult, they merged ChatGPT's capabilities with models from Hugging Face using an agentic architecture to obtain impressive results in multiple domains. ChatGPT was tasked with creating a plan of action and assigning specific duties to each open-source model based on its area of expertise. This is an excellent demonstration of the influence and capabilities of the open-source community, given the required familiarity with open models and their capabilities.

The article [6] examines the vulnerabilities associated with open-source AI. A much higher number of repositories with high vulnerabilities was discovered compared to those with low vulnerabilities, particularly in root repositories. This emphasizes the importance of securing the technology in order to facilitate its use.

In a recent paper [10], the authors analyzed the transparency of Hugging Face pre-trained models regarding dataset usage and licenses. The analysis revealed that there is often a lack of transparency regarding the training datasets, inherent biases, and licensing details of pre-trained models, and identified numerous potential licensing conflicts involving client projects. Of the 159,132 models examined, merely 14% explicitly identify their datasets with specific tags. Furthermore, a detailed examination of a statistically significant sample comprising 389 of the most frequently downloaded models showed that 61% documented their training data in some form.
3 Methodology

We obtained the data by extracting the Open LLM Leaderboard from Hugging Face [2], saving the data the server sent to the client. This data contains information about the repositories of models that are currently on the leaderboard and of models waiting to be evaluated for the leaderboard. A Python pipeline, available on GitHub (https://github.com/VakeDomen/HF_analysis), was developed to clean and enrich this data. The leaderboard data includes model architecture and precision, as well as the model type and performance on the following benchmarks: ARC [3], HellaSwag [14], MMLU [5], TruthfulQA [7], WinoGrande [11], and GSM8K [4]. In addition to the data provided on the leaderboard, further information on the given models was obtained using the HF API client, including data about repository contributors, tags, base models, used datasets, and repository activity. It is important to note that this data is self-reported by the developers and is not enforced by Hugging Face. Additionally, the leaderboard includes duplicates because developers can replace models in a repository with different models under the same name, so the duplicates share the same repository data but have distinct performances. Since the current model in a repository cannot be determined programmatically, when removing duplicates we chose the best-performing model under the repository name as the model representing that repository. All datasets were then generated for further use.

The subsequent analysis was conducted using the R programming language. The data was mostly studied from the perspective of time, as our focus was on identifying any obvious trends. The data was categorized using several criteria, such as model type, model architecture, and number of parameters, and was first selected and aggregated to ensure that all crucial components were easily accessible. All models categorized as flagged were excluded from the dataset. In addition, we collected data on the authors' activities and analyzed that particular aspect. Once the data had been cleaned and prepared for visualization, we used the R ggplot library to create visual representations of the data. A comprehensive R Shiny app was developed by aggregating all the visuals. We chose Shiny because it is a great option for constructing interactive data analysis solutions: it enables the development of web applications that respond and adapt to real-time changes and user interactions, which simplifies the process of exploring and analyzing data, and it integrates easily with R, utilizing its robust statistical and graphical functionalities to generate complex, interactive visualizations without requiring experience with web technologies such as HTML, CSS, or JavaScript [13]. Finally, our application was deployed to a server, making it accessible online.
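As a rough illustration of the deduplication rule described above, the following is a minimal Python sketch, assuming scraped leaderboard rows loaded into a pandas DataFrame with hypothetical repository, score, and flagged columns; the actual pipeline in the linked repository may differ in detail.

    import pandas as pd

    # Hypothetical excerpt of scraped leaderboard rows: one repository name can
    # appear several times with different benchmark scores (replaced models).
    rows = pd.DataFrame({
        "repository": ["acme/llm-a", "acme/llm-a", "acme/llm-b"],
        "score":      [61.2,          64.8,         58.9],
        "flagged":    [False,         False,        True],
    })

    # Exclude flagged models, then keep the best-performing entry per repository
    # as its representative, mirroring the deduplication rule described above.
    deduplicated = (
        rows[~rows["flagged"]]
        .sort_values("score", ascending=False)
        .drop_duplicates(subset="repository", keep="first")
    )
    print(deduplicated)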
4 Results

The outcome of this study is the tool we have developed, accessible at https://oai.dltlt.famnit.upr.si/. It offers six distinct viewpoints, each conveniently accessible in its own tab.

The initial view, shown in Figure 1, displays both the count of new models and the distribution of various model types. Hugging Face identifies five distinct categories of models: basic merges and moerges, models fine-tuned on domain-specific datasets, chat models, continuously pretrained models, and pretrained models. If a model did not belong to any of these classes, its type was classified as unknown. The user can effortlessly choose their preferred categories, along with the desired time frame and unit of aggregation (daily, weekly, or monthly). This allows the viewer to clearly observe the evolution of model types and their popularity over time. It is evident that fine-tuned models predominate, which is logical, as users are adapting base models by training them on unique datasets to achieve specialization. We can also see that merged models are a relatively recent phenomenon.

Figure 1: Popularity by model type over time

The second view, shown in Figure 2, has two interconnected visualizations. The upper section displays the activity of the top 10 authors within a specific range of dates, showcasing every model they have developed along with its corresponding type. The lower section presents the average benchmark score for each model, organized by author. This visualization enables users to effortlessly monitor the most prominent authors and observe their patterns and accomplishments in model development over time. Users can choose a certain range of dates and also narrow the list down to the top 10 authors according to their preferences. It is evident that leading authors typically do not chase trends and consistently provide models of a similar type.

Figure 2: Top authors activity over time
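Several of these views aggregate new-model counts by a selectable time unit. The tool itself performs this aggregation in R with ggplot; a rough pandas equivalent, with invented created_at and model_type columns, might look like this:

    import pandas as pd

    models = pd.DataFrame({
        "created_at": pd.to_datetime(["2024-01-03", "2024-01-08", "2024-02-11"]),
        "model_type": ["fine-tuned", "merge", "fine-tuned"],
    })

    # freq="D", "W", or "MS" corresponds to the daily/weekly/monthly unit of aggregation.
    counts = (
        models.groupby([pd.Grouper(key="created_at", freq="W"), "model_type"])
        .size()
        .unstack(fill_value=0)   # one row per period, one column per model type
    )
    print(counts)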
The third view, shown in Figure 3, illustrates two aspects. The first is the change in the average benchmark score for each model type as time progresses. The display showcases the top-performing model for each category and time interval (daily, weekly, or monthly). In addition to the dots representing each model, we have incorporated a smoothed line to help the user see the temporal changes for a particular model type. Alongside this first visualization, we have included a second one that displays the total number of models for each model type within the chosen period range. Through these visualizations, users can easily identify the model type that improved the most and the model types that were produced the most. The trend indicates that open-source AI models are improving, as evidenced by rising average benchmark scores across most model types. The overall number of models is also rapidly increasing, indicating a rise in the popularity of open-source AI models.

Figure 3: Change of benchmark score and total models per type over time

The fourth view, shown in Figure 4, examines the changing popularity of various model architectures over time. The following architectures were chosen for this purpose: LLama, Mixtral, Mistral, Qwen2, Gemma, Phi, Opt, GPT2, and GPT-NeoX; all architectures that did not fit into any one category were classed as "Other". This view has two graphics that depict popularity. The first assesses the popularity of a model relative to itself, based on the number of new models introduced before; the second compares it to the average number of new models created, taking their architecture into account. Both are depicted as colored areas, as this is the most convenient way to track them. Users may analyze the fluctuation in popularity of well-known model architectures over time and examine how the rising popularity of one architecture might impact the popularity of another architecture of interest. The lower plot indicates that LLama and Mistral are the predominant models; nonetheless, they have experienced fluctuations over time, as visible in the upper plot.

Figure 4: Change of popularity of main architectures over time

The fifth view, shown in Figure 5, illustrates the progressive improvement of the key base models developed by well-known companies. This was accomplished by isolating each incremental improvement in score over time, using the base model as a reference. For this purpose, we chose five distinct variations of LLama, Mistral, and Mixtral, as well as three iterations of Phi. The user can easily observe the overall improvement in benchmark scores for each base model, as well as the overall time required for a model to achieve its maximum performance. We have included a feature that lets users toggle the visibility of model labels, enhancing legibility and facilitating more in-depth examination according to their preferences. This allows the user to observe how quickly specific models reached their peak performance and the extent of their improvement relative to the base models.

Figure 5: Evolution of famous base models
The final view, shown in Figure 6, illustrates the impact of significant releases on the popularity of various model designs. As it employs the same model architectures as the fourth view, we extracted and categorized all significant release dates of these models. The user can choose the time unit for aggregation (day, week, or month). Users may quickly analyze the impact of significant releases and observe how they influence the popularity and mass creation of specific models. We can observe the evident impact of the recent LLama and Mistral releases on their popularity.

Figure 6: Effect of big releases on architecture of produced models

5 Conclusion and future work

Given the growing importance of Artificial Intelligence in modern society, it is worth exploring the freely accessible solutions rather than depending solely on commercial alternatives. This paper presents a tool designed to simplify the examination of trends in open-source AI in a user-friendly manner. It offers various viewpoints and enables users to acquire knowledge and reach their own conclusions about the subject. Hugging Face can also function as an excellent tool for finding a particular model. As time progresses, open-source AI is expected to make a growing contribution to the AI community and to provide more specific applications for models that might be ignored by large organizations.

We aim to enhance the functionality of our Shiny application by incorporating more perspectives and expanding the range of data interaction options, and to keep the system as up to date as possible. Beyond that, we want to conduct a comprehensive analysis of the data to identify patterns and correlations within this community, assess the potential of these models, and examine their capabilities and potential uses in addressing real-world issues. We would also like to analyze the sustained popularity and efficacy of these models over a longer time frame.

References

[1] Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2023. HFCommunity: a tool to analyze the Hugging Face Hub community. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 728–732. doi: 10.1109/SANER56733.2023.00080.
[2] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open LLM Leaderboard. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
[3] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv: 1803.05457 [cs.AI].
[4] Karl Cobbe et al. 2021. Training verifiers to solve math word problems. arXiv: 2110.14168 [cs.CL].
[5] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. arXiv: 2009.03300 [cs.CY].
[6] Adhishree Kathikar, Aishwarya Nair, Ben Lazarine, Agrim Sachdeva, and Sagar Samtani. 2023. Assessing the vulnerabilities of the open-source artificial intelligence (AI) landscape: a large-scale analysis of the Hugging Face platform. In 2023 IEEE International Conference on Intelligence and Security Informatics (ISI), 1–6. doi: 10.1109/ISI58743.2023.10297271.
[7] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: measuring how models mimic human falsehoods. arXiv: 2109.07958 [cs.CL].
[8] Zhihao Lin et al. 2024. Open-source AI-based SE tools: opportunities and challenges of collaborative software learning. arXiv preprint arXiv:2404.06201.
[9] Dylan Patel and Afzal Ahmad. 2023. Google: "We have no moat, and neither does OpenAI." SemiAnalysis, May 4, 2023.
[10] Federica Pepe, Vittoria Nardone, Antonio Mastropaolo, Gerardo Canfora, Gabriele Bavota, and Massimiliano Di Penta. 2024. How do Hugging Face models document datasets, bias, and licenses? An empirical study.
[11] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. WinoGrande: an adversarial Winograd Schema Challenge at scale. arXiv: 1907.10641 [cs.CL].
[12] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors. Vol. 36. Curran Associates, Inc., 38154–38180. https://proceedings.neurips.cc/paper_files/paper/2023/file/77c33e6a367922d003ff102ffb92b658-Paper-Conference.pdf.
[13] Carson Sievert. 2020. Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC.
[14] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: can a machine really finish your sentence? arXiv: 1905.07830 [cs.CL].
Multilingual Hate Speech Modeling by Leveraging Inter-Annotator Disagreement

Patricia-Carla Grigor* (University of Vienna, Vienna, Austria)
Bojan Evkoski (evkoski_bojan@phd.ceu.edu, Central European University, Vienna, Austria)
Petra Kralj Novak (novakpe@ceu.edu, Central European University, Vienna, Austria; Jožef Stefan Institute, Ljubljana, Slovenia)

* The first author conducted the research with significant input from the second author, under the supervision and guidance of the third author. All authors contributed to writing the manuscript.

https://doi.org/10.70314/is.2024.sikdd.7

Abstract

As social media usage increases, so does the volume of toxic content on these platforms, motivating the Machine Learning (ML) community to focus on automating hate speech detection. While modern ML algorithms are known to provide nearly human-like results for a variety of downstream Natural Language Processing (NLP) tasks, the classification of hate speech is still an open challenge, partially due to its subjective annotation, which often leads to disagreement between annotators. This paper adopts a perspectivist approach that embraces subjectivity, leveraging conflicting annotations to enhance model performance in real-world scenarios. A state-of-the-art multilingual language model for hate speech detection is introduced, trained, and evaluated using diamond standard data with metrics that consider disagreement. Various strategies for incorporating disagreement are compared in the process. Results demonstrate that the model performs equally well or better on all evaluated languages compared to the respective monolingual models, and drastically outperforms them on multilingual data. This highlights the effectiveness of multilingual and perspectivist methods in addressing the complexities of hate speech detection. The presented multilingual hate speech detection model is available at: https://huggingface.co/IMSyPP/hate_speech_multilingual.
Keywords: hate speech detection, inter-annotator disagreement, multilingual language modeling

1 Introduction

The phenomenon of hate speech, typically defined as offensive or derogatory language targeting individuals or groups based on characteristics such as race, religion, ethnic origin, sexual orientation, disability, or gender [2], has become a significant problem on social networks in recent years, with communities being increasingly exposed to toxic content as the networks grow and become more interconnected [13, 3]. Consequently, the Machine Learning (ML) and computational linguistics communities have begun developing content moderation strategies using advanced algorithms and Natural Language Processing (NLP) techniques to detect hate speech [10, 11]. However, a key challenge is the subjectivity of hate speech, as annotators often disagree due to diverse backgrounds and perspectives.

To address this challenge, researchers have proposed alternative methodologies to ground-truthing, including the incorporation of diverse perspectives into the training and evaluation pipelines of ML models [1, 14]. One such approach is introduced by [7], who train monolingual hate speech classifiers in several languages directly on datasets that include disagreement. As an alternative to gold-standard data, such data is referred to as diamond standard data, based on the assumption that more than one single truth exists. In terms of evaluation, the researchers evaluate models from the perspective of disagreement, with the ultimate goal of estimating the agreement between the annotators themselves, as well as between models and annotators, using the appropriate metrics. Their main findings indicate that disagreement between annotators represents an intrinsic limitation on the performance that can be achieved by automated systems.

This paper aims to explore the potential of training a multilingual hate speech model, as well as to further explore ways of incorporating inter-annotator disagreement in model training. The paper is therefore based on the following research questions:

- How does the performance of multilingual hate speech classifiers trained on diamond standard data compare to the performance of monolingual models?
- How can inter-annotator disagreement be effectively incorporated into the classifier fine-tuning process?

In light of these research questions, the expected outcomes are twofold: (1) multilingual classifiers trained on diamond standard data are anticipated to outperform monolingual models, and (2) incorporating inter-annotator disagreement is expected to enhance sensitivity to nuanced hate speech. These findings could benefit computational linguistics research and social media providers by informing the development of more effective content moderation algorithms.

2 Related Work

Several methods exist for incorporating disagreement into ML training pipelines [12, 5], but few focus on hate speech detection. One approach is presented in [7], where monolingual hate speech classifiers were trained for English, Italian, and Slovenian. These classifiers utilized diamond standard datasets sourced from YouTube and Twitter, employing a consistent annotation process for each language. Their main findings indicate that, according to the accuracy scores, the annotators demonstrated a high degree of agreement in approximately 80% of the cases across all three datasets. In terms of Krippendorff's ordinal alpha score, which considers both agreement by chance and the ordering of classes (from least to most severe), the agreement score is approximately 0.6 for all three languages. Furthermore, the evaluation results indicate that the performance of each model aligned with the inter-annotator agreement, both in terms of accuracy and the alpha score. This implies that the performance of models is inherently constrained by the level of agreement among annotators; consequently, when trained on diamond standard data, it is unlikely that the performance of these models can significantly surpass human performance.

This work builds upon these findings by investigating the potential of multilingual models to enhance hate speech detection, with the aim of broadening their applicability across diverse linguistic contexts. Additionally, strategies for incorporating annotator disagreement were explored, with the goal of improving model performance to approach human-level accuracy and agreement.
3 Method

This section details the methodology for training and evaluating the multilingual hate speech classifier presented in this paper. It begins with a brief overview of the datasets used, followed by an explanation of the chosen pre-trained language model that serves as the foundation for fine-tuning. The section concludes with a description of the methods employed for evaluating the models.

3.1 Datasets

Three monolingual datasets, i.e. the English (YouTube), Italian (YouTube), and Slovenian (Twitter) datasets introduced in [7], served as the basis for our multilingual model. Each item was annotated independently by two annotators and assigned to one of four available classes: [Appropriate], [Inappropriate], [Offensive], and [Violent]. In the case of conflicting labels, both annotating instances were kept.

To explore strategies for incorporating disagreement, three multilingual datasets were created (a construction sketch is given below). First, the Duplicate All (DA) dataset, which contains all instances by their respective two annotators from the three monolingual datasets. Second, the Duplicate Disagreement (DD) dataset, in which instances where annotators disagreed appear twice with their respective conflicting labels, while instances they agreed upon appear only once, creating a more balanced training set that reflects both agreement and disagreement and potentially prevents the models from being biased towards instances where annotators agree. And third, the Remove Disagreement (RD) dataset, which consists only of instances where annotators agree. Thus, the first two datasets contain diamond standard data, while the third can be considered a gold standard dataset from which disagreement has been explicitly removed.

All instances in these datasets underwent the same preprocessing steps, such as replacing links and usernames with placeholders. This step was undertaken to mitigate any potential biases associated with certain names, as discussed in [6]. Table 1 presents an overview of the label distribution across the three multilingual training sets. The datasets used for monolingual evaluation are the unmodified evaluation sets presented in [7].

Table 1: Label distribution of the multilingual train sets

    Dataset   Acceptable   Inappropriate   Offensive   Violent
    DA        191,677      11,005          112,833     7,145
    DD        111,324      8,346           72,706      4,992
    RD        80,573       2,661           40,255      2,161
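The following is a minimal pandas sketch of how the three training-set variants could be derived from doubly-annotated data, assuming hypothetical text, label_1, and label_2 columns with one row per item; it illustrates the construction logic, not the authors' exact code.

    import pandas as pd

    items = pd.DataFrame({
        "text":    ["comment A", "comment B"],
        "label_1": ["Acceptable", "Offensive"],
        "label_2": ["Acceptable", "Violent"],   # annotators disagree on comment B
    })

    # Long format: one row per (item, annotator) pair.
    long = pd.concat([
        items[["text", "label_1"]].rename(columns={"label_1": "label"}),
        items[["text", "label_2"]].rename(columns={"label_2": "label"}),
    ])

    agree = items["label_1"] == items["label_2"]

    # DA: every annotation kept, so agreed-upon items appear twice with the same label.
    da = long

    # DD: disagreements kept twice (both labels), agreements kept only once.
    dd = pd.concat([
        items.loc[agree, ["text", "label_1"]].rename(columns={"label_1": "label"}),
        long[long["text"].isin(items.loc[~agree, "text"])],
    ])

    # RD: only items the annotators agree on, with their single shared label.
    rd = items.loc[agree, ["text", "label_1"]].rename(columns={"label_1": "label"})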
3.2 Model Selection and Fine-Tuning

Our proposed multilingual hate speech model builds on the pre-trained XLM-R transformer model [4], chosen for its proven effectiveness in cross-lingual understanding and its ability to handle a wide range of languages. This provides a robust foundation for fine-tuning and optimization, particularly since English, Italian, and Slovenian (the languages used for fine-tuning) were included in XLM-R's pre-training. To explore the various strategies for incorporating annotator disagreement during training, three model variants were fine-tuned on the previously presented datasets, referred to in the tables as MDA, MDD, and MRD, respectively.

To address class imbalance and enhance model performance on minority classes, a custom training loop with a weighted cross-entropy loss function was implemented, as proposed in [9]. The class weights were calculated to be inversely proportional to the frequency of each hate speech class within the training data. The hyperparameters for the fine-tuning process included a learning rate of 6 × 10⁻⁶, a batch size of 8, and 3 training epochs. During the training phase, the AdamW optimizer was employed to optimize the model parameters. The fine-tuning process was implemented using PyTorch.

3.3 Model Evaluation

For evaluation, the approach introduced in [7] was replicated in order to compare the performance of the multilingual classifiers to human judgment from the perspective of disagreement. This was achieved by employing identical measures to estimate the agreement between human annotators, as well as the agreement between annotators and models. Accuracy, F1 score and, most notably, Krippendorff's ordinal alpha were used to evaluate all models in this research.

Although rarely used in ML applications, Krippendorff's alpha is a robust measure for assessing inter-rater reliability that accounts for agreement beyond what might occur by chance. It is applicable across various data types (nominal, ordinal, interval, and ratio scales) and is particularly effective in dealing with missing data. The value of Krippendorff's alpha ranges from -1 to 1, where 1 indicates perfect agreement and 0 suggests agreement equivalent to chance. Generally, an alpha above 0.80 is considered strong agreement, while in hate speech datasets, alpha values range from 0.25 to 0.65. For a detailed discussion, see Krippendorff [8].
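As a rough sketch of the fine-tuning setup under the stated hyperparameters, the following shows a single weighted cross-entropy training step with XLM-R, assuming the Hugging Face transformers library; the inverse-frequency weight formula is one common instantiation of "inversely proportional to class frequency", and dataset loading, batching, and the exact loop structure in the actual experiments may differ.

    import torch
    from torch.optim import AdamW
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=4)  # Acceptable/Inappropriate/Offensive/Violent

    # Class weights inversely proportional to class frequency (DA counts, Table 1).
    counts = torch.tensor([191677., 11005., 112833., 7145.])
    weights = counts.sum() / (len(counts) * counts)
    loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

    optimizer = AdamW(model.parameters(), lr=6e-6)  # batch size 8, 3 epochs in the paper

    texts, labels = ["example comment"], torch.tensor([0])
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    logits = model(**batch).logits
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()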
4 Results

This section presents the evaluation results for the multilingual model and its variants. It starts with an evaluation from the perspective of inter-annotator and model-annotator agreement. Then, the class-specific evaluation results are presented, along with a model comparison based on the models' average scores. The models are also compared to monolingual baselines fine-tuned on data for their respective languages, including the BERT model for English, AlBERTo for Italian, and CroSloEngual for Slovenian, as presented in [7].

4.1 Inter-Annotator and Model-Annotator Agreement

The inter-annotator agreement was computed on the evaluation sets for each language using Krippendorff's alpha and accuracy. The same measures were also used to compute the agreement between the annotators and the models. The results are presented in Table 2.

Table 2: Inter-Annotator Agreement compared to model-annotator agreement in terms of Krippendorff's ordinal alpha (α) and Accuracy (Acc.) for the models Multilingual Duplicate All (MDA), Multilingual Duplicate Disagreement (MDD), and Multilingual Remove Disagreement (MRD), based on the language-specific evaluation sets

    Dataset     Inter-Ann. α   Inter-Ann. Acc.   MDA α   MDA Acc.   MDD α   MDD Acc.   MRD α   MRD Acc.
    English     58.19          82.91             55.89   79.97      50.18   76.47      57.90   81.41
    Italian     57.00          81.79             58.29   82.00      56.15   80.43      57.84   82.69
    Slovenian   56.62          79.43             55.74   78.60      52.95   76.52      55.15   78.84

First, in the case of inter-annotator agreement, annotators agree around 80% of the time in terms of accuracy, with accuracy scores between 79% and 82% across all three datasets. However, accuracy accounts neither for class imbalance nor for the ordering of the classes. A more appropriate estimate of the agreement is computed through Krippendorff's ordinal alpha: here, the annotators achieve agreement scores between 0.56 and 0.58 across the three languages.

Second, the same metrics were applied to the agreement between annotators and models. The results demonstrate a consistent level of agreement between the models and annotators across all cases. Based on accuracy scores, all models align with at least one annotator approximately 80% of the time, with alpha values comparable to inter-annotator scores. In most instances, the models reach the upper limit of inter-annotator agreement, and in some cases even exceed it (e.g., MDA on Italian). This suggests that the models are effectively learning consistent patterns or biases that align well with one or more annotators. Such outcomes are expected in scenarios where annotator disagreement is largely due to subjective interpretation. This should not be construed as the model being inherently superior, but rather as an indication of its efficiency in modeling the predominant patterns present in the training data.

Third, a comparison between the multilingual variants shows that the Duplicate Disagreement (DD) strategy consistently yields worse alpha scores, meaning that emphasizing disagreement alone might be detrimental in training. No consistent difference between Duplicate All (DA) and Remove Disagreement (RD) is evident from the experiments.
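An agreement computation like the one behind Table 2 can be reproduced, for example, with the krippendorff package on PyPI; the label vectors below are invented, with the classes ordered from least to most severe and mapped to integers.

    import numpy as np
    import krippendorff

    # Rows are raters (annotator vs. annotator, or annotator vs. model), columns
    # are items; 0=Acceptable, 1=Inappropriate, 2=Offensive, 3=Violent.
    annotator_1 = [0, 0, 2, 3, 1, 0]
    annotator_2 = [0, 1, 2, 2, 1, 0]

    reliability_data = np.array([annotator_1, annotator_2], dtype=float)
    alpha = krippendorff.alpha(reliability_data=reliability_data,
                               level_of_measurement="ordinal")
    accuracy = np.mean(np.array(annotator_1) == np.array(annotator_2))
    print(f"ordinal alpha = {alpha:.3f}, accuracy = {accuracy:.3f}")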
4.2 Model Comparison

To evaluate the performance of the models across the four hate speech classes, the F1 score was used. Additionally, the combined (weighted) F1 score was computed for each model to assess its overall performance, and to determine the best-performing model, the weighted F1 scores were averaged across all three languages.

Table 3 shows the results achieved by each of the models on the English evaluation set. The results show that the multilingual model outperforms the baseline monolingual English model on all classes except the [Appropriate] class, on which it still performs competitively. The variant that achieved the highest scores on the minority classes is the MDA model, with an F1 score of 39.16 for the [Inappropriate] class and 27.82 for the [Violent] class. This is most likely due to the weighted cross-entropy loss function, which was effective in improving performance on underrepresented classes, a procedure not performed in [7].

Table 3: Model evaluation results in terms of class-specific F1 scores on the English dataset. The Total score was calculated using the weighted F1 score. The first three models represent the monolingual baselines; the subsequent models represent the multilingual models

    Model   Appropriate   Inappropriate   Offensive   Violent   Total
    EN      89.38         28.95           68.36       24.17     83.44
    IT      85.25         13.81           0.41        0.00      63.39
    SL      88.01         25.17           49.69       2.88      77.71
    MDA     86.10         39.16           68.24       27.82     81.09
    MDD     83.33         34.16           65.07       24.52     78.20
    MRD     87.43         29.90           69.02       27.27     82.18

Similar patterns emerge on the Italian dataset (Table 4). The multilingual model is competitive with the monolingual model while outperforming the Italian baseline on the minority classes. The highest scores on the most important classes, [Violent] and [Offensive], were achieved by the MDA variant, once again showing the superiority of the Duplicate All (DA) strategy.

Table 4: Model evaluation results in terms of class-specific F1 scores on the Italian dataset

    Model   Appropriate   Inappropriate   Offensive   Violent   Total
    EN      86.27         1.28            1.05        0.00      67.42
    IT      91.32         58.46           59.02       40.34     83.22
    SL      86.23         0.76            3.25        0.00      65.95
    MDA     89.77         58.45           60.42       44.97     82.38
    MDD     88.95         56.04           58.31       39.85     81.19
    MRD     90.41         55.46           59.49       38.78     82.50

In the case of the Slovenian dataset, the observed phenomena differ slightly from the previous ones. The evaluation results are presented in Table 5. Here, two of the multilingual variants (MDA and MRD) outperform the Slovenian monolingual model overall, despite predicting worse on the [Appropriate] class. Notably, the monolingual model outperforms all models on the [Violent] class, which was not the case for the other languages. This could be due to language specifics that the multilingual models fail to capture, or to the specifics of the CroSloEngual BERT, which is also heavily pre-trained on Croatian and Slovenian data. Once again, the DA disagreement strategy shows slight superiority over RD.

Table 5: Model evaluation in terms of class-specific F1 scores on the Slovenian dataset

    Model   Appropriate   Inappropriate   Offensive   Violent   Total
    EN      79.93         3.98            2.34        0.00      53.84
    IT      79.84         3.80            1.24        0.00      53.43
    SL      85.70         43.69           65.26       29.12     78.39
    MDA     84.30         45.22           69.69       24.79     78.88
    MDD     82.33         43.39           68.59       23.84     77.19
    MRD     84.98         38.47           68.40       15.50     78.80

Finally, Table 6 shows the average scores of all models, obtained by averaging their combined (weighted) F1 scores across all three languages. Summarizing the multilingual superiority, these final results show how monolingual models falter drastically on unseen languages, while the multilingual models have the capacity to reach the inter-annotator agreement ceiling for all languages.

Table 6: Average performance of models based on class-weighted F1 scores across three languages

    Model   Avg. Weighted F1 Score (all languages)
    EN      68.23
    IT      66.68
    SL      74.02
    MDA     80.78
    MDD     78.86
    MRD     81.16

While the overall results suggest that the Remove Disagreement (RD) gold-standard strategy for incorporating disagreement is best, one should be cautious when drawing such conclusions. The class-specific results show that the Duplicate All (DA) strategy outperforms it on all the classes most relevant to hate speech detection, except for [Appropriate], which is the least relevant class. Another difference is that the MDA model involved training longer on the same data, which might have resulted in improvement on the minority classes and saturation on the majority class. For a fairer future comparison, the fine-tuning process on gold standard data should be adjusted accordingly. The MDA variant of the model is available at: https://huggingface.co/IMSyPP/hate_speech_multilingual.
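The class-specific and combined scores reported in Tables 3-6 correspond to standard per-class and weighted F1 computations, which could be obtained as follows with scikit-learn (the label vectors here are invented):

    from sklearn.metrics import f1_score

    y_true = [0, 0, 2, 3, 1, 2, 0]   # gold labels (0=Acceptable ... 3=Violent)
    y_pred = [0, 1, 2, 3, 1, 0, 0]   # model predictions

    per_class = f1_score(y_true, y_pred, average=None, labels=[0, 1, 2, 3])
    weighted = f1_score(y_true, y_pred, average="weighted")
    print(per_class, weighted)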
5 Discussion

In recent years, automated hate speech detection has become crucial for moderating online content and mitigating the negative impact on social dynamics within online communities. This research proposes a novel multilingual hate speech model to address these challenges on a broader scale. The following discusses the main findings.

First, the inter-annotator agreement and the agreement between annotators and models suggest that inter-annotator agreement sets an intrinsic limit on model performance. Models are limited by the quality and consistency of the annotated data, which directly affects their ability to accurately predict unseen data. However, incorporating areas of disagreement into model development can lead to more robust models capable of handling ambiguous cases by employing one of the several available strategies for incorporating disagreement.

Second, the multilingual model consistently surpassed the monolingual baselines, achieving the inter-annotator agreement ceiling across all languages. This success can be attributed partly to the ability to leverage patterns learned from multiple languages, partly to the vast amounts of data incorporated into state-of-the-art pre-trained multilingual models, and partly to the class weighting scheme employed in the fine-tuning. These findings address the first research question, demonstrating that a multilingual hate speech classifier trained on diamond standard data outperforms its monolingual counterparts.

Finally, this research contributes substantially to hate speech classification in a multilingual context by introducing a novel multilingual hate speech detection model and making it available on the Hugging Face platform. Our model underscores the importance of incorporating inter-annotator disagreement into model development, challenging the reliance on gold standard data in subjective tasks such as hate speech detection.

6 Conclusions

This paper advances automatic hate speech detection by introducing a novel multilingual model fine-tuned on the state-of-the-art XLM-R transformer. By leveraging multilinguality, the model significantly outperforms monolingual baselines, demonstrating its effectiveness across diverse linguistic contexts. This highlights the potential of multilingual approaches in improving hate speech detection, especially in scenarios where content spans multiple languages.

Additionally, this research incorporates inter-annotator disagreement into the fine-tuning process using diamond standard data, offering a valuable alternative to traditional gold-standard models. By embracing rather than ignoring annotator disagreement, the model better reflects the nuances of subjective annotations, enhancing its real-world applicability. However, while this approach shows promise, annotator disagreement continues to present challenges, indicating that further work is needed to fully address its impact on model performance.

Future research could extend this work by evaluating the models on additional languages, exploring alternative baseline models, and refining strategies for incorporating annotator disagreement and handling minority classes. As online hate speech extends its impact, developing robust, multilingual content moderation systems is crucial to maintaining safe and inclusive digital environments.

7 Acknowledgments

The authors acknowledge partial financial support from the Slovenian Research Agency (research core funding no. P2-103).

References

[1] Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate speech detection is not as easy as you may think: a closer look at model validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 45–54.
[2] Alexander Brown. 2017. What is hate speech? Part 2: Family resemblances. Law and Philosophy, 36, 561–613.
[3] Naganna Chetty and Sreejith Alathur. 2018. Hate speech review in the context of online social networks. Aggression and Violent Behavior, 40, 108–118.
[4] Alexis Conneau et al. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
[5] Tommaso Fornaciari, Alexandra Uma, Silviu Paun, Barbara Plank, Dirk Hovy, Massimo Poesio, et al. 2021. Beyond black & white: leveraging annotator disagreement via soft-label multi-task learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
[6] Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115, 16, E3635–E3644.
[7] Petra Kralj Novak, Teresa Scantamburlo, Andraž Pelicon, Matteo Cinelli, Igor Mozetič, and Fabiana Zollo. 2022. Handling disagreement in hate speech modelling. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. Springer, 681–695.
[8] Klaus Krippendorff. 2018. Content analysis: an introduction to its methodology. Sage Publications.
[9] Andraž Pelicon, Syrielle Montariol, and Petra Kralj Novak. 2023. Don't start your data labeling from scratch: opsala-optimized data sampling before labeling. In International Symposium on Intelligent Data Analysis. Springer, 353–365.
[10] Juan Manuel Pérez et al. 2023. Assessing the impact of contextual information in hate speech detection. IEEE Access, 11, 30575–30590.
[11] Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, and Viviana Patti. 2021. Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, 55, 477–523.
[12] Alexandra N Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2021. Learning from disagreement: a survey. Journal of Artificial Intelligence Research, 72, 1385–1470.
[13] William Warner and Julia Hirschberg. 2012. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, 19–26.
[14] Wenjie Yin and Arkaitz Zubiaga. 2021. Towards generalisable hate speech detection: a review on obstacles and solutions. PeerJ Computer Science, 7, e598.
Predicting Pronunciation Types in the Sloleks Morphological Lexicon of Slovene

Jaka Čibej (jaka.cibej@ff.uni-lj.si, jaka.cibej@ijs.si; Faculty of Arts, University of Ljubljana; Jožef Stefan Institute, Ljubljana, Slovenia)

https://doi.org/10.70314/is.2024.sikdd.2

Abstract

We present an experiment dealing with the automatic prediction of pronunciation types for lemmas in the Sloleks Morphological Lexicon of Slovene. We perform a statistical analysis on a number of mostly n-gram-based features and use a set of statistically significant features to train and test several machine learning models that discriminate between lemmas for which a phonetic transcription can be generated automatically using Slovene grapheme-to-phoneme (G2P) conversion rules (e.g. Novak) and lemmas whose pronunciation follows other G2P rules (e.g. Shakespeare).

Keywords: grapheme-to-phoneme conversion, pronunciation types, morphological lexicon, proper nouns, Slovene
1 Introduction

The Sloleks Morphological Lexicon of Slovene [2] is the largest open-access database containing machine-readable information on the morphological properties of Slovene lemmas (e.g. miza 'table', noun, common, feminine) and their inflected forms (e.g. mize, singular, genitive; mizo, singular, accusative). Since version 2.0 [3], each lemma and inflected form also contains accentuated forms (e.g. míza) and phonetic transcriptions in the International Phonetic Alphabet (IPA) and its equivalent X-SAMPA (e.g. IPA: /"mi:za/, X-SAMPA: /"mi:za/). Both transcriptions were generated automatically from accentuated forms, first in version 2.0 using a rudimentary rule-based system, then again in 3.0 with a greatly improved and linguistically informed rule-based grapheme-to-phoneme (G2P) conversion tool for Slovene. (The Slovene G2P tool is part of Pregibalnik, a piece of software used for the automatic expansion of the Sloleks Morphological Lexicon of Slovene: https://github.com/clarinsi/SloInflector. It was developed within the Development of Slovene in the Digital Environment project. The Slovene G2P converter is also available as an API service: https://orodja.cjvt.si/pregibalnik/g2p/docs.)

Rule-based G2P conversion for Slovene (particularly from accentuated forms) yields very good results and leaves only a minority of issues to be resolved manually, because in terms of its orthographic depth, Slovene features a shallow orthography ([9]) in which each grapheme in the alphabet generally corresponds to one phoneme (see e.g. [4]) and the spelling-sound correspondence is relatively direct ([1]; [11]): the pronunciation rules allow words to be pronounced correctly based on their graphemic representation, with some exceptions and several predictable phoneme assimilations (such as the assimilation of voiceless consonant phonemes to their voiced equivalents, glasba 'music', IPA: /"gla:zba/, or vice-versa, voiced-to-voiceless, podpreti 'to support', IPA: /pOt"pre:ti/).

However, not all entries in Sloleks follow Slovene G2P principles. For a number of words, particularly proper nouns denoting people (Shakespeare, Sharon), locations (Sydney, Birmingham), inhabitants (Newyorčan 'New Yorker'), etc., as well as adjectives derived from proper nouns (aachenski 'pertaining to Aachen', Acronijev 'belonging to Acroni'), the phonetic transcription cannot be generated using Slovene G2P rules. In such cases with foreign orthographic elements that indicate relations between graphemes and phonemes that are unusual for Slovene, Slovene linguistic and lexicographic practice (see e.g. [5]) first requires a transliteration into the closest equivalent using Slovene graphemes, which can then be used to generate the phonetic transcription using Slovene G2P rules (e.g. Newyorčan → njújórčan → IPA: /"nju:"jo:rtSan/).

Because of this, it is necessary to discriminate between different pronunciation types: categories of words that follow Slovene G2P rules (Slovene G2P) and those that do not (e.g. Other G2P; more on this in Section 2). Pronunciation types denote the manner in which the phonetic transcription of the word can be generated. In some cases, assigning the pronunciation type to a lemma is trivial: if the lemma contains a grapheme that is not part of the Slovene alphabet (e.g. x, y, w, q), it belongs in the Other G2P category (e.g. Byron, Oxford). (Although ć and đ are not part of the Slovene alphabet, they are phonemically transparent and frequently occur in names of Slovene citizens, so they are not counted as foreign characters for the purposes of this task.) There are, however, many exceptions that belong in the Other G2P category despite being comprised entirely of Slovene graphemes (e.g. Matt, Sharon).

In Sloleks 3.0, the first cca. 100,000 lemmas that had been part of version 2.0 were manually annotated with pronunciation types, whereas the 264,000 new entries (added automatically from the Gigafida 2.0 Corpus of Modern Standard Slovene [6]) still lack this information. Because manual annotation from scratch is time-consuming, we performed an experiment to determine to what degree the pronunciation type can be predicted automatically by relying on the scarce linguistic and morphosyntactic information that can be extracted from an individual lemma.

The paper is structured as follows: we describe the dataset used for the statistical analysis and the machine learning experiment (Section 2), as well as the process of feature selection (Section 3). We train several machine learning models and evaluate their performance using 10-fold cross-validation (Section 4). Finally, we manually evaluate a sample of automatically annotated entries (Section 5) and conclude the paper with our plans for future work (Section 6).
2 Dataset

Sloleks 3.0 contains a total of 365,340 entries, but only approximately 28% have been manually assigned one of 8 pronunciation types (as shown in Table 1); note that all the inflected forms within an entry effectively inherit the pronunciation type. For the classification task, we focus only on the two most frequent pronunciation types (Other G2P and Slovene G2P). Symbols in Sloleks are rare, as are entries in the Ambiguous G2P category (where an entry can either follow Slovene G2P rules or not, depending on the context, e.g. Amanda as a Slovene name: /am"a:nda/, or as an English name with a pronunciation adjusted to the Slovene set of phonemes: /9m"E:nda/). Abbreviations and numerals are easily identifiable, and while acronyms have a separate manner of generating phonetic transcriptions, which also depends on their morphological patterns, they are also mostly identifiable with rules. Because of its rarity and similarity to Slovene G2P, the Slovene G2P with minor deviation category was merged into Slovene G2P for the classification task.

Table 1: Lemmas in Sloleks 3.0 by Pronunciation Type

    Pronunciation Type                 Frequency   %
    -                                  264,538     72.41%
    Slovene G2P                        94,750      25.93%
    Other G2P                          3,066       0.84%
    Numeral                            1,840       0.50%
    Acronym                            845         0.23%
    Slovene G2P with minor deviation   113         0.03%
    Abbreviation                       70          0.02%
    Ambiguous G2P                      69          0.02%
    Symbol                             49          0.01%
    Total                              365,340     100.00%

In terms of their morphosyntactic features, the Other G2P lemmas mostly consist of possessive adjectives and proper nouns, collectively accounting for cca. 90% of the category (as shown in Table 2), but only 15% of the portion of Sloleks annotated with pronunciation types.

Table 2: Lemmas in Sloleks 3.0 with Other G2P Pronunciation Type by Morphosyntactic Properties

    Morphosyntactic Properties   Frequency   %
    Adjective, possessive        1,092       35.62%
    Noun, proper, masculine      958         31.25%
    Noun, proper, feminine       713         23.26%
    Adjective, general           142         4.63%
    Noun, common, masculine      127         4.14%
    Noun, common, feminine       20          0.65%
    Adverb, general              10          0.33%
    Noun, common, neuter         2           0.07%
    Verb, main, imperfective     2           0.07%
    Total                        3,066       100.00%

The final dataset for statistical analysis and machine learning consisted of 94,863 Slovene G2P lemmas (e.g. dekadentnost, Košak, prefiltriran) and 3,066 Other G2P lemmas (e.g. Elizabeth, Presley, Sinclaire).
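The trivial rule mentioned in the introduction, i.e. flagging lemmas that contain a grapheme outside the Slovene alphabet, could be sketched in Python as follows; the grapheme set is a simplification, and as noted above the rule misses Other G2P lemmas made up entirely of Slovene graphemes:

    # Slovene alphabet, with c+/d- (c, d with diacritics) counted as transparent.
    SLOVENE_GRAPHEMES = set("abcčdefghijklmnoprsštuvzž") | {"ć", "đ"}

    def trivially_other_g2p(lemma):
        """True if the lemma contains a grapheme outside the Slovene alphabet
        (e.g. x, y, w, q), which places it in the Other G2P category."""
        return any(ch not in SLOVENE_GRAPHEMES for ch in lemma.lower())

    print(trivially_other_g2p("Byron"))   # True ('y' is not a Slovene grapheme)
    print(trivially_other_g2p("Matt"))    # False, despite Matt being Other G2P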
Noun, proper, masculine | 958 | 31.25%
Noun, proper, feminine | 713 | 23.26%
Adjective, general | 142 | 4.63%
Noun, common, masculine | 127 | 4.14%
Noun, common, feminine | 20 | 0.65%
Adverb, general | 10 | 0.33%
Noun, common, neuter | 2 | 0.07%
Verb, main, imperfective | 2 | 0.07%
Total | 3,066 | 100.00%

The final dataset for statistical analysis and machine learning consisted of 94,863 Slovene G2P lemmas (e.g. dekadentnost, Košak, prefiltriran) and 3,066 Other G2P lemmas (e.g. Elizabeth, Presley, Sinclaire).

3 Statistical Analysis and Feature Selection

From each lemma, we extracted a series of features that could help discriminate between the two classes: (a) the percentage of Slovene G2P graphemes within the lemma (i.e. graphemes of the Slovene alphabet as well as ć and đ); (b) morphosyntactic features (e.g. noun, proper, masculine); (c) relative frequencies(5) of character-level uni-, bi-, and trigrams within the lower-cased lemma (e.g. Matt → f_r(m), f_r(a), ..., f_r(ma), f_r(at), ..., f_r(mat), ...); (d) relative frequencies of character-level uni-, bi-, and trigrams from a robust CVC-conversion of the lemma, substituting consonant graphemes with C and vowel graphemes with V (e.g. Matt → CVCC → f_r(C), f_r(V), ..., f_r(CV), f_r(VC), ..., f_r(CVC), ...); and (e) relative frequencies of character-level uni-, bi-, and trigrams from a finegrained CVC-conversion of the lemma(6) (e.g. Matt → ZVKK → f_r(Z), f_r(V), ..., f_r(ZV), f_r(VK), ..., f_r(ZVK), ...).

For (c), (d), and (e), the initial and final uni-, bi-, and trigrams of the lemma were extracted separately as well, as in some cases the position of the n-gram in the word can be indicative of one class over another.

For general character-level n-grams, the first 1,498 with a frequency of at least 500 across all Sloleks 3.0 lemmas were analyzed; these cover cca. 88.34% of all n-gram occurrences. For robust CVC and finegrained CVC n-grams, all were analyzed. We performed the Kruskal–Wallis H test [7] (k=2, n=97,056) on a total of 6,148 features, out of which 2,490 (40%) were statistically significant.(7) Statistically significant features by category are shown in Table 3. 1,146 features are more indicative of Slovene G2P and 1,344 are more indicative of Other G2P.

Table 3: Statistically Significant Features by Category

Feature Category | Number
Percentage of Slovene G2P characters | 1
Morphosyntactic features | 3
General character-level n-grams | 1,119
Initial character-level n-grams | 398
Final character-level n-grams | 468
General robust CVC n-grams | 66
Initial robust CVC n-grams | 44
Final robust CVC n-grams | 39
General finegrained CVC n-grams | 157
Initial finegrained CVC n-grams | 102
Final finegrained CVC n-grams | 93
Total | 2,490

As shown in Table 4, only three of the top 10 general n-grams indicative of Other G2P actually contain non-Slovene G2P characters, confirming that detecting lemmas from the Other G2P category is more complex and requires more than simply taking into account non-Slovene G2P graphemes.

Table 4: Top 10 Statistically Significant General Character-Level n-Grams by Effect Size (η²)

n-Gram | H | p | η² | Means
y | 11509.36 | ≤ 0.0001 | 0.1186 | μ_S < μ_O
w | 9595.25 | ≤ 0.0001 | 0.0989 | μ_S < μ_O
ch | 7558.60 | ≤ 0.0001 | 0.0778 | μ_S < μ_O
ll | 6295.96 | ≤ 0.0001 | 0.0649 | μ_S < μ_O
ss | 3804.26 | ≤ 0.0001 | 0.0392 | μ_S < μ_O
nn | 3220.65 | ≤ 0.0001 | 0.0332 | μ_S < μ_O
th | 2973.89 | ≤ 0.0001 | 0.0306 | μ_S < μ_O
wa | 2761.53 | ≤ 0.0001 | 0.0284 | μ_S < μ_O
tt | 2745.10 | ≤ 0.0001 | 0.0283 | μ_S < μ_O
co | 2571.20 | ≤ 0.0001 | 0.0265 | μ_S < μ_O

Footnotes:
(1) The Slovene G2P tool is part of Pregibalnik, a piece of software used for the automatic expansion of the Sloleks Morphological Lexicon of Slovene: https://github.com/clarinsi/SloInflector. It was developed within the Development of Slovene in the Digital Environment project. The Slovene G2P converter is also available as an API service: https://orodja.cjvt.si/pregibalnik/g2p/docs
(2) Although ć and đ are not part of the Slovene alphabet, they are phonemically transparent and frequently occur in names of Slovene citizens, so they are not counted as foreign characters for the purposes of this task.
(3) It should be noted that all the inflected forms within the entry effectively inherit the pronunciation type.
(4) Symbols in Sloleks are rare, along with entries within the Ambiguous G2P category (where an entry can either follow Slovene G2P rules or not, depending on the context – e.g. Amanda as a Slovene name: /am"a:nda/, or as an English name with a pronunciation adjusted to the Slovene set of phonemes: /9m"E:nda/). Abbreviations and numerals are easily identifiable, and while acronyms have a separate manner of generating phonetic transcriptions, which also depends on their morphological patterns, they are also mostly identifiable with rules. Because of its rarity and similarity to Slovene G2P, the Slovene G2P with minor deviation category was merged into Slovene G2P for the classification task.
(5) Relative frequencies were calculated as f_r(x_n) = f_a(x_n) / Σ_y f_a(y_n), i.e. the absolute frequency of n-gram x of length n within the lemma divided by the sum of absolute frequencies of all n-grams y of length n within the lemma.
(6) In the finegrained CVC-conversion, consonant graphemes were generalized into more finegrained categories, e.g. graphemes denoting Slovene sonorants (M), voiced (G) and voiceless obstruents (K), foreign consonants (X), etc.
(7) Effect size was calculated as η² = (H − k + 1)/(n − k), as reported in [10].
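To make the feature extraction and significance testing above concrete, here is a minimal, illustrative Python sketch. It is not the authors' code: the vowel set, the helper names (robust_cvc, rel_ngram_freqs), and the toy values are assumptions; the relative-frequency and effect-size formulas follow footnotes 5 and 7.

from collections import Counter
from scipy.stats import kruskal

VOWELS = set("aeiou")  # assumption: a simplified set of Slovene vowel graphemes

def robust_cvc(lemma: str) -> str:
    """Robust CVC-conversion: each grapheme becomes C (consonant) or V (vowel)."""
    return "".join("V" if ch in VOWELS else "C" for ch in lemma.lower())

def rel_ngram_freqs(s: str, n: int) -> dict:
    """Relative frequencies f_r(x_n) = f_a(x_n) / sum_y f_a(y_n) (footnote 5)."""
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

# Features (c) and (d) for the lemma "Matt"
lemma = "matt"
features = {}
for n in (1, 2, 3):
    features.update({f"char_{g}": f for g, f in rel_ngram_freqs(lemma, n).items()})
    features.update({f"cvc_{g}": f for g, f in rel_ngram_freqs(robust_cvc(lemma), n).items()})

# Kruskal-Wallis H test for one feature across the two classes, with the
# effect size eta^2 = (H - k + 1) / (n - k) from footnote 7; values are toy data.
slovene_vals = [0.00, 0.00, 0.10]  # feature values in Slovene G2P lemmas
other_vals = [0.20, 0.30, 0.25]    # feature values in Other G2P lemmas
H, p = kruskal(slovene_vals, other_vals)
k, n_total = 2, len(slovene_vals) + len(other_vals)
eta_sq = (H - k + 1) / (n_total - k)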
4 Pronunciation Type Prediction

The identified features (along with several placeholder n-grams to take into account any graphemes not covered in the initial dataset) were used to develop a custom vectorizer that converts a given lemma and its lexical features, based on the MULTEXT-East v6 (MTE-6) Morphosyntactic Specifications for Slovene,(8) into a 2,500-dimensional numerical vector. The entire dataset was converted into vectors and split into a training set (80%) and a test set (20%), both stratified by class.(9) Three models (Linear Support Vector Classifier (LinearSVC), Multinomial Naive Bayes Classifier (Multin. NB), and k Nearest Neighbors Classifier (kNN)) were trained and evaluated with 10-fold cross-validation.(10) The results are listed in Table 5 and show that LinearSVC outperforms the other two models.

Table 5: Model Performance Based on 10-Fold Cross-Validation

Model | A | BA | P | R | F1 | ROC AUC
LinearSVC | 99.08 | 87.87 | 96.36 | 87.87 | 91.64 | 98.89
Multin. NB | 97.38 | 79.17 | 78.12 | 79.17 | 78.62 | 96.55
kNN (k=5) | 98.25 | 75.17 | 93.67 | 75.17 | 81.74 | 91.63
Majority | 96.87 | – | – | – | – | –

Footnotes:
(8) MTE-6: https://nl.ijs.si/ME/V6/msd/html/msd-sl.html. The vectorizer uses Slovene morphosyntactic tags, e.g. Slz (S – noun, l – proper, z – feminine).
(9) All models were trained using the Python library scikit-learn [8].
(10) A, BA, P, R, and F1 refer to accuracy, balanced accuracy, macro-precision, macro-recall and macro-F1, respectively.
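Footnote 9 confirms that scikit-learn was used; the following sketch reproduces the setup just described (80/20 stratified split, the three classifiers, 10-fold cross-validation) on random stand-in data. Hyperparameters other than k=5 are not given in the paper, so library defaults are assumed.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_validate

# X: one ~2,500-dimensional feature vector per lemma; y: 0 = Slovene G2P, 1 = Other G2P.
# Toy random data stands in for the real vectorized lexicon.
rng = np.random.default_rng(0)
X = rng.random((1000, 2500))
y = rng.integers(0, 2, 1000)

# 80/20 split, stratified by class, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "LinearSVC": LinearSVC(),
    "Multin. NB": MultinomialNB(),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_validate(model, X_train, y_train, cv=10,
                            scoring=["accuracy", "balanced_accuracy", "f1_macro"])
    print(name, {m: s.mean() for m, s in scores.items() if m.startswith("test_")})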
All three models exhibit above-baseline accuracy compared to the majority classifier, but Multinomial NB and kNN perform much worse in terms of balanced accuracy as well as precision and, in the case of kNN, recall. Recall is also somewhat lower with LinearSVC, which is to be expected – some Other G2P lemmas might contain no indicative n-grams and are thus hard to detect; on the other hand, once identified, the model is very precise in its prediction.

Table 6 shows the confusion matrix for the LinearSVC model tested on the 20% stratified test dataset.

Table 6: Confusion Matrix for Linear Support Vector Classifier

↓ Predicted \ True → | Slovene G2P | Other G2P | Σ
Slovene G2P | 18,939 | 140 | 19,079
Other G2P | 34 | 473 | 507
Σ | 18,973 | 613 | –

The model very rarely misclassifies Slovene G2P lemmas, and more frequently errs with Other G2P lemmas. A closer inspection of the misclassified examples reveals several errors in the original dataset: Beethoven, Ratzinger, Rotterdam, Franco, Oberstdorf, and Keller were in fact correctly classified as Other G2P, but they are miscategorized as Slovene G2P in the original dataset. Other misclassifications include examples of foreign proper nouns and possessive adjectives that contain grapheme combinations unusual for Slovene (e.g. Andreas, Aurelio, Hilton, Simpsonov), but whose pronunciation can still be derived from their graphemic representation (e.g. Andreas → IPA: /and"re:as/). Similarly, the misclassifications of Slovene G2P lemmas as Other G2P lemmas include examples such as Doneck, Barson, Bronson, Piersanti, and Faustini. While these are proper nouns of foreign origin, their Slovene pronunciation can either be fully discerned from their graphemic representation (e.g. Doneck → IPA: /dO"ne:tsk/), or it only differs slightly from what Slovene grapheme-to-phoneme conversion would produce (e.g. Faustini → automatically converted IPA: /faus"ti:ni/; correct IPA: /faus"ti:ni/).

5 Manual Evaluation

We trained a new instance of the LinearSVC model on the entire dataset and used it to annotate the remaining cca. 264,000 lemmas from Sloleks 3.0 with no pronunciation type, resulting in 86,730 lemmas with Other G2P and 177,808 lemmas with Slovene G2P. We performed a preliminary manual evaluation consisting of a random sample of 100 examples from each class. The results are shown in the confusion matrix in Table 7. Although the sample is too small to be representative of the whole, it indicates that the model performs well even on unseen data, achieving an accuracy of 88.50% (P=0.91, R=0.87, F1=0.89) over a majority baseline accuracy of 50.00%.

Table 7: Confusion Matrix for Manual Evaluation

↓ Predicted \ True → | Slovene G2P | Other G2P | Σ
Slovene G2P | 86 | 9 | 95
Other G2P | 14 | 91 | 105
Σ | 100 | 100 | –

The misclassifications of Other G2P as Slovene G2P include examples such as Mukhamedov, Beatli, Livenza, and Preidler, with limited indicators that the words belong to the Other G2P category. Most graphemes in these examples are pronounced according to Slovene G2P criteria, with the exception of individual n-grams ('nz', 'ei', 'kh'), some of which have not been included in the set of features. In other examples, only one or two vowel graphemes are indicative of Other G2P pronunciation (e.g. Trendlina, which is also a lemmatization error – the correct lemma is Trendline – and Sanberg), and the pronunciation of single vowel graphemes appears harder to predict than that of consonant graphemes or combinations thereof. On the other hand, Other G2P lemmas misclassified as Slovene G2P include Andersonov, Atkinsov, and Batmanov, in which the grapheme 'a' is pronounced as /E/, but this cannot be discerned from the graphemic representation itself. Other misclassified examples pertain more obviously to Other G2P, e.g. Dorfmeister, Faulknerjev, Flaubertov, Heisenbergov, Balfourjev. This might indicate that not all indicative n-grams have been included as features (e.g. 'ei', 'ou'), possibly for lack of evidence in the original dataset or because they are less frequent and were not included in the initial batch of statistical tests. As the lexicon expands with new entries, the model will be updated with new examples and new features to potentially improve performance.
6 Conclusion

In the paper, we presented the results of an attempt to automatize the assignment of pronunciation types to lemmas in the Sloleks Morphological Lexicon of Slovene. The results show that a model based on a series of mostly n-gram features can provide good results when discriminating between the Slovene G2P and Other G2P categories, with the best performance achieved by the Linear Support Vector Classifier. However, there is still room for improvement, particularly in terms of recall – a number of Other G2P lemmas from the test set were misclassified as Slovene G2P, while those classified correctly were classified with a relatively high precision score. n-grams that are statistically significant as indicative of one class have proven to be useful features for model development, but because they are not evenly distributed and occur sporadically in different lemmas, it would make sense to further improve the model by performing the same statistical analysis (as described in Section 3) on the long tail of less frequent n-grams, to prepare a more comprehensive list of indicative n-grams. The current version of the model is very lightweight, and additional features should not cause it to become overencumbered.

There are several possibilities for further development of the model. Firstly, instead of using relative frequencies of n-grams as features, it would be useful to test how different measures such as TF–IDF, absolute frequencies, or even Boolean values influence the performance of the model, and potentially also to test several other machine learning algorithms (e.g. Random Forest Classifier).
Secondly, while the other pronunciation types from Sloleks 3.0 (acronyms, abbreviations, etc.) are relatively easily identifiable (but much less frequent), in the next step it would be informative to include them in the training set and test the model's performance on the full set of categories. Thirdly, a statistical analysis should be performed on the probabilities with which the model makes decisions and on the degree to which they correlate with the percentage of graphemes that differ from the shallow orthographical Slovene G2P rules (e.g. Anderson, with arguably only 'a' not following Slovene G2P rules, vs. Châteaux, where the majority of graphemes are pronounced completely differently compared to Slovene G2P rules). This would require the preparation of a separate dataset in which graphemes are manually aligned either to the graphemes of their transliterated Slovene graphemic forms (Newyorčan → njújórčan) or to their Slovene IPA transcriptions. By assigning scores that reflect the degree of orthographic depth for the individual lemma, it would be possible to use the dataset to train a regression model.

Similarly, Other G2P lemmas from Sloleks 3.0 can be manually annotated with their language of origin and transliterated according to the recently published transliteration rules of Pravopis 8.0,(11) the new orthographic manual of Slovene, which at the time of writing this paper is still in development. Such a dataset would enable the development of a model for language identification for individual lemmas and, ultimately, a model for automatizing the transliteration of lemmas of foreign origin into their Slovene equivalents. As of now, no such tool yet exists for Slovene, and even the new orthographic manual anticipates that all transliteration will be done manually, which begs the question whether at least part of the work can be automatized. This would be an important step in the development of a modern, digital infrastructure for Slovene orthography, and would facilitate the automatic expansion of modern digital dictionary databases and of datasets for automatic speech recognition.

In addition, although our preliminary experiments with LLMs (ChatGPT 3.5 and 4.0) classifying Slovene G2P and Other G2P lemmas have yielded much worse results than the best performing LinearSVC model, more systematic experiments are warranted.

As part of our future work, we intend to implement the model into Pregibalnik,(12) which is used for automatically extending the lexicon and currently assigns no pronunciation type. The model itself is available under the Apache 2.0 license on GitHub,(13) while the pronunciation type annotations will be included in future versions of Sloleks and, eventually, manually validated.

Acknowledgements

The research presented in this paper was conducted within the research project Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (J7-4642), the research programme Language Resources and Technologies for Slovene (P6-0411), and the CLARIN.SI research infrastructure, all funded by the Slovenian Research and Innovation Agency (ARIS). The author also thanks the anonymous reviewers for their constructive comments.

Footnotes:
(11) Pravopis 8.0: Pravila novega slovenskega pravopisa za javno razpravo. https://pravopis8.fran.si/, 9 August 2024.
(12) Pregibalnik: https://github.com/clarinsi/SloInflector; the entire tool is also available as an API service: https://orodja.cjvt.si/pregibalnik/docs
(13) GitHub: https://github.com/jakacibej/sikdd2024_predicting_pronunciation_types
References

[1] Derek Besner and Marilyn Chapnik Smith. 1992. Basic processes in reading: Is the orthographic depth hypothesis sinking? In Ram Frost and Leonard Katz, editors, Orthography, Phonology, Morphology, and Meaning. Advances in Psychology, Vol. 94. North-Holland, 45–66. doi: 10.1016/S0166-4115(08)62788-0.
[2] Jaka Čibej et al. 2022. Morphological lexicon Sloleks 3.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1745.
[3] Kaja Dobrovoljc, Simon Krek, Peter Holozan, Tomaž Erjavec, Miro Romih, Špela Arhar Holdt, Jaka Čibej, Luka Krsnik, and Marko Robnik-Šikonja. 2019. Morphological lexicon Sloleks 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1230.
[4] Florina Erbeli and Karmen Pižorn. 2012. Reading ability, reading fluency and orthographic skills: The case of L1 Slovene English as a foreign language students. Center for Educational Policy Studies Journal, 2(3), 119–139. https://files.eric.ed.gov/fulltext/EJ1130208.pdf.
[5] Nataša Gliha Komac et al. 2015. Koncept novega razlagalnega slovarja slovenskega knjižnega jezika. Inštitut za slovenski jezik Frana Ramovša ZRC SAZU. https://fran.si/179/novi-slovar-slovenskega-knjiznega-jezika/datoteke/Potrjeni_koncept_NoviSSKJ.pdf.
[6] Simon Krek et al. 2019. Corpus of written standard Slovene Gigafida 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1320.
[7] William H. Kruskal and W. Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621. doi: 10.1080/01621459.1952.10483441.
[8] F. Pedregosa et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[9] Anja Schüppert, Wilbert Heeringa, Jelena Golubovic, and Charlotte Gooskens. 2017. Write as you speak? A cross-linguistic investigation of orthographic transparency in 16 Germanic, Romance and Slavic languages. From Semantics to Dialectometry, 32, 303–313. ISBN: 9781848902305.
[10] Maciej Tomczak and Ewa Tomczak. 2014. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences, 1(21), 19–25.
[11] Antal van den Bosch, Alain Content, Walter Daelemans, and Beatrice de Gelder. 1994. Analysing orthographic depth of different languages using data-oriented algorithms. In Proceedings of the 2nd International Conference on Quantitative Linguistics.


Higher-Order Bibliographic Services based on bibliographic networks

Vladimir Batagelj (IMFM, Ljubljana, Slovenia; IAM and FAMNIT, UP, Koper, Slovenia; vladimir.batagelj@fmf.uni-lj.si), Jan Pisanski (Faculty of Arts, UL, Ljubljana, Slovenia; jan.pisanski@ff.uni-lj.si), Tomaž Pisanski (FAMNIT, UP, Koper, Slovenia; IMFM, Ljubljana, Slovenia; tomaz.pisanski@upr.si)

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.12

Figure 1: The largest co-author groups at level 10 at the University of Primorska until 2024. [network drawing omitted]

Abstract

Bibliographic databases only provide basic services to users, but they could provide much richer information for specific user needs. The main reason for the delay in developing such higher-order bibliographic services is the limited access to data in proprietary databases. We expect that new open bibliographic databases like OpenAlex will encourage faster development of these services. We describe an approach based on a collection of bibliographic networks as a foundation to support the development of higher-order bibliographic services.

Keywords: bibliographic database, open access, network analysis, higher-order bibliographic service, prototype, OpenAlex

1 Introduction

From special bibliographies (BibTeX, EndNote) and bibliographic databases, it is possible to obtain data about works (papers, books, reports, etc.) on selected topics. A typical work description contains the following data: authors, title, publisher/journal, publication year, and pages.
In some sources, additional data are available, including languages, classification of documents, keywords, authors' institution/country affiliation, lists of references, and the abstract. These data can be transformed into a collection of compatible two-mode networks on selected topics [5]: works × authors, works × keywords, works × countries, and other pairs of characteristics describing works. Besides these networks, we can also get the partition of works by their publication years, the partition of works by journals or publishers, the vector of the number of pages, and, in some cases, the (one-mode, works × works) citation network.

When constructing any of these networks, the first task is to specify the nodes and which relations link them. In short, the network boundary problem [16] has to be solved. This includes deciding whether a network is one-mode or two-mode and which node properties are important for the intended analyses. For specifying links, this amounts to answering a series of questions:
(1) Are the links directed?
(2) Are there different types of links (relations) to include?
(3) Can a pair of nodes be linked with multiple links?
(4) What are the weights on the links?
(5) Is the network static, or is it changing through time?

Another problem that often occurs when defining the set of nodes is the identification of nodes. The unit corresponding to a node can have different names (synonymy), or the same name can denote different units (homonymy or ambiguity). For example, in the BibTeX bibliography from the Computational Geometry Database [14] the same author appears under 7 different names: R.S. Drysdale, Robert L. Drysdale, Robert L. Scot Drysdale, R.L. Drysdale, S. Drysdale, R. Drysdale, and R.L.S. Drysdale. Insider information is needed to decide that Otfried Schwarzkopf and Otfried Cheong are the same person. At the other extreme, there are at least 57 different mathematicians with the name Wang, Li in the MathSciNet Database [20]. Its editors have tried hard, since 1985, to resolve the author identification problem during the data-entry phase. The significant growth of contributions by Chinese scientists and their full-name similarity in Roman transcriptions add additional complexity to the problem. In the future, the problem could be eliminated by implementing initiatives such as using ORCID, or by resolving the identification problem in the bibliographic databases themselves (Scopus, OpenAlex).

2 Higher-Order Bibliographic Services

The data collected in different bibliographic databases can be used to provide higher-order bibliographic and bibliometric services such as: what to read (contact/visit)? – a list of relevant articles/books (authors, institutions) on a selected topic; where to publish? – a list of journals suitable for the publication of an article, automatic suggestion of keywords; reviewer selection – a list of reviewers suitable for a submitted article; possible partners for research collaboration; a career application – a candidate's activity report draft; etc. These services target different types of users (students, researchers, teachers, decision-makers, funding agencies, research institutions, database managers, etc.).
To support this goal, we have to use high-quality data, often obtained by combining data from different databases. For the development of higher-order bibliographic and bibliometric services, open bibliographic databases such as OpenAlex are particularly welcome, as the developed services can remain open.

3 OpenAlex

The basic type of unit in a bibliographic database is the work. A user searching the database gets a list of works satisfying the query. Usually, some operations with such lists (inspection, filtering, merging, intersection, statistics, etc.) are supported. Only basic services are provided to users. Some web services also supporting other types of units (authors, institutions, research fields, conferences, etc.) were developed, such as Google Scholar [19], ScholarGPS [12], and DBLP – the computer science bibliography [10].

Our approach is based on OpenAlex [18, 9], but this information can be obtained from most bibliographic databases [13, 11]. OpenAlex indexes more than twice as many scholarly works as the leading proprietary products, and the entirety of the knowledge graph and its source code are openly licensed and freely available through data snapshots, an easy-to-use API, and a nascent user interface.

OpenAlex is based on 7 types of units (entities): W(ork), A(uthor), S(ource), I(nstitution), C(oncept), P(ublisher), and F(under) (and some additional ones such as topics, keywords, countries, continents, languages, etc.). Each unit gets its OpenAlex ID – we assume that the identification problem is solved by the database.

The simplest use of OpenAlex is through its web interface (service) https://openalex.org/ or using a direct URL request in the browser URL line. For example:
• Author's name: search the OpenAlex web service
• Known author ID: URL https://openalex.org/A5001676164
• Work with DOI: URL https://api.openalex.org/works/https://doi.org/10.1007/s11192-012-0940-1
• Known work ID: URL https://openalex.org/W2083084326
• Name of the institution: search the OpenAlex web service
• Known institution ID: URL https://openalex.org/institutions/I4210106342

This way, the OpenAlex web interface provides basic inspections of the selected unit. For example, by including a link with our OpenAlex author ID on our web page, we get a report on our publications. Similarly, we get a report on the publication activity of a selected institution.

3.1 API

An application programming interface (API) is a way for two or more computer programs or components to communicate with each other. It is a type of software interface, offering a service to other pieces of software [21]. In our case, the API enables us to use the database data from our programs. An R package supporting the use of OpenAlex is openalexR [1].

The OpenAlex API is available at https://api.openalex.org. Its response is returned in JSON format. Here is R code using the OpenAlex API to search for the IMFM institution:

setwd(wdir <- "C:/work/OpenAlex/API")
library(httr); library(jsonlite)
res <- GET("https://api.openalex.org/institutions",
  query = list(search="imfm"))
str(res)
cont <- fromJSON(rawToChar(res$content))
names(cont); str(cont)

The response data are available in the variable cont. Similarly, the API can also be used from other programming languages.

The OpenAlex query can be composed of different components. Using search, we can search for a given text across titles, abstracts, and full text. Using filter, we can limit our search to units satisfying given conditions. Using select, we can choose the data fields that will appear in the results. The query can be further controlled by some parameters. For example,

wd <- GET("https://api.openalex.org/works",
  query = list(
    search="handball",
    filter="publication_year:2015",
    select="id,title",
    page="2", per_page="200"))
names(wd)
wc <- fromJSON(rawToChar(wd$content)); names(wc)
names(wc$meta); wc$meta$count; str(wc$results)

returns the second page (with up to 200 entries) of works on handball published in the year 2015. Only information about the works' ID and title is returned.

The OpenAlex API uses paging – the list data are provided by pages. The basic paging (up to 10,000 units) is based on two parameters (page and per_page). Cursor paging is a bit more complicated than basic paging, but it allows us to access as many records as we like.
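For completeness, here is a Python sketch of cursor paging (the paper's own examples use R/httr; the requests library, the helper name fetch_all_works, and the loop structure are illustrative assumptions based on the OpenAlex API documentation, where the first request passes cursor=* and each response carries meta.next_cursor):

import requests

def fetch_all_works(params):
    """Hypothetical helper: follow next_cursor until the list is exhausted."""
    url = "https://api.openalex.org/works"
    cursor, results = "*", []
    while cursor:
        r = requests.get(url, params={**params, "per-page": 200, "cursor": cursor})
        data = r.json()
        results.extend(data["results"])
        cursor = data["meta"].get("next_cursor")  # None/absent on the last page
    return results

works = fetch_all_works({"search": "handball",
                         "filter": "publication_year:2015",
                         "select": "id,title"})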
4 A collection of bibliographic networks

We developed an R package OpenAlex2Pajek to support the creation of bibliographic networks from OpenAlex [4]. We get a collection of bibliographic networks (citation network Cite, authorship network WA, sources network WJ, keywords network WK, countries network WC), some partitions and vectors (properties of nodes: publication year, type of publication, language of publication, cited-by count, countries distinct count, referenced works), and additionally two files containing the names of works (xyzW.nam) and the names of authors (xyzA.nam). Most acquired networks are 2-mode – they link units of two different types; an ordinary or 1-mode network links units of the same type.

Currently, OpenAlex2Pajek contains three main functions: OpenAlex2PajekCite, OpenAlex2PajekAll, and coAuthorship. We split the process of creating the collection of bibliographic networks into two parts:
• determining the set W of relevant works using the saturation approach [7, page 506],
• creation of the network collection for the works from W.
The set W is determined iteratively using the function OpenAlex2PajekCite, and the collection is finally created using the function OpenAlex2PajekAll.

The function coAuthorship creates a weighted temporal network describing the co-authorship between world countries in selected time intervals. The weight of an edge is the number of works co-authored by authors from the linked countries.

In an analysis of weighted networks, the 1-neighbor skeleton is often used to get an overall insight into the network's basic structure. In the 1-neighbor skeleton, only the strongest link is kept for each node. The resulting directed network is forest-like. Non-trivial connected components in 1-neighbor skeletons are (usually) directed trees with a pair of nodes linked in both directions with the largest weight in the tree – these two arcs are usually replaced by an edge (undirected link); a small sketch of this construction follows below. In Figure 2, the 1-neighbor skeletons for the years 1990, 1995, 2000, 2010, 2015, and 2020 are presented.

Figure 2: 1-neighbor skeletons of world co-authorship for selected years. [six Pajek-drawn maps omitted; node labels are ISO country codes]
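A minimal sketch of the 1-neighbor skeleton construction, using Python/networkx rather than the authors' Pajek/R toolchain; the helper name and the toy country data are illustrative:

import networkx as nx

def one_neighbor_skeleton(G: nx.Graph) -> nx.DiGraph:
    """For each node, keep only its strongest incident link (weight attribute).
    Mutual strongest pairs end up linked by two opposite arcs, which the
    paper draws as a single undirected edge."""
    S = nx.DiGraph()
    S.add_nodes_from(G.nodes)
    for u in G.nodes:
        nbrs = G[u]
        if nbrs:  # isolated nodes keep no links
            v = max(nbrs, key=lambda x: nbrs[x].get("weight", 1))
            S.add_edge(u, v, weight=nbrs[v].get("weight", 1))
    return S

# Toy co-authorship weights between country codes
G = nx.Graph()
G.add_weighted_edges_from([("US", "GB", 120), ("US", "CN", 95),
                           ("GB", "FR", 30), ("FR", "DE", 25)])
S = one_neighbor_skeleton(G)  # arcs: US->GB, GB->US (mutual pair), CN->US, FR->GB, DE->FR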
We see that the number of isolated nodes (countries not collaborating with other countries) is decreasing. In all analyzed years the US has a leading (hub) position. In the years 1990, 1995, 2000, and 2010 the edge in the main component links US and GB, but in the years 2015 and 2020 GB is replaced by CN. In 1990, stronger secondary hubs were GB, FR, RU, JP, and DE. In the following years, some other countries – SE, ES, AU, CN, BR, ZA, and IN (BRICS) – became secondary hubs, attracting previously non-collaborating countries or geographically or linguistically close countries.

An important property of a collection of bibliographic networks is that some of them are compatible – they share a common set (most often the set of works W). This allows us to use network multiplication (defined by the product of network matrices) to compute the corresponding derived network connecting the remaining two sets [5]. For example, in the derived network AK = WA^T · WK, the entry AK[a, k] tells us in how many works the author a used the keyword k.

In bibliometric analysis, the citation network Cite has a very important role. It collects "votes" about the relevance of previous works for a given work. It is often used for solving the network boundary problem, and also for identifying the most relevant works in the collected bibliography [2, 6]. The derived network ACiA = WA^T · Cite · WA describes the citations between authors – its entry ACiA[a, b] counts the number of times author a cited author b. Similarly, in the derived network ACiK = WA^T · Cite · WK, the entry ACiK[a, k] tells us how many times the author a cited works described by the keyword k. The co-citation network is defined as the column projection of the citation network, coCi = col(Ci) = Ci^T · Ci, and the bibliographic coupling network as its row projection, biCo = row(Ci) = Ci · Ci^T.

A 2-mode network is always compatible with its transpose (on both sets). The corresponding derived networks are called projections – the row projection row(WA) = WA · WA^T and the column projection col(WA) = WA^T · WA. Both projections are ordinary weighted 1-mode networks that can be analyzed using standard network analysis methods.

For the authorship network WA, its column projection Co = WA^T · WA is the co-authorship network. Its entry Co[a, b] counts the number of works that authors a and b co-authored. It turns out that a work with k co-authors contributes k² links to the co-authorship network – works with a large number of co-authors are overrepresented in it. To treat all authors equally, the fractional approach is used [3]. In Figure 1, the largest co-authorship groups at level 10 at the University of Primorska are presented – connected components of the link cut at level 10 in the network Co. Each pair of linked authors co-authored at least 10 works in the bibliography of works with at least one co-author from the University of Primorska.

The idea of derived networks can be extended to temporal bibliographic networks [8]. Using derived networks, we enlarge the source for different statistics. Additional insight can be gained by analyzing the structure of networks and identifying important subnetworks in them [6]. A sketch of the network multiplications described above is given below.

In the following, we present an overview of typical report ingredients [7, 15]. Because of the limited available space, we decided to put the examples on Github/bavla.
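The network multiplications above map directly onto sparse-matrix products. A minimal Python sketch with scipy.sparse on toy matrices (illustrative only; the authors' pipeline uses Pajek and R):

import numpy as np
from scipy.sparse import csr_matrix

# Toy two-mode matrices: rows = works, columns = authors / keywords.
WA = csr_matrix(np.array([[1, 1, 0],    # work 0 by authors 0 and 1
                          [0, 1, 1],    # work 1 by authors 1 and 2
                          [1, 0, 0]]))  # work 2 by author 0
WK = csr_matrix(np.array([[1, 0],       # work 0 tagged with keyword 0
                          [1, 1],
                          [0, 1]]))
Cite = csr_matrix(np.array([[0, 1, 0],  # Cite[i, j] = 1: work i cites work j
                            [0, 0, 1],
                            [0, 0, 0]]))

AK = WA.T @ WK          # AK[a, k]: in how many works author a used keyword k
Co = WA.T @ WA          # column projection col(WA): co-authorship network
biCo = Cite @ Cite.T    # row projection of Cite: bibliographic coupling
ACiA = WA.T @ Cite @ WA # citations between authors

print(AK.toarray()); print(Co.toarray())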
5 Report ingredients

Most of the ingredients of basic reports are counters, sorted lists, and (weighted) degrees and their distributions, obtained from an adequate network. Sometimes also time is considered, producing time series.

5.1 Statistics

Because the analyzed networks are often large, a complete presentation is not an option. To describe them, we use different statistical descriptors:
• sizes of sets (number of nodes, number of links) and structural network properties (number of components, size of the largest component, etc.);
• top units – ordered lists of units with the largest values of a selected property (degree, weighted degree, link weight, etc.);
• distribution of a selected property;
• time series describing temporal changes of selected properties;
• scatter plots showing a possible relationship between two selected properties.

Often, bibliometric properties of units follow laws such as the Zipf (or power) law, Bradford law, Lotka law, the lognormal distribution, the Hirsch index, etc.

5.2 Network analysis

Derived networks are weighted. To get readable results of reasonable size, we usually search for important subnetworks, often a kind of skeleton – from a given network, less important elements are removed. There are different types of skeletons (spanning forest, k closest neighbors, cuts, cores, islands, etc. [6]).

A traditional graph-based visualization is used if the obtained result network is not dense. For denser networks, the matrix display is much more readable. In a matrix display, the permutation of nodes (usually obtained by clustering) can create patterns that reveal the network's internal structure. Figure 3 presents a matrix display of Balassa co-authorship indices between European countries in 2022 (yellow cell – no link; red/blue cell – above/below expectation) [17].

Figure 3: Balassa EU co-authorship for the year 2022. [matrix display omitted; rows/columns are European country codes, ordered by Ward clustering]

5.3 Special algorithms

Some properties can require special computational procedures and direct access to the bibliographic data. In such cases, open access to the bibliographic database is of crucial importance.

5.4 Reports

The results of analyses can be combined and presented to users in different forms:
• booklet report (in PDF);
• (service-generated) web pages;
• dashboards;
• dataset (JSON, CSV, etc.).

6 Conclusions

We have presented an approach to support higher-order bibliographic services based on networks. Open access to high-quality bibliographic data is crucial for the faster development of such services. The new bibliographic database OpenAlex seems to be a step in the right direction. It needs the support of science policy and also of individual scientists (checking the correctness of their data).

Acknowledgements

The computational work reported in this paper was performed using a collection of R functions OpenAlex2Pajek and the program Pajek for the analysis of large networks. Code, data, and figures are available on Github/Bavla/OpenAlex.

VB's work is partly supported by the Slovenian Research Agency ARIS (research program P1-0294, research program CogniCom (0013103) at the University of Primorska, and research projects J1-2481, J5-2557, and J5-4596), and was prepared within the framework of the COST action CA21163 (HiTEc). JP's work is partly supported by ARIS (research program P5-0361 and research projects J1-2551 and J5-4596). TP's work is partly supported by ARIS (research program P1-0294 and research projects N1-0140, J1-2481, J5-4596).

References

[1] Massimo Aria, Trang Le, Corrado Cuccurullo, Alessandra Belfiore, and June Choe. 2024. openalexR: An R-tool for collecting bibliometric data from OpenAlex. The R Journal, 15(4), 167–180.
[2] Vladimir Batagelj. 2003. Efficient algorithms for citation network analysis. arXiv preprint cs/0309023.
[3] Vladimir Batagelj. 2020. On fractional approach to analysis of linked networks. Scientometrics, 123(2), 621–633. doi: 10.1007/s11192-020-03383-y.
[4] Vladimir Batagelj. 2024. OpenAlex2Pajek. Version 4, June 18, 2024. https://github.com/bavla/OpenAlex/tree/main/code.
[5] Vladimir Batagelj and Monika Cerinšek. 2013. On bibliographic networks. Scientometrics, 96(3), 845–864. doi: 10.1007/s11192-012-0940-1.
[6] Vladimir Batagelj, Patrick Doreian, Anuška Ferligoj, and Nataša Kejžar. 2014. Understanding Large Temporal Networks and Spatial Networks: Exploration, Pattern Searching, Visualization and Network Evolution. Wiley Series in Computational and Quantitative Social Science. Wiley, Chichester. ISBN: 978-1-118-91537-0, 978-0-470-71452-2. doi: 10.1002/9781118915370.
[7] Vladimir Batagelj, Anuška Ferligoj, and Flaminio Squazzoni. 2017. The emergence of a field: A network analysis of research on peer review. Scientometrics, 113(1), 503–532. doi: 10.1007/s11192-017-2522-8.
[8] Vladimir Batagelj and Daria Maltseva. 2020. Temporal bibliographic networks. Journal of Informetrics, 14(1), Article 101006. doi: 10.1016/j.joi.2020.101006.
[9] Dalmeet Singh Chawla. 2022. Massive open index of scholarly papers launches. Nature.
[10] DBLP – computer science bibliography. 2024. https://dblp.org/.
[11] Lorena Delgado-Quirós and José Luis Ortega. 2024. Completeness degree of publication metadata in eight free-access scholarly databases. Quantitative Science Studies, 5(1), 31–49.
[12] ScholarGPS. 2024. https://scholargps.com/.
[13] Chenyue Jiao, Kai Li, and Zhichao Fang. 2023. How are exclusively data journals indexed in major scholarly databases? An examination of four databases. Scientific Data, 10(1), 737.
[14] Bill Jones. 2002. Computational geometry database. ftp://ftp.cs.usask.ca/pub/geometry/.
[15] Daria Maltseva and Vladimir Batagelj. 2019. Social network analysis as a field of invasions: Bibliographic approach to study SNA development. Scientometrics, 121(2), 1085–1128. doi: 10.1007/s11192-019-03193-x.
[16] Peter V. Marsden. 1990. Network data and measurement. Annual Review of Sociology, 16, 435–463. doi: 10.1146/annurev.so.16.080190.002251.
[17] Nataliya Matveeva, Vladimir Batagelj, and Anuška Ferligoj. 2023. Scientific collaboration of post-Soviet countries: The effects of different network normalizations. Scientometrics, 128(8), 4219–4242.
[18] Jason Priem, Heather Piwowar, and Richard Orr. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833.
[19] Google Scholar. 2024. https://scholar.google.com/.
[20] Bert TePaske-King and Norman Richert. 2001. The identification of authors in the Mathematical Reviews database. Issues in Science and Technology Librarianship, 31. doi: 10.5062/f4kh0k9m.
[21] Wikipedia. 2024. API. August 22, 2024. https://en.wikipedia.org/wiki/API.


Are papers all that counts? A bibliometric analysis of the Slovenian scientific community

Aymeric Dupuis (Jožef Stefan Institute, Ljubljana, Slovenia; aymeric.dupuis@etu.univ-nantes.fr), Sašo Džeroski (Jožef Stefan Institute, Ljubljana, Slovenia; saso.dzeroski@ijs.si), Boshko Koloski (Jožef Stefan Institute, Ljubljana, Slovenia; boshko.koloski@ijs.si), Matej Martinc (Jožef Stefan Institute, Ljubljana, Slovenia; matej.martinc@ijs.si)

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.11

Abstract

We conduct a bibliometric analysis of Slovenian science by scraping the data from the Slovenian Current Research Information System (SICRIS) and using it to build a knowledge graph representing a network of all Slovenian scientific fields and a large majority of Slovenian researchers. By analyzing this network using different graph measures, we obtain valuable insights into the connections between different scientific fields and researchers in Slovenian science. Additionally, we show the importance of graph measures as measures of scientific excellence, since they capture very different aspects of scientific success than the traditional citation metrics.

Keywords: bibliometrics, Slovenian scientific community, knowledge graphs

1 Introduction

With the growth and diversification of the scientific enterprise, obtaining empirical evidence on the research process is crucial for enhancing its efficiency and reliability. Meta-research and bibliometrics are developing scientific disciplines seeking to analyse, evaluate and refine research practices, and several studies have focused on the analysis of the global scientific endeavour, e.g., identifying the most prominent scientists and fields [7]. These studies also address the problem of how to properly rank scientific excellence and scientific outputs in general, warning that one should not rely on just a few metrics to obtain a comprehensive picture of the actual impact a specific scientist has [8].
Until now, very few studies have tackled the analysis of scientific ventures at the national level, and to our knowledge, there has been no study covering the Slovenian scientific landscape specifically. This kind of research is nevertheless important and could potentially influence policies that would improve scientific production and enable effective distribution of research funds and resources.

In this study, we try to address the identified research gaps by 1.) drawing a map of Slovenian scientific research that would enable proper decision making and policy formulation, and 2.) proposing new metrics of scientific excellence that would allow us to obtain a more complete view of the impact a scientist or a discipline as a whole has. More specifically, our contributions are the following:

• Using the collected data about Slovenian scientists and their projects, covering different scientific fields and a large majority of researchers working in Slovenian science, we conduct a graph analysis of connections between different fields and researchers. By drawing a comprehensive map of connections between actors and fields, we identify the most important researchers and scientific fields that connect others and play a vital role in the Slovenian scientific ecosystem.
• We created a new ranked list of Slovenian scientists according to graph-based metrics, which were not available in any of the previous analyses or databases. We argue that these metrics measure the importance of the role that a specific scientist has in a research community, i.e., their influence, which allows them to act as a bridge or a hub connecting scientists from different fields.

2 Related work

Studies in bibliometrics (see [4] for a comprehensive survey of techniques used for measuring scientific excellence) have recently gained traction in parallel with the success of the scientific enterprise, which has grown in both size and diversity, and with the availability of data. According to Ioannidis et al. [7], research on research is becoming important due to the mounting evidence suggesting an alarming drop in the reproducibility of research findings, the growing inefficiency of the scientific process, and the fact that the number of false positives in the literature is exceedingly high. To address these problems, they propose meta-research divided into five main categories that should be studied: methods, reporting, reproducibility, evaluation, and incentives. Studying these five areas would correspondingly allow for five distinct insights into how to perform, communicate, verify, evaluate, and reward research.

Recently, several studies have also tackled the problem of how to properly rank scientists and scientific outputs in general. For example, Ioannidis et al. [8] addressed the increasing prevalence of multiauthorship observed in several fields and how this phenomenon affects the informativeness of citation metrics. They also explored how sensitive the indicators are to self-citation and the alphabetic ordering of authors. They concluded that multiple indicators should be used for ranking, as a composite of different metrics gives a more comprehensive picture of the actual impact that a specific scientist has. They also acknowledged that no single or composite citation indicator can be expected to select all the best scientists.
Several studies have employed graph-based metrics to enrich bibliometric analysis [4, 1]. Network metrics such as degree centrality, betweenness centrality, eigenvector centrality, closeness centrality, and PageRank were used to pinpoint the relative importance of research constituents (i.e., researchers and institutions), which may not necessarily be reflected just through publications. In the large majority of cases, these metrics were calculated on co-authorship graphs.

Studies covering the Slovenian scientific environment are very scarce. In fact, we are aware of just one, the study by [2], which claims that research performance is highly dependent on the conditions of (national) research environments. The authors focus on analyzing research activity in six eastern European countries, namely Croatia, Estonia, Hungary, Latvia, Lithuania, and Slovenia, and try to determine and compare the effectiveness of research in a specific country by obtaining the number of articles belonging to the most cited 10% and the most cited 1% of articles in the corresponding subject area and publication year for each country. Their empirical analysis addresses three levels: cross-country, cross-institution, and cross-researcher comparison. The study concludes that Hungary is the country with the highest output, followed by Croatia and then Slovenia, when it comes to the number of influential articles published.

3 Methodology

In this section, we describe our methodology, namely 1.) how we gather the data and 2.) how we analyze these data to obtain a map of the Slovenian scientific community.

3.1 Data Retrieval

Data were retrieved from the Slovenian Current Research Information System (SICRIS) website,(1) which lists more than 35,000 researchers working in Slovenian research institutions. Data collection from the SICRIS website proved challenging, as information about a specific researcher can only be obtained by scraping his/her web page on SICRIS. This required finding a solution to quickly retrieve data from more than 35,000 different pages; to achieve this, we used the Python Asyncio(2) and BeautifulSoup(3) libraries, which allow asynchronous connection to several dozen pages simultaneously and extraction of the required data.

Since the script sometimes took several seconds to connect to a specific page, which could quickly accumulate into considerable overall slowdowns, we optimized the procedure and identified potential bottlenecks. Our solution was to implement a strategy that involved canceling the connection and adding the URL to a list whenever a page failed to connect within a 0.5-second time frame. This timeframe was chosen after several trials and was found to be the best compromise. Once all pages had been visited, we repeatedly tried to reconnect to the URLs on this list until it was empty. This change significantly reduced the time required to retrieve all our data. Once all the data was retrieved, we used the Pandas(4) library for data manipulation, which allowed us to export the results into Excel spreadsheets, appropriate for further processing. A sketch of this retrieval strategy is given below.

From SICRIS, we extracted the research areas for each scientist and various bibliometric indicators of their impact, namely A'', A', and A1/2, citation metrics based on a quantitative assessment of publications in exceptional, high-quality, and important venues, respectively. We also extracted the A1 metric, which represents a weighted sum of these three metrics; the CI10 metric, measuring the number of pure citations of scientific work in the last 10 years; the CImax metric, measuring the number of citations of the most cited work; and the h10 metric, representing the h-index in the last ten years. Furthermore, we extracted the SICRIS points, a conglomerate metric combining several distinct metrics mentioned above, and the A3 metric, which measures the amount of funds a specific researcher received for their research activity outside of the Slovenian National Research Agency (ARIS).

Finally, the SICRIS database also contains information on projects financed by the Slovenian national research agency in which a specific researcher participated. Scraping this information provided us with an important insight into collaborations between different scientists and fields, allowing us to build collaboration graphs, calculate several graph-based ranking criteria, and draw a map of the Slovenian scientific community.

Footnotes:
(1) https://cris.cobiss.net/ecris/si/en
(2) https://docs.python.org/3/library/asyncio.html
(3) https://www.crummy.com/software/BeautifulSoup
(4) https://pandas.pydata.org/
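A rough Python sketch of the retrieval strategy described above. The paper names only Asyncio and BeautifulSoup; the HTTP client (aiohttp), the helper names, and the example URL are assumptions.

import asyncio
import aiohttp  # assumption: the HTTP client is not named in the paper
from bs4 import BeautifulSoup

TIMEOUT = 0.5  # the 0.5-second cutoff described above

async def fetch(session, url, retry_queue):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=TIMEOUT)) as resp:
            html = await resp.text()
            return url, BeautifulSoup(html, "html.parser")
    except asyncio.TimeoutError:
        retry_queue.append(url)  # postpone slow pages instead of waiting
        return url, None

async def scrape(urls):
    results, retry_queue = {}, list(urls)
    async with aiohttp.ClientSession() as session:
        # Like the paper's procedure, retry postponed URLs until none remain
        # (assumes every page eventually responds within the cutoff).
        while retry_queue:
            batch, retry_queue = retry_queue, []
            pages = await asyncio.gather(*(fetch(session, u, retry_queue) for u in batch))
            results.update({u: soup for u, soup in pages if soup is not None})
    return results

# results = asyncio.run(scrape(researcher_urls))  # researcher_urls: list of SICRIS pages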
3.2 Methods

Once the data were obtained, we conducted two distinct analysis steps, namely 1.) graph construction and analysis, and 2.) correlation analysis.

3.2.1 Graph construction and analysis. To construct the necessary graphs, we used the Python NetworkX library [6].(5) Using the data from SICRIS, which contain information about project collaboration, we created an undirected graph as follows: all researchers who participated in at least one project are represented by a node, and the nodes of researchers who worked together on a project are connected by weighted edges, in which the weights represent the number of shared projects. By removing the isolated nodes, we ended up with a graph with a total of 20,012 nodes and 618,871 edges.

In the next step, we apply several graph statistics and measures in order to obtain several node rankings, each of them measuring a different aspect of the importance a specific node (i.e., a researcher) has in the graph. More specifically, we calculate the PageRank (PR), Betweenness centrality (BC), and Eigenvector centrality (EC) measures.

In the context of our graph, the PageRank [3] algorithm is applied to evaluate the influence of researchers within the collaboration network. Researchers who are strongly connected to other researchers who themselves have many connections (i.e., the so-called hubs in the graph) will have a higher PR score, reflecting their importance and influence in the Slovenian research community. The Betweenness centrality [5] measure, on the other hand, evaluates the role of each researcher as an intermediary or a bridge between other researchers. This measure is based on the idea that researchers who lie on many collaboration paths between other researchers are considered central and influential in the network. In our context, it helps to better understand the structure of the collaboration network among researchers: researchers with high BC are those who play a crucial role in creating links between different subgroups of researchers and in interdisciplinary connections. In practical terms, BC evaluates the number of times a researcher is traversed by the shortest paths connecting other researchers in the network. Thus, researchers who are frequently used as pathways for collaboration among their peers obtain higher BC scores.

Another graph centrality measure that we applied to the created graph is Eigenvector centrality [9]. This measure evaluates the influence of a researcher taking into account both the quality and the quantity of connections. EC assigns more weight to connections that involve influential researchers. Thus, a researcher connected to influential researchers will be assigned a high score, reflecting potentially greater influence within the network. This measure helps to detect researchers who, even with fewer direct connections, occupy strategic positions in the collaboration network. While this may seem similar to the PR algorithm, there are some differences: unlike PR, which primarily focuses on the popularity of links, Eigenvector centrality also takes into account the quality of connections. This means that even if a researcher does not have a large number of direct connections, their Eigenvector centrality score can be high if they are connected to influential researchers. In summary, while these measures all aim to evaluate the influence of researchers in a network, they do so through slightly different approaches, thus offering complementary perspectives for analyzing the structure and importance of actors within the collaboration network. The construction and the centrality computations are sketched below.

The second important area of focus in our research is the collaboration between different fields. To build a graph representing interdisciplinary collaboration between fields, we grouped all researchers from the same field into a single node representing the entire field, i.e., we obtain a node for each scientific field found on SICRIS. Similar to the previous graph, edges and their weights represent collaborations on a project between researchers in the linked fields.

Footnote:
(5) https://networkx.org/
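A minimal NetworkX sketch of the construction and rankings just described (NetworkX is confirmed by the paper; the toy edge list, the helper, and the exact weighting settings are assumptions — in particular, the paper does not say how edge weights enter each centrality, and NetworkX interprets the weight attribute as a distance in shortest-path-based measures such as betweenness):

import networkx as nx

collaborations = [  # (researcher_a, researcher_b, shared_projects) - toy data
    ("R1", "R2", 3), ("R2", "R3", 1), ("R1", "R3", 2), ("R3", "R4", 5),
]
G = nx.Graph()
G.add_weighted_edges_from(collaborations)
G.remove_nodes_from(list(nx.isolates(G)))  # drop researchers without collaborations

pr = nx.pagerank(G, weight="weight")
bc = nx.betweenness_centrality(G, weight="weight")
ec = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

def ranks(scores):
    """Convert scores to ranks (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {node: i + 1 for i, node in enumerate(ordered)}

# Average of the three ranks, as used for the ordering in Table 1.
pr_r, bc_r, ec_r = ranks(pr), ranks(bc), ranks(ec)
avg_rank = {n: (pr_r[n] + bc_r[n] + ec_r[n]) / 3 for n in G}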
Another graph centrality measure that we applied to the created graph is the Eigenvector centrality [9]. This measure evaluates the influence of a researcher taking into account both the quality and the quantity of connections: EC assigns more weight to connections that involve influential researchers. Thus, a researcher connected to influential researchers will be assigned a high score, reflecting potentially greater influence within the network. This measure helps to detect researchers who, even with fewer direct connections, occupy strategic positions in the collaboration network. While this may seem similar to the PR algorithm, there are some differences. Unlike PR, which primarily focuses on the popularity of links, Eigenvector centrality also takes into account the quality of connections: even if a researcher does not have a large number of direct connections, their Eigenvector centrality score can be high if they are connected to influential researchers. In summary, while both measures aim to evaluate the influence of researchers in a network, they do so through slightly different approaches, thus offering complementary perspectives for analyzing the structure and importance of actors within the collaboration network.

The second important area of focus in our research is the collaboration between different fields. To build a graph representing interdisciplinary collaboration between fields, we grouped all researchers from the same field into a single node representing the entire field, i.e., we obtained one node for each scientific field found on SICRIS. As in the previous graph, edges and their weights represent project collaborations between researchers in the linked fields.

3.2.2 Correlation analysis. In order to better understand the metrics from SICRIS and to evaluate the relevance of our scores, we explored the correlations across all our data. This analysis has two main purposes. First, we aim to test hypothesis 1: that the new graph rankings we presented measure different aspects of scientific excellence than the more established measures, based on numbers of citations or publications, available on the SICRIS web page. This hypothesis would be deemed correct if the one-on-one correlation scores between the newly proposed graph measures and the other measures were low, and incorrect if the correlations were high.

Additionally, we wish to explore the correlations among the established measures available on the SICRIS web page. More specifically, we wish to test hypothesis 2: that these measures are strongly correlated, which would indicate that they essentially all measure a very similar aspect of scientific excellence, which is problematic. In order to obtain one-on-one correlations between all measures, we calculate the Spearman correlation coefficient among all of them and then display it through a heatmap.
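A compact sketch of this step, assuming the metrics have been collected into a pandas DataFrame with one row per researcher (the column names are illustrative):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# metrics: one row per researcher; SICRIS indicators plus the graph-based scores
cols = ["A''", "A'", "A1/2", "A1", "SICRIS", "CI10", "CImax", "h10", "PR", "BC", "EC"]
corr = metrics[cols].corr(method="spearman")   # rank-based, robust to skewed scales

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Spearman correlation among metrics")
plt.tight_layout()
plt.show()
```

Spearman correlation compares ranks rather than raw values, which suits a comparison of heavy-tailed bibliometric scores against rank-based graph measures.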
4 Results

In Table 1, we present some of the results of the graph analysis conducted on the graph of nodes representing researchers, connected by edges representing project collaborations. More specifically, we present the 10 best ranked researchers in the SICRIS dataset according to the average of the ranks of the three newly proposed graph-based measures, their declared scientific fields, and their ranking (i.e., lower is better) according to the SICRIS points, BC, EC and PR measures.

Note that while the table does contain some highly ranked researchers according to the SICRIS points (e.g., Dr. Sašo Džeroski is ranked 33rd out of roughly 20K researchers according to this criterion), several researchers in the table are ranked relatively low according to SICRIS points (e.g., the best ranked researcher according to our three novel measures, Dr. Branimir Leskošek, is ranked 5731st according to the SICRIS points). This finding supports hypothesis 1, that the proposed new measures capture different aspects of scientific excellence than the more established citation measures. Another important observation is that 7 out of the 10 best ranked scientists appear to be active in two fields. This might suggest that they are (or have been) involved in several interdisciplinary projects, which could have a positive influence on the newly proposed graph-based metrics.

In Figure 1, we present the heatmap of the correlations between the different metrics extracted from the SICRIS website and the newly proposed graph-based metrics. We observe a strong correlation between PR and BC, 0.7, which might suggest that researchers who collaborate with a wide range of colleagues from different fields are more likely to work with the most important ones.

Figure 1: Heatmap of the Spearman correlation among metrics.

We also observe very strong correlations in the top left corner of the heatmap. While a strong correlation was expected, as A", A', A1/2 and A1 are all scores based on the number of publications (in venues of different quality), the almost perfect correlation between the SICRIS points and A1 (which suggests they measure exactly the same aspect of scientific impact) is surprising. This finding supports hypothesis 2, that the current SICRIS measures all capture a very similar aspect of scientific excellence. On the other hand, there is no strong correlation between any of the newly proposed graph-based metrics and the metrics extracted from the SICRIS website.

In Table 2, we present the results of our study of interdisciplinary collaboration between different scientific fields. The graph metrics were obtained from a graph of nodes representing fields and edges representing interdisciplinary project collaborations. Note that the field of Computer science and informatics ranks first according to all the criteria. On the other hand, most interdisciplinary collaborations are conducted by researchers from the field of Chemistry, which ranked third according to the average (AVG) of the ranks of the three graph-based metrics, PR, BC and EC.
Table 1: The 10 best ranked researchers in the SICRIS dataset according to the average of the ranks of the three newly proposed measures, BC, EC and PR. We do not show metric scores, but ranks according to scores (i.e., a lower value is better).

ID    | Researcher             | Field 1                              | Field 2                           | SICRIS points | BC | EC  | PR  | AVG
15355 | PhD Branimir Leskošek  | Public health (occupational safety)  | Computer science and informatics  | 5731 | 8  | 4   | 31  | 14
06013 | PhD Damjana Rozman     | Biochemistry and molecular biology   | Metabolic and hormonal disorders  | 704  | 21 | 2   | 33  | 18
11279 | PhD Nives Ogrinc       | Control and care of the environment  | Animal production                 | 182  | 7  | 50  | 3   | 20
27733 | PhD Tina Kosjek        | Control and care of the environment  | Pharmacy                          | 809  | 2  | 73  | 9   | 28
22459 | PhD Tadeja Režen       | Neurobiology                         | Microbiology and immunology       | 1837 | 61 | 3   | 49  | 37
22621 | PhD Polonca Ferk       | Metabolic and hormonal disorders     | Pharmacy                          | 5059 | 13 | 8   | 103 | 41
12688 | PhD Kristina Gruden    | Biotechnology                        | /                                 | 219  | 44 | 139 | 6   | 63
08800 | PhD Gregor Serša       | Oncology                             | /                                 | 71   | 3  | 185 | 1   | 63
12315 | PhD Ester Heath        | Control and care of the environment  | Chemistry                         | 208  | 62 | 115 | 23  | 66
11130 | PhD Sašo Džeroski      | Computer science and informatics     | /                                 | 33   | 1  | 195 | 20  | 72

Table 2: Scientific fields as defined in the SICRIS database, sorted by the average (AVG) of the ranks (lower is better) of the three graph-based metrics, PR, EC and BC.

Rank | Field | Collaborations | PR | EC | BC | AVG
1  | Computer science and informatics | 81248 | 1 | 1 | 1 | 1.0
2  | Materials science and technology | 88934 | 4 | 3 | 4 | 3.67
3  | Chemistry | 101139 | 2 | 2 | 12 | 5.33
4  | Control and care of the environment | 52648 | 5 | 8 | 9 | 7.33
5  | Physics | 50010 | 3 | 9 | 14 | 8.67
6  | Plant production | 74535 | 6 | 6 | 16 | 9.33
7  | Systems and cybernetics | 45584 | 7 | 10 | 23 | 13.33
8  | Biology | 58879 | 12 | 7 | 21 | 13.33
9  | Civil engineering | 36466 | 22 | 13 | 6 | 13.67
10 | Biochemistry and molecular biology | 79725 | 11 | 5 | 25 | 13.67
11 | Neurobiology | 45680 | 14 | 12 | 19 | 15.0
12 | Biotechnology | 87261 | 8 | 4 | 33 | 15.0
13 | Interdisciplinary research | 22946 | 9 | 33 | 5 | 15.67
14 | Public health (occupational safety) | 30400 | 10 | 25 | 13 | 16.0
15 | Educational studies | 23518 | 33 | 15 | 3 | 17.0
16 | Mathematics | 30680 | 17 | 20 | 20 | 19.0
17 | Manufacturing technologies and systems | 38874 | 18 | 14 | 26 | 19.33
18 | Forestry, wood and paper technology | 30620 | 19 | 28 | 15 | 20.67
19 | Geography | 18555 | 39 | 23 | 2 | 21.33
20 | Economics | 26891 | 31 | 16 | 18 | 21.67
21 | Microbiology and immunology | 54175 | 16 | 11 | 42 | 23.0
22 | Sociology | 19922 | 44 | 17 | 10 | 23.67
23 | Pharmacy | 41125 | 15 | 18 | 41 | 24.67
24 | Linguistics | 18176 | 49 | 19 | 7 | 25.0
25 | Chemical engineering | 33753 | 13 | 27 | 38 | 26.0
26 | Energy engineering | 32762 | 23 | 21 | 40 | 28.0
27 | Computer intensive methods and applications | 26942 | 20 | 32 | 34 | 28.67
28 | Mechanics | 26444 | 24 | 31 | 36 | 30.33
29 | Oncology | 37101 | 21 | 24 | 46 | 30.33
30 | Geology | 26961 | 37 | 26 | 28 | 30.33
31 | Electronic components and technologies | 28858 | 26 | 30 | 37 | 31.0
32 | Historiography | 12390 | 56 | 22 | 17 | 31.67
33 | Urbanism | 8669 | 50 | 40 | 8 | 32.67
34 | Mechanical design | 22352 | 25 | 38 | 35 | 32.67
35 | Administrative and organisational sciences | 18563 | 38 | 35 | 30 | 34.33
36 | Textile and leather | 21080 | 27 | 41 | 39 | 35.67
37 | Animal production | 34982 | 29 | 29 | 50 | 36.0
38 | Political science | 13598 | 46 | 37 | 27 | 36.67
39 | Anthropology | 9860 | 53 | 36 | 24 | 37.67
40 | Ethnology | 6698 | 65 | 39 | 11 | 38.33
41 | Cardiovascular system | 20793 | 28 | 43 | 45 | 38.67
42 | Telecommunications | 14068 | 41 | 45 | 31 | 39.0
43 | Veterinarian medicine | 30954 | 32 | 34 | 60 | 42.0
44 | Metabolic and hormonal disorders | 18518 | 30 | 46 | 55 | 43.67
45 | Metrology | 12978 | 34 | 52 | 47 | 44.33
46 | Law | 7480 | 54 | 49 | 32 | 45.0
47 | Psychology | 8583 | 51 | 55 | 29 | 45.0
48 | Human reproduction | 21535 | 35 | 42 | 58 | 45.0
49 | Process engineering | 15340 | 36 | 47 | 53 | 45.33
50 | Hydrology | 12396 | 40 | 53 | 44 | 45.67
51 | Architecture and Design | 4242 | 58 | 57 | 22 | 45.67
52 | Philosophy | 7380 | 57 | 44 | 43 | 48.0
53 | Sport | 10013 | 43 | 54 | 49 | 48.67
54 | Geodesy | 7760 | 45 | 56 | 51 | 50.67
55 | Electric devices | 13633 | 42 | 51 | 59 | 50.67
56 | Literary sciences | 6399 | 61 | 50 | 48 | 53.0
57 | Traffic systems | 4448 | 48 | 60 | 52 | 53.33
58 | Culturology | 7240 | 60 | 48 | 54 | 54.0
59 | Technology driven physics | 6876 | 47 | 59 | 64 | 56.67
60 | Communications technology | 4388 | 52 | 63 | 56 | 57.0
61 | Psychiatry | 2481 | 55 | 65 | 61 | 60.33
62 | Criminology and social work | 2324 | 66 | 62 | 62 | 63.33
63 | Mining and geotechnology | 2342 | 59 | 68 | 63 | 63.33
64 | Theology | 2941 | 67 | 58 | 66 | 63.67
65 | Ethnic studies | 2398 | 63 | 61 | 67 | 63.67
66 | Art history | 1408 | 70 | 64 | 57 | 63.67
67 | Archaeology | 1177 | 68 | 66 | 65 | 66.33
68 | Information science and librarianship | 792 | 62 | 70 | 70 | 67.33
69 | Stomatology | 391 | 64 | 71 | 68 | 67.67
70 | Landscape design | 1046 | 69 | 67 | 71 | 69.0
71 | Musicology | 748 | 71 | 69 | 69 | 69.67

5 Conclusions

The graph-based bibliometric analysis of the Slovenian scientific community shows that current citation-based metrics do not cover some aspects of scientific excellence, such as a researcher's role in connecting a wider research community. Our correlation analysis indicates that the existing measures of scientific excellence extracted from the SICRIS web page are strongly correlated. In the future, we plan to expand this analysis to also measure the impact of Slovenian scientists on the global scientific enterprise, and to conduct additional research to find patterns across disciplines or institutions.

6 Acknowledgments

The authors acknowledge the financial support of the Slovenian Research Agency for the research core funding of the programme Knowledge Technologies (No. P2-0103).

References

[1] Njål Andersen. 2021. Mapping the expatriate literature: a bibliometric review of the field from 1998 to 2017 and identification of current research fronts. The International Journal of Human Resource Management, 32, 22, 4687–4724.
[2] Lutz Bornmann. [n. d.] Research excellence in eastern Europe: a bibliometric study focusing on Croatia, Estonia, Hungary, Latvia, Lithuania, and Slovenia.
[3] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30, 1-7, 107–117.
[4] Naveen Donthu, Satish Kumar, Debmalya Mukherjee, Nitesh Pandey, and Weng Marc Lim. 2021. How to conduct a bibliometric analysis: an overview and guidelines. Journal of Business Research, 133, 285–296.
[5] Linton C. Freeman. 1977. A set of measures of centrality based on betweenness. Sociometry, 40, 1, 35–41. Retrieved June 27, 2024 from http://www.jstor.org/stable/3033543.
[6] Aric Hagberg, Pieter J. Swart, and Daniel A. Schult. 2008. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008). Los Alamos National Laboratory (LANL), Los Alamos, NM (United States), 11–15.
[7] John P. A. Ioannidis, Daniele Fanelli, Debbie Drake Dunne, and Steven N. Goodman. 2015. Meta-research: evaluation and improvement of research methods and practices. PLoS Biology, 13, 10, e1002264.
[8] John P. A. Ioannidis, Richard Klavans, and Kevin W. Boyack. 2016. Multiple citation indicators and their composite across scientific disciplines. PLoS Biology, 14, 7, e1002501.
[9] Paul Turán, editor. 1969. Publications of Edmund Landau. In Number Theory and Analysis: A Collection of Papers in Honor of Edmund Landau (1877–1938). Springer US, Boston, MA, 335–355. ISBN: 978-1-4615-4819-5. DOI: 10.1007/978-1-4615-4819-5_23.
Empowering Open Education Methodologies with AI-based Strategies for the Customization of Education

Tel Amiel, Universidade de Brasilia, Brasilia, Brazil, amiel@unb.br
Antônio J. Moraes Neto, Instituto Federal de Brasilia, Brasilia, Brazil, antonio.neto@ifb.edu.br
Joao Pita Costa, IRCAI, Jozef Stefan Institute, Ljubljana, Slovenia, joao.pitacosta@quintelligence.com
Mitja Jermol, Anja Poljanar, IRCAI, Jozef Stefan Institute, Ljubljana, Slovenia

ABSTRACT

The amount and heterogeneity of data generated in the context of education, together with the rapid progress of scientific research and technological development, have created vast amounts of data, much of it open data, but also significant challenges in gathering, filtering and making sense of this information. In this paper, we discuss the research outcomes of complementary Artificial Intelligence (AI)-based strategies for monitoring and enhancing Open Education, mining student–educator interaction in online forums, and empowering the mentorship of educators. Firstly, the initial results obtained from the construction of an OER Observatory focusing on Open Educational Resources (OERs) contribute to implementing the 2019 UNESCO OER Recommendation and advancing the education-focused Sustainable Development Goal (SDG) 4. Acting on five verticals, it enriches and processes multilingual data, displays meaningful information on a dashboard focused on AI and OERs, and serves as a collaboration platform built on existing partnerships within the International Research Centre on Artificial Intelligence under the auspices of UNESCO (IRCAI), the UNESCO Chair in Distance Education and the UNESCO Chair on Open Technologies for Open Educational Resources and Open Learning, mobilizing research collaboration on key AI research challenges related to generating knowledge about OER. Secondly, we discuss the recent development of an Educational Recommender System (ERS) that integrates Conversational Analysis (CA) to assess and enhance collaborative learning (CL) in Virtual Learning Environments (VLEs). This novel system was designed to identify collaboration among students and provide tailored recommendations to promote participation and interaction within discussion forums. Finally, we discuss the development and implementation of AI and OERs in alignment with the SDGs, addressing topics of significant social impact through an international online mentoring initiative.

KEYWORDS

Open Education, Machine Learning, Educational Recommender System, Conversational Analysis, Virtual Learning Environment

https://doi.org/10.70314/is.2024.sikdd.16

1 Introduction

The centralizing piece of the discussions in this paper is an AI-based observatory that allows the exploration of OER-related topics, particularly those mentioned in the OER Recommendation: promoting OER and acknowledging its contribution to advancing quality education, while providing information on advances focused on the equity and inclusion qualities of OER, as well as on research, activities, projects and news related to OER development, including new initiatives and projects, and promoting public infrastructures for education. The Observatory builds on the content made available in UNESCO's OER Dynamic Coalition Portal (oerdynamiccoalition.org), providing the user with access to any of four proposed views: media, science, policies and training. In each of the views, the user can access interactive data visualisations summarising the sourced data, configured to observe the UNESCO OER recommendations. As it is fully based on open data, it allows the user to click on the collected and summarized resources and be taken directly to the source in media, journal, policy or training.

Embracing the intersection of AI and education, which has led to the development of various tools that personalize and enhance learning experiences, we discuss complementary research based on CA, closely aligned with the objective of empowering community interaction at the SDG 4 (Education) Observatory [6]. AI applications in education often focus on providing adaptive feedback, facilitating personalized learning paths, and analyzing student data to improve outcomes. CA is a method that examines the understanding generated through interactions, offering a framework for analyzing how students collaboratively build knowledge. By combining CA with AI, this research aims to develop a system that not only assesses but also actively promotes collaboration in VLEs [10]. The ERS discussed later in this paper is an example of how IRCAI's SDG4 Observatory gains a complex capability for engaging with communities such as those in education. The discussion then expands towards the appropriate mentorship of the professionals who will change the domain's landscape. While initiatives in this context are diverse and dispersed, the authors are not aware of existing similar approaches [5].
2 AI-based strategies for the moderation of online forums on education

Entering the age of Big Data, AI is feeding the data-driven digital transformation across industries, including education. CL emphasizes the importance of group tasks and joint participation, wherein students learn by actively engaging in dialogues that facilitate the sharing of ideas and information. Even in remote settings, CL enables students to learn together through virtual platforms. AI offers new opportunities as a pedagogical tool, providing adaptive and personalized environments that can support CL. This research explores the integration of AI into educational contexts, particularly through the development of an Educational Recommender System (ERS) that uses CA to identify and promote collaboration among students in VLEs [1] (see Figure 1).

Figure 1: The ERS forum analysis screen.

The research methodology is divided into three key stages: Conversational Analysis, applying CA to monitor discussion forums within the Moodle platform, focusing on interactions among students and identifying collaborative behaviors and interaction patterns; Collaboration Assessment, evaluating the level of collaboration among students based on the identified interaction patterns; and Development of the ERS, building a mechanism that provides recommendations to students, teachers, and tutors. These recommendations are aimed at enhancing collaboration and are based on the analysis of forum interactions [15]. The initial dataset comprises 20,976 messages from Moodle discussion forums, 15,703 of them posted by students from a vocational education school. The analysis focuses on these messages to develop and validate the ERS's recommendations. The quality of collaboration is measured through various indicators, which are extracted during the different stages of CA. Preprocessing applies Natural Language Processing (NLP) techniques to ensure the accuracy of the analysis, preparing data for the Resource Processing stage, which uses Social Network Analysis (SNA) to characterize the social dynamics and interactions among students. Moreover, Message Attribute Identification is the CA stage that identifies characteristics of students' messages, specifically their questions; Topic Modeling is then employed to identify the key terms discussed in the forums [12], using the Tomotopy library (bab2min.github.io/tomotopy).
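As an illustration of the topic-modeling stage, a minimal Tomotopy LDA run over tokenised forum messages follows; the number of topics and the preprocessing are assumptions, not the paper's exact configuration:

```python
import tomotopy as tp

# messages: forum posts already tokenised and stop-word filtered
mdl = tp.LDAModel(k=10, seed=42)      # number of topics chosen for illustration
for tokens in messages:
    if tokens:                        # tomotopy rejects empty documents
        mdl.add_doc(tokens)

mdl.train(iter=1000)                  # Gibbs-sampling iterations
for t in range(mdl.k):
    top = [word for word, _ in mdl.get_topic_words(t, top_n=5)]
    print(f"topic {t}: {', '.join(top)}")
```

The top words per topic are what a dashboard like the one described below would surface as the "main terms" discussed in each forum analysis.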
The ERS was tested across five experimental cycles in different classes at two Brazilian Federal Institutes, in a Portuguese-language context. The results indicated a positive impact on student learning, with 82% of participants acknowledging the relevance of the recommendations. The system motivated increased participation and collaboration, with a notable trend of students writing more and organizing their ideas more systematically in forum posts. Additionally, 90% of students engaged in other activities proposed by their teachers, demonstrating the effectiveness of the recommendations. The results also demonstrate the system's effectiveness in fostering collaboration, with positive feedback from students and educators. A dashboard was developed for teachers, containing several graphs, including one that shows the main terms discussed in the forum per analysis, in which each edge represents a message from a student containing two of these terms, and the nodes in blue highlight the new terms that emerged relative to the previous analysis (see Figure 2).

Figure 2: Visual analysis of students' collaboration in a discussion forum, where nodes represent actors in the discussion (students/educators) and edges represent interactions.

The development of the ERS represents a significant advancement in promoting collaborative learning in educational settings [6, 7]. By integrating CA into the system, the ERS effectively identifies and enhances collaboration among students. The current implementation of the ERS aims to provide personalized recommendations to students, teachers, and tutors, fostering a more interactive and collaborative learning environment [6]. Future work will explore the integration of additional features, such as wikification and visualization tools, to further enhance the system's capabilities. Furthermore, the research will benefit from the semi-automatic categorization of educational resources in a range of formats, including videos, as in [3].

3 An AI-based Observatory to Assess the Impact of OER Worldwide

Despite the abundance of information available online, some of which is labeled as education-related, it is increasingly hard to find appropriate resources that can serve education, whether at the undergraduate or the professional training level. IRCAI's Open Education Observatory is an initiative dedicated to monitoring, analyzing, and promoting the use of OERs globally. It serves as a hub for research insight and for fostering collaboration, providing valuable insights and data on the adoption, impact, and trends of OER in education systems worldwide. The Observatory supports educators, policymakers, and institutions in leveraging open resources to enhance teaching and learning.
It is designed to support government and institutional decision-makers dedicated to promoting the goals of the 2019 UNESCO OER Recommendation, which is centred on OER but more generally promotes the ideals of Open Education (see Figure 3).

Figure 3: Dashboard of visual modules to analyse the most relevant topics under a certain domain or SDG, and the trends that can direct the preparedness of education actors.

The Open Education Observatory ingests a range of data sources of heterogeneous nature and varying quality and frequency: (i) worldwide news in almost real time, providing information from a vast catalogue of multilingual world news captured in more than 60 languages and based on a variety of Wikidata concepts; (ii) published scientific articles, including journal and conference papers, mostly peer-reviewed, covering more than 126 million articles with yearly updates; (iii) OER policies from the OER Policy Hub (www.oepolicyhub.org), which are input into the OER DC Portal, with subsequent extraction and enrichment of metadata and preparation of dashboards based on filters over the metadata, as well as OECD policy data and metadata on AI and education with yearly updates; (iv) lectures and videos selected and filtered by content from Videolectures.net [10]; (v) a snapshot of worldwide public and private initiatives related to AI and SDG 4 captured by IRCAI's Top100 and related actions; and (vi) a range of worldwide indices with yearly updates on education-related topics, such as the percentage of children out of school or the literacy rate of youth and adults (see Figure 4).

Figure 4: The architecture of the OER Observatory as an Elasticsearch-based system that enables the visualization of heterogeneous data on OERs.
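Figure 4 describes the Observatory as an Elasticsearch-based system. As a sketch of how a focus-area filter over the ingested sources could be expressed (the endpoint, index name, and field names are hypothetical, not the Observatory's actual schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

# Retrieve recent news items tagged with one of the five focus areas;
# index and field names below are illustrative assumptions
resp = es.search(
    index="oer-news",
    query={
        "bool": {
            "must": [{"match": {"body": "open educational resources"}}],
            "filter": [{"term": {"focus_area": "capacity building"}}],
        }
    },
    sort=[{"published": {"order": "desc"}}],
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("title"))
```

A query of this shape combines full-text relevance (the `match` clause) with exact categorical filtering (the `term` clause), which is how keyword-categorized heterogeneous sources can be served per focus area.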
To ensure that content is readily available for each focus area, materials from the mentioned sources are categorized by relevant keywords and concepts closely associated with the five key areas of the Recommendation. This organization allows users to easily filter and access content based on their specific interests within these areas. By doing so, users can tailor their exploration of resources to match their focus, whether it is capacity building, supportive policy development, leveraging equitable access provided by OER, sustainability models, or international cooperation.

For each area, users can filter and find content specific to their domain of interest: up-to-date news and research on OER developments; academic studies related to professional development and relevant lectures for capacity building; information on OER policy development; resources and research focused on effective, inclusive, and equitable access to OER; strategies for developing sustainable OER models; and opportunities for fostering international cooperation through potential new partnerships and shared goals. This organized approach enhances the ability to pinpoint and utilize the most relevant information in each domain. Information generated by the Observatory can be used to aid in the resolution of problems related to the promotion of OER, by identifying trends and major areas of discussion, and to explore successful scenarios through similar challenges and cases. The Observatory provides benefits to a range of stakeholders, including: national governments, by providing access to a variety of perspectives on OER trends for decision-making; educational and research institutions, by facilitating access to resources and data; civil society, by allowing access to information and training materials that explore the knowledge available towards the implementation of the UNESCO recommendations; and the general population, by empowering open education.

4 Open Education for a Better World

The Open Education for a Better World (OE4BW) program is an international online mentoring initiative aimed at advancing the development and implementation of open educational resources (OER) that address topics of significant social impact, in alignment with the United Nations Sustainable Development Goals (SDGs) [2, 14]. As part of the Slo2Svet project, the program received 70 project applications and 87 mentor applications from six continents and 25 different countries (see Figure 5). The program's activities are structured into thematic clusters focusing on areas such as Artificial Intelligence, Displaced Persons, Sustainability, Health and Well-being, Renewable Energy, Education, and Youth (specifically targeting developers aged 12–24). Throughout the project development process, progress was closely monitored by a network of mentors and hub coordinators, providing essential guidance and support to the OER developers. Additionally, within the scope of the Slo2Svet project, evaluation rubrics for the OER projects were developed and will be utilized during the final conference, where developers will present their completed work.

Figure 5: Participants of the OE4BW mentorship in 2023/24.
5 Conclusions and further work

In this paper we discussed research results and opportunities in Open Education, building on an overall perspective of the OER landscape, AI-enhanced student–educator interaction, and mentorship for further progress. We will further explore the potential of the OER Observatory, particularly regarding the appropriate use of LLMs in analyzing compliance with AI policies in education. Regarding the future development of EduColab, in alignment with IRCAI's SDG 4 Observatory, the Videolectures.net research agenda, and the potential for institutional collaboration, we will focus on: (i) appropriate wikification, incorporating suggestions of Wikipedia concepts identified by Wikifier and related to the main discussion topics; (ii) the integration of interactive data visualization presenting graphical representations of collaboration trajectories, topic evolution, and other key indicators; (iii) extending the system by applying the ERS to other datasets, including public and private message exchange logs, to validate and enhance its applicability; and (iv) personalized recommendations, developing a user-based collaborative filtering technique to tailor recommendations more specifically to individual student groups. Moreover, we will explore the pathways of AI-based citizen science in the context of Open Education and how it can be integrated into the wider scope of the SDG4 Observatory.

In the context of the Slo2Svet project, we are conducting a comprehensive analysis of the Open Education for a Better World (OE4BW) mentoring program since its inception, examining outcomes and connections to other initiatives [see, for example, 12]. Additionally, we will develop an evaluation framework to assess the impact of the projects produced through the program, mapping project outputs to the five action areas of the 2019 UNESCO OER Recommendation, using insights provided by automatic text analysis and other AI tools. This will allow us to connect the projects produced by OE4BW to the concrete objectives of the Recommendation, providing examples of practice that can be leveraged to advance its goals.

ACKNOWLEDGMENTS

We thank the support of the Slovenian Research Agency (ARIS) and the Ministry of Foreign and European Affairs (MZEZ) on the project Slo2Svet – Connecting cultures, informing and learning through Open Educational Resources and AI (V2-2363).

REFERENCES

[1] Ahmadian Yazdi, H., Seyyed Mahdavi Chabok, S. J., and Kheirabadi, M. (2022). Dynamic Educational Recommender System Based on Improved Recurrent Neural Networks Using Attention Technique. Applied Artificial Intelligence, 36(1), 2005298.
[2] Drevensek, M., and Urbancic, T. (2022). The Role of Teamwork in the Creation of Open Educational Resources for Closing SDG-Related Knowledge Gaps. Open Praxis, 14(2).
[3] Grcar, M., Mladenic, D., and Kese, P. (2009). Semi-automatic categorization of videos on videolectures.net. In Proceedings of Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2009, Bled, Slovenia, September 7–11, 2009. Springer, pp. 730–733.
[4] Koschmann, T. (2013). Conversation Analysis and Collaborative Learning. In C. Hmelo-Silver, C. Chinn, C. Chan, and A. O'Donnell (Eds.), The International Handbook of Collaborative Learning. Routledge Handbooks, pp. 149–167.
[5] Liu, Q., Huang, J., Wu, L., Zhu, K., and Ba, S. (2019). CBET: Design and evaluation of a domain-specific chatbot for mobile learning. Universal Access in the Information Society.
[6] Moraes Neto, A. J., and Fernandes, M. A. (2019). Chatbot and Conversational Analysis to Promote Collaborative Learning in Distance Education. 2019 IEEE 19th International Conference on Advanced Learning Technologies (ICALT), pp. 324–326.
[7] Moraes Neto, A. J., Fernandes, M. A., and Amiel, T. (2022). Conversational Analysis to Recommend Collaborative Learning in Distance Education. 14th International Conference on Computer Supported Education, pp. 196–203.
[8] Moraes Neto, A. (2024). Sistema de Recomendação Educacional para Diagnosticar e Promover a Colaboração em Ambientes Virtuais de Aprendizagem. Doctoral thesis. Federal University of Uberlândia.
[9] Novak, E., and Novalija, I. (2016). Visual and Statistical Analysis of VideoLectures.NET. Proceedings of SiKDD'16.
[10] Urbančič, T., Polajnar, A., and Jermol, M. (2019). Open education for a better world: a mentoring programme fostering design and reuse of open educational resources for sustainable development goals. Open Praxis, 11(4), pp. 1–18. ISSN 1369-9997.
[11] Urbančič, T., Polajnar, A., and Jermol, M. (2019). Open Education for a Better World: A Mentoring Programme Fostering Design and Reuse of Open Educational Resources for Sustainable Development Goals. Open Praxis, 11(4).
[12] Urbančič, T., et al. (2023). Developing supportive policies and strategies for their implementation: student experience with real-world cases. In Open Educational Resources in Higher Education: A Global Perspective. Singapore: Springer Nature Singapore, pp. 35–53.
[13] Uthus, D. C., and Aha, D. W. (2013). Multiparticipant chat analysis: A survey. Artificial Intelligence, 199–200, 106–121.
[14] Vayansky, I., and Kumar, S. A. P. (2020). A review of topic modeling methods. Information Systems, 94.
[15] Zawacki-Richter, O., Marín, V. I., Bond, M., and Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education – where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 39.
Addressing Water Sustainability Challenges in North Africa with Artificial Intelligence

Mustafa Zaouini, Maurizio Santamicone, Lee Chana – AI in Africa, Johannesburg, South Africa – mus@fliptin.com
Joao Pita Costa*, Davor Orlic, Mihajela Črnko – IRCAI, Quintelligence, Ljubljana, Slovenia – joao.pitacosta@quintelligence.com
Manal Cherkaoui, Anas Ait Aomar, Ikram Chairi, Karima Echihabi – UM6P, Ben Guerir, Morocco – candia@usp.br
Hanaa Hachimi, Y. Kaddouri, I. Lirmaqui, A. H. Alaoui, O. Ignammas, H. Rahhou – Ibn Tofail University, Kénitra, Morocco
M. Wahib Abkari, R. Rachidi, W. Laaleg, Z. Hidila, M. Tabaa – Moroccan School of Engineering Sciences, Casablanca, Morocco
K. Gourari, I. Annaki, B. Jearani, J. S. Trabi, T. Zennouhi, M. Sbaa – UMP University, Oujda, Morocco
T. El Azzoiani, M. Ait Essibaa, A. Hamidine, H. Lachheb – Al Akhawayn University, Ifrane, Morocco

ABSTRACT

The topic of water sustainability has been leading priorities worldwide, and Artificial Intelligence (AI) can position research institutions, public and private companies, and governments towards evidence-based decision-making with regard to water resources. In this particular domain, the amount and heterogeneity of the data generated, together with the rapid progress of scientific research and technological development, have created vast amounts of data, but also significant challenges in gathering, filtering, and making sense of this information. This paper presents the research outcomes of a collaborative effort engaging a total of 51 students mentored by 15 professors across 11 research institutions in North Africa, distributed over 14 selected projects focusing on the appropriate application of machine learning methods to local and national water sustainability problems. These outcomes were motivated by a youth challenge co-organized in May 2024 by AI in Africa and IRCAI with the support of GITEX.
KEYWORDS

Machine learning, text mining, large language models, community engagement, water sustainability, competition

https://doi.org/10.70314/is.2024.sikdd.17

1 Introduction

Building upon common interests, exciting initiatives and existing projects developed by IRCAI and AI in Africa (aiinafrica.org) focused on AI and sustainability, this activity aimed to build capacity within African youth to advance the Sustainable Development Goals (SDGs) through AI, addressing challenges within their own communities and in the region. The AI Youth Challenge originated in discussions started at GITEX Dubai in 2023 and led to a concrete event in the AI Everything section of GITEX Africa at the end of May 2024. It was mostly directed at PhD/MSc students and young entrepreneurs working on AI to solve problems for the good of their communities, exploring a wide range of machine learning methodologies (from image recognition on satellite imagery, to text mining on social media, gamification strategies optimizing water consumption, and the application of LLM frameworks for RAG and AI Agents in the context of water sustainability), and engaging experts from global agencies such as UNESCO, the AI Movement, and UNESCO's Water Education Institute, as well as national companies, research institutions and government. The global challenge of this action, "Water, AI and Sustainability", is one of the MENA priorities; it takes into consideration the UN Water Programme for 2024–25 [12] and follows the work done by IRCAI with the European Commission (EC) on the NAIADES Water Observatory [9], as well as the recently opened IRCAI Committee on AI and Water Resource Management [4], focusing on the impact of AI on SDG 6 [11]. This work aligns with UNESCO's interest in taking action to capacitate the youth towards AI, with a focus on the recent activities based in Morocco but with a global scope, including the opening of the new UNESCO AI Centre, the AI Movement (aim.um6p.ma).

Figure 1: Winner of the AI4Water challenge, designed and developed by UM6P students, exposing a water map that pinpoints remote villages with assigned water scores based on satellite imagery and crowdsourced data.

2 Finalist innovative ideas on water sustainability

Attracting the participation of more than 50 PhD and MSc students across 20 teams based in research institutions in Morocco, this initiative was designed to encourage a conversation between communities, corporate thought leaders, education visionaries, and ecosystem builders around the shifts and needs of the changing future landscape. The discussions brought together researchers, start-up communities, technologists, and government representatives to unite and define the future of water sustainability as they see it. The selected AI technologies and methodologies ranged from the use of satellite imagery to the analysis of news and social media, input from water-related sensors, and the application of Large Language Models (LLMs) to describe good practices.
We proceed by describing the problems addressed by the finalists of the AI4Water challenge, their prototypes, and the value of the innovation they brought with them.

AquaScore. Rural communities in Morocco's High Atlas Mountains struggle with water management due to limited resources and visibility. Despite needing only modest funds, these villages face significant hurdles in accessing support. The challenge lies in objectively quantifying water issues and connecting these communities with potential supporters. AquaScore creates a water map that pinpoints remote villages and assigns them water scores based on satellite imagery and crowdsourced data. This enables ranking villages by water criticality, helping funders and supporters identify where to direct their assistance effectively. The prototype (described in Figure 1) also offers a platform for discussing water solutions, fostering community engagement through gamification features. By increasing the visibility of rural Moroccan villages and providing objective water criticality assessments, AquaScore facilitates efficient resource allocation for donors and experts. This AI-driven approach ensures fair and unbiased assistance to communities in need, promoting water sustainability and improved water management in constrained environments.

AquaScore employs a hybrid approach combining Computer Vision (CV) and Natural Language Processing (NLP). CV algorithms segment satellite images to generate automated baseline water scores, while NLP algorithms extract insights from textual data to enhance score accuracy. This combination allows for objective assessment and continuous improvement of the water criticality rankings. The team has already aggregated data on 1,322 High Atlas villages, extracted satellite images, and segmented them using Facebook's Segment Anything model; this process was completed on UM6P servers using 500 GB of storage and 80 CPU cores. The system will incorporate user-submitted reports and internet-scraped data to further refine the water scores. The uniqueness of AquaScore lies in its data generation and refinement approach: it creates datasets in areas with data scarcity, starting from an automated baseline derived from satellite imagery and then enriching it through user-generated content. This closed-loop system employs active learning, progressively enhancing the accuracy and relevance of the water scores.

AquaSense. Water management is a critical issue in many countries, including Morocco. Severe droughts, poor water distribution, and recent natural disasters raise the urgent need for better solutions to manage water resources effectively. AquaSense's prototype (see Figure 2) offers a smart way to handle water resources by predicting future water situations, visualizing key data, and engaging citizens and communities. This helps decision-makers plan better, save resources, and respond quickly to local water issues. AquaSense provides accurate forecasting of water parameters for informed management and answers water-related questions with detailed analysis using the latest data and news. It offers transparent data visualization through interactive charts, allowing users to view and upload data easily. The community and citizens' space features real-time news updates, a water-levels map to locate and help regions in need of water, and a tool to easily report local water issues.

Figure 2: Screenshot of the AquaSense prototype, defining parameters, visualizing data and monitoring engagement.

AquaSense combines two distinct branches of AI: deep learning (LSTM) and generative AI (RAG and AI Agents). It uses multivariate, multistep LSTMs to predict water parameter levels for the coming years, and Retrieval-Augmented Generation and AI Agents to answer water-related queries with detailed analysis, using the latest data, news, predicted parameters, and documents from sources such as the UN, UNCCD, and EPA. AquaSense uses TensorFlow and Keras (LSTM model), Pandas and NumPy (data preparation and management), LangChain (LLM framework for RAG and AI Agents), Chroma (vector DB), Nomic embeddings (open-source embeddings), GPT-3.5-Turbo (LLM), and Streamlit (web app). AquaSense improves water management by helping stakeholders make informed decisions, enhancing resource allocation, and promoting sustainable practices. Through its innovative features, it bridges the gap between citizens and authorities, fostering collaboration and reducing water crises over time. AquaSense also aligns with several UN Sustainable Development Goals (SDGs), such as SDG 6 (Clean Water and Sanitation), SDG 13 (Climate Action), and SDG 11 (Sustainable Cities and Communities).
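A minimal Keras sketch of the multivariate, multistep LSTM component described above follows; the window sizes, number of features, and layer sizes are illustrative assumptions, not AquaSense's published architecture:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

LOOKBACK, HORIZON, N_FEATURES = 24, 12, 4   # illustrative window sizes

def make_windows(series, lookback=LOOKBACK, horizon=HORIZON):
    """Slice a (time, features) array into supervised (X, y) windows."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        # predict the first feature (e.g., a water level) over the horizon
        y.append(series[i + lookback:i + lookback + horizon, 0])
    return np.array(X), np.array(y)

model = models.Sequential([
    layers.Input(shape=(LOOKBACK, N_FEATURES)),
    layers.LSTM(64),
    layers.Dense(HORIZON),        # direct multistep output
])
model.compile(optimizer="adam", loss="mse")
# X_train, y_train = make_windows(scaled_history)
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```

The `Dense(HORIZON)` head emits all future steps at once (direct multistep forecasting); an alternative design would feed predictions back recursively, trading error accumulation against a simpler output layer.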
Water Consumption Tracker. This prototype addresses the global problem of water-use optimization in light of the already visible consequences of climate change, namely the large amount of water wasted by households through irresponsible use. Its added value lies in the behavioral approach: the application is designed to make users more aware of their attitude toward water consumption and to make water conservation a pleasure rather than a responsibility. Introducing gamification as a new strategy should help make water conservation more appealing. The prototype is based on an app that tracks real-time water usage, provides personalized recommendations, and motivates users through a gamified environment, fostering a community focused on sustainable water use.

The team uses machine learning models such as a Random Forest Regressor to find patterns between household characteristics and their water-usage behavior, and plans to add generative AI in the form of an LLM-based chatbot that provides custom tips to optimize water usage. The approach is fundamentally based on: (1) collecting data about the households through the application UI; (2) providing an optimal water consumption level via the ML model, based on the collected data; and (3) monitoring water usage through IoT sensors and the app's notification system. The collected data is used to optimize the ML model's performance. The approach can potentially reduce household water waste by 20–50% by educating users about their consumption habits through notifications, ranking systems, and feedback mechanisms.
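A sketch of the Random Forest step, assuming household characteristics and metered usage are available as a DataFrame; the column names and feature set are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# households: illustrative features (size, garden area, appliances, ...)
# plus observed daily water use in litres; column names are assumptions
X = households.drop(columns=["daily_litres"])
y = households["daily_litres"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, rf.predict(X_te)))

# Compare each household's actual use against the model's estimate for
# similar households, to drive notifications and gamified rankings
households["expected"] = rf.predict(X)
households["excess"] = households["daily_litres"] - households["expected"]
```

The `excess` column is one plausible way to turn the regression into behavioral feedback: households consuming well above the prediction for comparable homes are the natural targets for conservation tips.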
Aquatic Biodiversity. The introduction of non-native species into marine ecosystems presents a significant threat to the fragile equilibrium of these vital environments. Invasive species, often aggressive, can outcompete native organisms, leading to disrupted food chains, altered habitats, and potentially irreversible ecological harm. From coastal areas to the open sea, the swift proliferation of invasive plants, animals, and microorganisms endangers the biodiversity, productivity, and resilience of marine life. Addressing this escalating global issue requires immediate and decisive action. AI-powered early-detection algorithms were prepared to constantly monitor for signs of invasive species, triggering immediate alerts to enable a rapid response. Based on species-specific data, the system can precisely deploy the most effective eradication methods, from underwater drones to selective biocides. As invasive species evolve, the AI-driven platform continuously adapts its strategies, ensuring that the interventions remain effective and environmentally responsible.

YAZ. High unemployment rates in North Africa often translate into many individuals being employed in low-wage jobs, particularly youth from low-income households. Severe water scarcity is decreasing exports and raising the prices of vegetables and fruits, creating challenges in meeting the needs of Morocco's population while remaining a major exporter of produce to global markets. This AI-based agricultural solution is built on smart hydroponic towers designed to grow crops efficiently and vertically, indoors and outdoors, making optimal use of the available space. The adoption of hydroponics in Africa has the potential to create millions of new jobs in the coming years. Integrated with a GPT architecture, the technology allows real-time monitoring, pest detection, and yield estimation. YAZ hydroponics represent a shift towards resilient and sustainable Moroccan agriculture.

The tools and technologies presented in this paper that are open source are available in IRCAI's SDG Observatory GitHub repository (github.com/IRCAI-SDGobservatory).

3 From concept to prototype in a month

AI in Africa, in collaboration with IRCAI, conducted a gathering of minds which culminated in a one-day summit around the technologies and shifts of the future, hosted by GITEX in the AI Everything section of GITEX Africa 2024. Between 26 April and 31 May, 55 PhD and MSc students from 11 research institutions took part in a complete program including expert sessions kicked off at the AI Movement, UNESCO's new centre for AI in Africa, and engaging experts in water-related topics such as Matjaž Mikoš, UNESCO Chair for landslide risk reduction, droughts and floods, discussing our recent research on news mining for extreme weather events [5, 6]; Gerald Corzo Perez, senior researcher at the UN Water Education Institute IHE Delft, discussing our ongoing research on water, AI and Twitter [7]; and Ignacio Casals, R&D manager at Aguas de Alicante, Spain, providing an industrial perspective on the use of AI to tackle the challenges of wastewater management [8].

The students were followed across eight stages, including conceptualization; data collection, analysis and visualization; methodology and implementation; prototype building; and the pitch (see Figure 3).

Figure 3: The pitch of one of the top 3 teams – Ghayt – presenting the Water Consumption Tracker at the AI stage of GITEX Africa.

In order to maximize the impact of the programme, the content from the abovementioned opportunities will be organized across the five areas most relevant to UNESCO: (1) capacity building; (2) developing supportive policy; (3) effective, inclusive and equitable access to quality education; (4) nurturing and creating sustainability models for water sustainability; and (5) fostering and facilitating international cooperation.
The training curriculum included weekly seminars open to the public, training workshops for participants, showcases, and mentoring sessions (see Figure 4).

Figure 4: The phases of the training curriculum across 5 weeks.

The discussions forming the base concepts of the participants' projects were held in the light of IRCAI's research and research achievements (see Figure 5), aiming at building research collaboration bridges.

Figure 5: Selected topics from IRCAI's research to motivate challengers in AI and water research.

The data and methods generated by the participants of the programme can be used by companies, government and research institutions to aid in the resolution of problems related to water sustainability, by identifying trends and major areas of discussion, and to explore successful scenarios through similar challenges and cases. IRCAI's SDG 6 Observatory [10] is being built to properly address the challenges of decision-makers using AI. It benefits: (i) national governments, providing access to a variety of perspectives (including trend and comparative views) on a data-driven dashboard with information on water sustainability trends for decision-making, and access to local (e.g., country-level) progress on SDG 6; (ii) educational institutions, offering access to information on current trends in water sustainability research and development; (iii) research institutions, sourcing open data through interactive visualisation and research; (iv) the NGO community, easing access to information directly linked to community priorities, including citizen-science activities; and (v) the general population, empowering water education for all.

4 Conclusions and further work

Capacity building to enhance opportunities can benefit from engaging the youth in AI-driven challenges that start from research problems deriving from issues in their own communities — problems they know well, and data to which they often have privileged access — with promising impact that can ensure the sustainability of the innovation offered. The initiative also served to collaboratively discuss sustainable solutions that help large-scale recovery and define a better, more hopeful and inclusive Africa. The winning outcomes of this challenge will join a vibrant worldwide community of researchers and entrepreneurs focusing on AI and the SDGs, starting with SDG 6, and supported by initiatives such as IRCAI's Top 100 or the SDG Observatory. Ethical considerations are being addressed in the context of the EC project AI4GOV.

ACKNOWLEDGMENTS

This research was partially funded by the European Commission's Horizon research and innovation programme under grant agreements 820985 (NAIADES) and 101120237 (ELIAS).

REFERENCES

[1] Blazhevska, V. (2020). United Nations launches framework to speed up progress on water and sanitation goal. United Nations Sustainable Development.
[2] Casale, G., and Cordeiro Ortigara, A. R. (2019). Water in the 2030 Agenda for Sustainable Development: How can Europe act? Water Europe, Brussels. ISBN 978-90-8277064-3, 36 p. https://unesdoc.unesco.org/ark:/48223/pf0000372496
[3] International Water Association and Xylem Inc. (2019). Digital Water: Industry leaders chart the transformation journey. [Online] https://iwa-network.org/wp-content/uploads/2015/12/IWA_2019_Digital_Water_Report.pdf
[4] IRCAI Committee Chair on AI and Water Resource Management. [Online] ircai.org/project/ai-and-water-resources-management/
[5] Mikoš, M., Bezak, N., Pita Costa, J., Nassri, M. B., Jermol, M., and Grobelnik, M. (2022). Natural-hazard-related web observatories as a sustainable development tool. In Progress in Landslide Research and Technology, Vol. 1, No. 1, Springer (in print).
[6] Pita Costa, J., Rei, L., Bezak, N., Mikoš, M., Massri, M. B., Novalija, I., and Leban, G. (2024). Towards improved knowledge about water-related extremes based on news media information captured using artificial intelligence. International Journal of Disaster Risk Reduction, 100, p. 104172.
[7] Perez, G., Pita Costa, J., Novalija, I., Rei, L., Senožetnik, M., and Casals del Busto, I. (2024). Integrating Social Media, News and Machine Learning for Enhanced Hydrological Event Detection and Management. In 15th International Conference on Hydroinformatics (p. 278).
[8] Pita Costa, J., Massri, M. B., Novalija, I., Casals del Busto, I., et al. (2021). Observing Water-Related Events for Evidence-Based Decision-Making. In Slovenian Data Mining and Data Warehouses conference (SiKDD 2021).
[9] Pita Costa, J. (2022). Water Intelligence to Support Decision Making, Operation Management and Water Education – the NAIADES Report. IRCAI Library. [Online] https://ircai.org/project/ircais-project-report-on-naiades/
[10] Pita Costa, J., Zaouini, M., Crnko, M., Polzer, M., Corzo Perez, G., Mikoš, M., Orlic, D., and Jermol, M. (2024). Challenging Water Sustainability in Africa Through AI. Proceedings of the HHAI 2024 workshop on AI in Africa and SDGs.
[11] UN Sustainable Development. The IRCAI Water Observatory – AI in the service of SDG 6. [Online] https://sdgs.un.org/partnerships/ircai-water-observatory-ai-service-sdg-6
[12] UN-Water Work Programme 2024–2025. [Online] https://www.unwater.org/publications/un-water-work-programme-2024-2025
Predicting poverty using regression

Luka Urbanč, Jožef Stefan Institute, Ljubljana, Slovenia, urbancluka3@gmail.com
Marko Grobelnik, Jožef Stefan Institute, Ljubljana, Slovenia, marko.grobelnik@ijs.si
Joao Pita Costa, IRCAI, Quintelligence, Ljubljana, Slovenia, joao.pitacosta@quintelligence.com
Luis Rei, Jožef Stefan Institute, Ljubljana, Slovenia, luis.rei@ijs.si

Abstract

Poverty reduction is the first Sustainable Development Goal set by the United Nations to be achieved by 2030, but current data indicates that progress is insufficient. The diverse factors influencing poverty across different nations pose a challenge to developing effective predictive models. This paper evaluates the use of various regression models to predict poverty rates using a comprehensive dataset of 111 variables from sources such as the UN and the World Bank. The data, spanning multiple domains such as political stability, education, and economic conditions, was preprocessed and transformed to create auxiliary features and interactions. Among the models, Ridge regression yielded the best results, achieving a Root Mean Square Error (RMSE) of 3.6, indicating high predictive accuracy on a global scale. This study highlights the importance of addressing multicollinearity and incorporating a wide range of features to improve the generalizability of poverty prediction models. Future research should explore more complex methods, such as neural networks, and refine model hyperparameters for enhanced performance.

Keywords

poverty, linear regression, lasso regression, ridge regression, elastic net regression, sustainable development goals

https://doi.org/10.70314/is.2024.sikdd.20

1 Introduction

The need to eradicate poverty has been a long-standing issue, globally recognized numerous times, most prominently in the United Nations (UN) Sustainable Development Goals (SDGs), where it occupies the number one spot as SDG1: "End poverty in all its forms everywhere", to be achieved by 2030. The latest UN report on the progress made towards SDG1 indicates that poverty has returned to pre-pandemic levels in middle- and high-income countries, with poverty in low-income countries still a fraction above that reported in 2019. While the trends seem to be going in the right direction, the UN warns that the current pace of improvement is insufficient to reach the agreed goals before 2030. This raises the question of what impacts poverty rates the most and how countries can most effectively reduce poverty levels.

To fully understand and address the issue of poverty, one must navigate several definitions, which can often lead to confusion. The baseline definition used in this paper is the poverty line as defined by each country individually, recognizing that different countries have different measures of, e.g., what life conditions and how much income make an individual reach "poor" status, as well as how this can be normalised to better compare such relative indicators between countries. We are still missing a clear theory in poverty research, despite the issue having existed for decades [2]. That being said, some authors have already explored the causes of poverty. For instance, corruption, political instability, ineffective local governance, government policies, gender inequality, and short-term wage replacement policies, such as maternity leave benefits and sickness pay, impact relative poverty [6, 7]. When assessing what people believe causes poverty, some geographical differences emerge: for example, people in the United States mostly hold the view that an individual's traits are responsible for poverty, while countries in Europe show a blend of individualistic, fatalistic and structural beliefs, such as lack of will, bad luck and social injustice, respectively [4].
Although a number of papers have already been published on the use of ML to predict poverty [1, 10, 12, 5, 3, 8] (for more see [11]), including work combining satellite images and neural networks to help predict poverty in five African countries [5], most take a limited number of variables into account. Usmanova's literature review found 22 papers published between 2016 and March 2022, with a total of 57 AI methods applied, the most popular being random forest, used in more than half of all papers reviewed. It also found that most papers focus only on African and South Asian countries, a finding consistent with our own [11].

In this paper we focus on the following research questions: (i) can regression be used to identify the most influential features from a large number of global indicators; and (ii) can direct and indirect causality relations be identified that signal new indicators relevant to poverty-related issues?

2 Data
To address the research questions, we utilized 111 primary variables from sources such as the UN and the World Bank, aggregated through the Our World in Data portal. These variables span diverse domains, including political stability, policies, education, healthcare, economic conditions, and inequality. We prioritized features that prior research has identified as significant, while also incorporating some factors that are less intuitively linked to poverty. The dataset was then used to train various models aimed at predicting poverty rates across countries. This task is particularly challenging because countries respond differently to the same variables. For instance, GDP growth tends to have a more significant impact on poverty reduction in developing nations compared to developed ones.
Additionally, many variables are strongly correlated, making it difficult for linear regression models to capture their relationships accurately.

As previously mentioned, most of the data used in this paper was sourced from ourworldindata.com (OWiD), with some additional data coming from fao.org, including variables such as foreign direct investment inflows and outflows, and the added value of agriculture, among others. Data on the transatlantic slave trade and colonial rule was obtained from www.slavevoyages.org. All datasets were preprocessed before being merged, following a series of steps.

The first preprocessing step involved light modifications, such as removing irrelevant columns, renaming columns, and excluding data from before 1987 and after 2023 due to gaps and incomplete data. Despite increased reporting in recent years, many countries still omit certain indicators, complicating model training. To address this, missing features with more than n data points for a given country were interpolated, with the edges filled using backward fill (bfill) and forward fill (ffill). Those with fewer than n data points used the mean of the country's income group for the given year as a filler value. The number n was intuitively chosen to be five, and the methods bfill and ffill were chosen to prevent the use of unrealistic data. The World Bank classifies countries into income groups by gross national income per capita: low (less than 1,045 USD), lower-middle (1,046 USD to 4,095 USD), upper-middle (4,096 USD to 12,695 USD), and high income (12,696 USD or more). However, it is important to note that the data generated using the aforementioned methods somewhat reduces overall robustness.
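A minimal sketch of this two-tier gap-filling strategy, assuming a long-format pandas DataFrame with country, year and income_group columns (the column names are our assumptions; the paper's actual implementation may differ):

```python
import pandas as pd

N_MIN = 5  # minimum data points per country required for interpolation

def fill_indicator(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Fill gaps in one indicator column, country by country."""
    # Mean of the country's income group for each year, used as a fallback.
    group_mean = df.groupby(["income_group", "year"])[col].transform("mean")

    def fill_country(s: pd.Series) -> pd.Series:
        if s.count() >= N_MIN:
            # Enough observations: interpolate interior gaps, then
            # extend the edges with backward and forward fill.
            return s.interpolate().bfill().ffill()
        # Too sparse: fall back to the income-group mean for that year.
        return s.fillna(group_mean[s.index])

    df[col] = df.groupby("country", group_keys=False)[col].apply(fill_country)
    return df
```

Restricting interpolation to countries with at least five observations keeps the bfill/ffill edges anchored to real measurements, which is the stated rationale for preferring these methods over purely synthetic fills.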
The next step involved generating auxiliary columns, specifically lagged columns and changes in value for relevant parameters. For instance, the row corresponding to Niger in 2013 would also include the GDP per capita for 2012, 2011, and earlier years, in addition to the value for 2013. This approach reflects the fact that poverty trends often manifest in response to changes over time, rather than immediately. The default number of years for lagged data was set to five. Similarly, we incorporated changes in value over the same five-year period to capture more explicit data on unusual events, such as the onset of wars or significant political changes.

Next, each primary parameter was also used as an argument for a number of mathematical functions in an effort to see whether any correlations are not linear but perhaps quadratic, cubic or another elementary function. The functions used were x^2, x^3, ln x, sin x, cos x, tan x, arcsin x, arccos x and arctan x, to try to capture any elementary nonlinear dependence within the model.

The last step was to create all possible products of the available primary parameters, as creating all possible products with all auxiliary parameters included would have been computationally inefficient. After all these steps, the individual columns were fused together. This method of preprocessing increases the number of possible variables included, making the model even more general while retaining as many rows of data as possible.

The function responsible for preprocessing, generating and merging the data has a few parameters: basic_parameters_only, combinations and math. basic_parameters_only determines whether the model will only contain data obtained from various online databases, or whether it should also include generated data: the changes in value and the values for previous years. combinations determines whether the model should create all possible combinations of the primary parameters, and math determines whether mathematical columns are included in an attempt to gain a deeper insight into the features' relationships. The parameters are marked with B, C and M. For instance, B+M would mean the file contains all the basic parameters in addition to the mathematically derived columns.

Figure 1: Scheme of adopted methodology.

3 Methodology
In order to predict worldwide poverty levels, we have used different linear regression models and compared their accuracies. With this we aimed to ease the interpretability of the models, which is harder to obtain with more complex methods such as neural networks. To perform the research work that is the base of this paper, we have selected ordinary linear regression, lasso regression, ridge regression and elastic net regression as the models to compare. OLS regression struggles with multicollinearity, where predictor variables are highly correlated, leading to unstable estimates of the coefficients. Ridge regression addresses this by adding an L2 regularization term, which penalizes large coefficients and helps to stabilize the estimates in the presence of multicollinearity. By shrinking the coefficients, ridge regression reduces the sensitivity of the model to collinear predictors, ensuring more reliable and generalizable results. Unlike lasso, ridge retains all predictors, making it particularly useful when multicollinearity is a key concern but feature selection is not the goal. We use the implementation of these linear regression algorithms in scikit-learn [9].

The datasets were split into training and test sets using the sklearn function train_test_split, with 80% for training and 20% for testing. The training set was used to train four regression variants (LinearRegression, Lasso, Ridge, ElasticNet) with a random state seed of 42, while the test set was used to determine the mean squared error (MSE) and the R² value using the functions mean_squared_error and r2_score from [9], both common metrics used to assess a model's accuracy. All models except OLS regression also had the data standardized before training. The hyperparameter α for the regularized models was sensibly chosen as 0.1. The results, seen in Table 1, are color coded: red for poor performance, yellow for intermediate, and green for the best. The variation in the number of rows is due to the exclusion of rows with insufficient yearly data, which were dropped when calculating differences from previous years.
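The training and evaluation loop described above is straightforward to express with scikit-learn; this sketch assumes a preprocessed feature matrix X and a poverty-rate target y (both assumptions, produced by the pipeline of Section 2):

```python
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# 80/20 split with the fixed random seed used in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the features for the regularized models (all except OLS).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "linear": (LinearRegression(), X_train, X_test),
    "lasso": (Lasso(alpha=0.1), X_train_s, X_test_s),
    "ridge": (Ridge(alpha=0.1), X_train_s, X_test_s),
    "elastic_net": (ElasticNet(alpha=0.1), X_train_s, X_test_s),
}

for name, (model, X_tr, X_te) in models.items():
    model.fit(X_tr, y_train)
    pred = model.predict(X_te)
    print(name, mean_squared_error(y_test, pred), r2_score(y_test, pred))
```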
After identifying the most successful model, we proceeded to compare its performance between high-income and low-income countries. This comparison aimed to assess how the accuracy and frequency of reported data influence the model's performance. These two income groups were chosen because low-income countries typically report less data with lower accuracy, while high-income countries provide more precise reports. We selected all high- and low-income countries from the dataset that were not used during the model's training. From the 20% of data reserved for evaluation, 444 rows (30%) belonged to high-income countries, and 368 rows (24%) belonged to low-income countries.

We used the trained model to predict poverty levels for these groups and evaluated its performance using the MSE metric to analyze differences between income groups. Additionally, we calculated the maximum error to determine whether the average performance was skewed by outliers. A similar evaluation was conducted on the data from Slovenia and Somalia, which were part of the split. Slovenia had 8 rows of data and Somalia had 6, allowing us to explore how missing data impacts the model's performance, as Somalia had significantly fewer data points overall.

4 Main Results
The file configuration plays a critical role in the model's performance. The results show that C+M, C, and B+C are the best configurations. The C+M file includes all basic features, lagged values, changes in value, mathematical columns, and all possible combinations of basic parameters, totaling 8,236 parameters. Configuration C contains all basic features, combinations, and lagged and difference columns. Lastly, B+C includes only the basic parameters and their combinations. All top-performing models were trained on these datasets.

The results in Table 1 show considerable variation. Models trained with ordinary least squares regression performed poorly, with the best such model reaching an RMSE just under 10.15 and an R² of 0.50. In contrast, lasso and elastic net regression achieved better results, with RMSEs around 7 and R² values close to 0.80. Ridge regression also struggled, except for configuration B+C, which provided the best overall results with an RMSE of 3.6 and an R² of 0.94.
However, caution is advised when interpreting models using configuration C+M or C, due to the high number of features relative to the dataset size, which could affect their real-world reliability.

Table 1: MSE and R² values for different regression models and dataset configurations. The presence of B, C or M signals the presence of basic parameters only (B), combinations (C) and mathematically derived columns (M) in the dataset. A dash is used to label non-converging models with a negative R² value.

Structure | Linear MSE | Linear R² | Lasso MSE | Lasso R² | Ridge MSE | Ridge R² | Elastic net MSE | Elastic net R² | Shape of X
M         | 203        | 0.031     | 74        | 0.65     | -         | -        | -               | -              | (7653, 2131)
None      | -          | -         | 109       | 0.48     | 163       | 0.22     | 108             | 0.49           | (7653, 1221)
C+M       | 198        | 0.054     | 45        | 0.78     | -         | -        | 40              | 0.81           | (7653, 8236)
C         | -          | -         | 50        | 0.76     | -         | -        | 45              | 0.79           | (7653, 7326)
B         | 103        | 0.50      | 110       | 0.47     | 103       | 0.50     | 111             | 0.46           | (7661, 111)
B+C       | -          | -         | 48        | 0.77     | 13.3      | 0.94     | 43              | 0.79           | (7661, 6216)

The model weights reveal that only products are present among the top ten most important factors. These products include data on population, population density, agriculture, equality, healthcare, and education. The largest weights show the biggest differences, gradually decreasing in magnitude. The top ten weights range from just over 10 to 7, with the highest weights involving combinations such as population and population density, meadows and pastures with the global peace index, and population with urban and rural population share. Other notable combinations include secondary school completion with women's civil liberties, internet usage with sanitation access, and military spending with wealth distribution. The weights also reflect factors like infant mortality, years colonized, and agricultural employment. Figure 2 further illustrates the decline in the absolute value of these weights.

Figure 2: Visual representation of model weights.

The model performed better on high-income countries, with an MSE of 6.60, significantly below the overall MSE. In contrast, the MSE for low-income countries was 20.68. The maximum error was also lower for high-income countries (22.1) compared to low-income ones (34.4).

The difference in the model's performance on Slovenia and Somalia was notable. For Slovenia, the MSE was 0.78 with a maximum error of 1.54, far below the overall metrics. Somalia, however, had a much higher MSE of 95.7 and a maximum error of 18.7, likely due to less reliable and more extreme poverty data, which skews the model's performance on extreme cases.

5 Discussion
Firstly, the fact that ordinary least squares linear regression could not produce an accurate model confirms that the parameters are indeed correlated. This is probably also the reason why the ridge regression model performed the best: ridge regression is designed to address the issue of multicollinearity, and the features included are mostly strongly correlated, as stated in the introduction. Furthermore, the correlation between parameters is obviously drastically increased by generating all possible products of basic parameters.

Secondly, the impact of mathematical columns needs to be considered. Of the first four models, two have mathematical columns and two do not. Of the eight models generated, three perform worse if mathematical data is present, while five performed better with mathematical data included. This might indicate some deeper connection, which would be interesting to try to understand. Furthermore, lasso regression handles mathematical columns much better than the other models used, due to its ability to exclude features.

The impact of product combinations of basic features stands out, with all better-performing models having the combinations parameter set to True, suggesting deeper relationships between variables. Exploring these connections further, perhaps by training a neural network on the basic parameters and comparing it to the linear regression models, could be insightful. If the neural network performs better, further investigation into these correlations would be needed.

The dataset used spans from 1987 to 2023, which is relatively short, given that poverty often has deep historical roots. Although data becomes scarcer in earlier years, those points could still be crucial for improving model accuracy. Moreover, most hyperparameters in this paper were chosen sensibly due to time and computational constraints. Different values for the number of lagged years, years of differences, hyperparameters in the training of models and the minimum number of data points required to interpolate missing data could all lead to interesting discoveries and improvements of the generated models.
We will be addressing this in further research.

As stated in [11], the recent literature mostly uses the random forest model and, in fact, ordinary linear regression was not even among the top ten most common methods. An interesting thing to explore would therefore also be the performance of random forest using the best configuration, B+C. The models may struggle to capture correlations between variables due to differing impacts across countries, as mentioned in the introduction. A potential solution is to split the countries into k groups and train separate models for each group (a sketch of this idea is given below). While this could improve predictions, it raises two challenges: how to split countries without bias and how to ensure enough data for training.

The weights in the model further emphasize the issue of multicollinearity among the parameters, with only product terms emerging as the most influential. However, this does not reveal the true importance of individual parameters, as they may enhance the impact of another factor within the product term. Additional research is needed to better determine the true significance of these parameters and gain a clearer understanding of what drives poverty rates up or down. As can be seen in Figure 2, the model's weights occupy a wide range. It is clear that some features are more important, based on their weights, and further work is being done to understand which features stand out and why.

The model also performed better in predicting poverty levels in high-income countries compared to low-income countries. This discrepancy can likely be attributed to the fact that high-income countries report more data with greater accuracy, allowing the model to identify underlying patterns more effectively. In contrast, much of the data for low-income countries had to be interpolated, which reduced variability between countries and negatively impacted the model's performance.
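To make the k-group idea mentioned above concrete, one possible realization is to cluster countries on their average indicator profiles and fit one ridge model per cluster. This is only an illustrative sketch: the use of KMeans, the number of groups and the column names are our assumptions, not part of the paper's experiments.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

K = 4  # number of country groups; an arbitrary illustrative choice

# Cluster countries on per-country mean indicator values (assumes df,
# feature_cols and a "poverty_rate" target column exist and contain
# no missing values after the preprocessing of Section 2).
profiles = df.groupby("country")[feature_cols].mean()
groups = KMeans(n_clusters=K, random_state=42).fit_predict(profiles)
country_to_group = dict(zip(profiles.index, groups))

# Fit one ridge model per group of countries.
group_models = {}
for g in range(K):
    members = [c for c, grp in country_to_group.items() if grp == g]
    rows = df["country"].isin(members)
    group_models[g] = Ridge(alpha=0.1).fit(
        df.loc[rows, feature_cols], df.loc[rows, "poverty_rate"])
```

Note that clustering on the data itself does not resolve the bias concern raised above; it merely replaces a manual split with a data-driven one.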
6 Conclusion
In this paper, we have shown that a general model exists, based on linear regression methodologies, which can predict poverty with a relatively high accuracy (RMSE of 3.6). This was achieved through the testing of numerous linear regression models using open data, with the best model created by ridge linear regression trained on data which also included all possible combinations of the basic features in the dataset. The basic parameters consist of 111 different indicators describing countries across 36 years. Better models could possibly be generated using more complex methods such as neural networks or random forest, gaining in accuracy but compromising the explainability of the model. The models could also benefit from hyperparameter tuning throughout the whole process to improve results and find the optimal values. Our result shows it is possible to achieve this degree of accuracy, but it does not limit what the best model could be. The elastic net, especially, should benefit from such tuning.

7 Acknowledgements
This research was partially funded by the Future of Life Institute under the project "An AI-driven Observatory Against Poverty", and by the European Commission's projects under grant agreements 101135800 (RAIDO) and 101120237 (ELIAS).

References
[1] Gianni Betti, Antonella D'Agostino, and Laura Neri. 2002. Panel regression models for measuring multidimensional poverty dynamics. Statistical Methods and Applications, 11, 359–369.
[2] David Brady. 2019. Theories of the causes of poverty. Annual Review of Sociology, 45, 1, 155–175.
[3] Muse A.H., Hassan A.A., and Chesneau C. 2024. Machine learning study using 2020 SDHS data to determine poverty determinants in Somalia. Scientific Reports, 14, 5956.
[4] Dariush Hayati and Ezatollah Karami. 2005. Typology of causes of poverty: the perception of Iranian farmers. Journal of Economic Psychology, 26, 6, 884–901.
[5] Neal Jean, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, and Stefano Ermon. 2016. Combining satellite imagery and machine learning to predict poverty. Science, 353, 6301, 790–794.
[6] A.H. Ng, Abdul Ghani Farinda, Fock Kui Kan, Ai Ling Lim, and Teo Ming Ting. 2013. Poverty: its causes and solutions. International Journal of Humanities and Social Sciences, 7, 8, 2471–2479.
[7] Rense Nieuwenhuis, Teresa Munzi, Jörg Neugschwender, Heba Omar, and Flaviana Palmisano. 2019. Gender equality and poverty are intrinsically linked: a contribution to the continued monitoring of selected sustainable development goals. Tech. rep. LIS Working Paper Series.
[8] Shah O. and Tallam K. 2023. Novel machine learning approach for predicting poverty using temperature and remote sensing data in Ethiopia. arXiv preprint arXiv:2302.14835.
[9] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[10] Mubaraq Dele Sulaimon. 2020. Multidimensional poverty and its determinants: empirical evidence from Nigeria.
[11] Aziza Usmanova, Ahmed Aziz, Dilshodjon Rakhmonov, and Walid Osamy. 2022. Utilities of artificial intelligence in poverty prediction: a review. Sustainability, 14, 21, 14238.
[12] Huang Zixi. 2021. Poverty prediction through machine learning. In 2021 2nd International Conference on E-Commerce and Internet Technology (ECIT). IEEE, 314–324.

Fact Manipulation in News: LLM-Driven Synthesis and Evaluation of Fake News Annotation

Luka Golob, Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, lukag26@gmail.com
Abdul Sittar, Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, abdul.sittar@ijs.si

Abstract
Advancements in artificial intelligence and increased internet accessibility have made it simpler to create and disseminate fake news with customized content. However, they have also improved the ability to analyze and identify such misinformation. To effectively train high-performance models, we require high-quality, up-to-date training datasets. This article delves into the potential for generating fake news through factual modifications of articles. This is facilitated by prompt-based content generated by large language models (LLMs), which can identify and manipulate facts. We outline our methodology, highlighting both the capabilities and limitations of this approach. Additionally, this effort has resulted in new quality synthetic data that can be incorporated into the standard FA-KES dataset.

Keywords
fake news, synthetic data, fact extraction, fact verification, large language models

1 Introduction
Synthetic data refers to artificially generated data that is not obtained by direct measurement or observation of real-world events. Instead, it is created using algorithms and simulations. The primary purpose of synthetic data is to provide a realistic alternative to real data for various use cases, such as training machine learning models, testing systems, ensuring data privacy, and more.

We will generate synthetic data from news articles. By making sure that the information in the news is changed, we can safely call the result fake news. In this article, fake news will denote articles that are intentionally and verifiably false [4]. Synthetic data enhances model training by providing additional examples to supplement scarce labeled datasets and allows for privacy-conscious testing without real content manipulation. It enables adaptability to evolving fake news tactics by simulating diverse scenarios from the newest data, thereby improving the robustness and resilience of detection algorithms [3].

Large language models (LLMs) have made a huge difference in the world of news. Fake news is now much easier and cheaper to construct, but we also have additional methods to help us tackle its spread. Numerous articles have appeared trying to partake in this effort. The following are the main scientific contributions of this paper:
(1) A methodology to create synthetic data for fake news using LLMs.
(2) The use of this methodology to adapt the FA-KES dataset with 100 additional synthetic fake news articles (https://github.com/golobluka/Fake-news-generation-from-FA-KES-dataset).

In Section 2, we discuss work that is closely related to our task. Section 3 then outlines the methodology for generating synthetic fake news, culminating in Section 4, where we present the results and introduce some modifications to the methodology. Finally, in Section 5, challenges, capabilities, and potential improvements are considered.

2 Related Work
A wide range of approaches to generating fake synthetic news with LLMs has been developed. In [8], the authors generated huge amounts of fake news and categorized them into multiple categories. LLMs can generate fake news by altering the style to mimic credible sources or using sensationalism to influence perception. They can subtly manipulate content to be perceived as true, blend real and fabricated information to exploit cognitive biases, or create convincing fictional narratives.

In general, when making a dataset, we want a diverse distribution of fake examples.
In our case, we will focus on one type of change, which comes under the umbrella of Content Manipulation. Similar news manipulations can be seen in [7], where the authors use two main techniques. The first extracts a summary from the original text, which preserves the main content and is then changed to produce a fake article. The second asks a question about the article and changes the content of its answer to construct a new article. Our approach is similar in nature to the Question-Answer framework.

Many articles provide fake news detection models made using synthetic data. Most popular are deep neural networks such as BERT [1], but there are also other fact-based approaches for fake news labeling, as in [3]. In [2], GPT4-turbo was used for prompt-driven fake news detection.

3 Methodology
The methodology is divided into four conceptual steps: data collection, characterization of facts, fact extraction, and fact manipulation, as presented in Figure 1.

3.1 Data Collection
The publicly available FA-KES dataset [5], focused on the Syrian war, addresses the deficiency of manually labeled datasets in this domain of news data. It comprises 804 articles sourced from various media outlets. We used 426 articles that were manually labeled as authentic news, but we could just as well use the other (fake) articles.
Figure 1: A methodology to generate synthetic data for fake news detection. The scheme shows the four steps: data collection (articles containing textual and statistical facts), characterization of facts (the seven factual categories listed in Section 3.2), fact extraction (e.g., Name of casualty: Civilians; Cause of death: shelling; Actor: rebels, forces; Place of death: Airbase; Date of death: April 7, 2017), and fact manipulation (each extracted value replaced by a manipulated fact).

3.2 Characterization of Facts
While making the FA-KES dataset, its authors created seven factual categories: (1) Name of casualty or group, (2) Gender or age group, (3) Cause of death, (4) Type, (5) Actor, (6) Place of death, and (7) Date of death.

It is crucial to note that all articles have a similar structure, describing war incidents. This allows us to establish a consistent framework of facts, such as actor and casualty details. We stick to those facts, but generate them differently, employing LLM capabilities for faster and cheaper execution, albeit with a slight reduction in reliability.

3.3 Fact Extraction
We extract facts by constructing prompts for LLMs. The first approach was a few-shot prompt, which gives some examples of the expected output. Later we constructed an additional approach: say we are extracting the fact Place of death with this second technique. We give a detailed description of what should be extracted, and the LLM then reads the article and performs the task solely on this basis. This description is usually longer and contains more context. The issues with fact extraction in general are:
• Some articles lack certain facts or merely imply them. LLMs can identify this, outputting responses such as "No information."
• Longer articles may contain multiple events, each with distinct data such as dates or casualties. This can be managed by creating separate tables for each event or consolidating all events into a single table with various facts.

3.4 Fact Manipulation and Synthetic News Generation
The objective is to modify relevant information without altering the writing style or topic of the article. For this transformation, we used a chain-of-thought prompt, which, for a given fact: 1) changes the fact to another one with a different meaning, and 2) generates a new article based on the altered facts. By changing one fact at a time, quality is improved compared to altering multiple facts simultaneously, as a single fact creates a clearer chain of instructions. LLMs such as Llama3.1:8B often struggle with precise changes in the article, such as modifying implicit references or incorporating new facts. Quality can be improved by carefully adjusting the prompt content.

LLMs are also exceptional at summarization and paraphrasing. Both are used simultaneously with changing the facts. The problem is that we aim to maintain the extracted facts when summarizing. This is not crucial, however, as summarization usually yields better results than article generation.
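A sketch of such a two-step chain-of-thought prompt, assuming the ollama Python client with a locally pulled llama3.1:8b model; the prompt wording and the function name are illustrative, not the exact prompts used in the experiments:

```python
import ollama  # assumes a running local Ollama server

MANIPULATION_PROMPT = """You are given a news article and one extracted fact.
Step 1: Replace the fact below with a plausible fact of the same type
but with a different meaning.
Step 2: Rewrite the article so that it is consistent with the new fact,
keeping the writing style, length and all other facts unchanged.

Fact ({fact_name}): {fact_value}

Article:
{article}
"""

def manipulate_fact(article: str, fact_name: str, fact_value: str) -> str:
    """Return an article rewritten around one manipulated fact."""
    response = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": MANIPULATION_PROMPT.format(
            fact_name=fact_name, fact_value=fact_value, article=article)}],
    )
    return response["message"]["content"]
```

Applying this function once per selected fact keeps each manipulation as its own clear chain of instructions, which is the rationale given above for changing one fact at a time.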
3.5 Fake News Annotation and Fact Verification
After we have generated the fake articles, we can label the data as "fake" or "non-fake" based on a comparison with the extracted facts. We performed this labeling with various models and compared the labeling performance to get the best model; in this experiment we decided on Llama3.1. To do the labeling, we perform fact verification [4]. The fact verification task in general is deciding whether a claim is correct, based on the explicitly available evidence, such as Wikipedia articles or research papers. We have the extracted fact, which is compared to the article content. The question thus becomes: do these facts appear in the given article? This approach emphasizes factual content rather than the overall sentiment of the article.

There are two primary types of prompts: 1) direct prompts that present the article and a table of facts, asking if the facts relate to the article; 2) structured prompts that inquire about the correspondence of one fact at a time with the article. The question is: does this fact correspond to the content of the article? This method combines individual results into an aggregated score. Say the Place of death is characterized as Idlib and Daraa provinces. Then the question posed to the LLM is of the form:

Read the article and understand its places of death. Do Idlib and Daraa provinces "really correspond" to places of death in the article?

We are not as interested in the labeling itself as in the quality of the produced synthetic fake news. For this purpose, we also use fact verification in a slightly different way: we ask the LLM whether the factual changes in the fake news were really made as they were supposed to be. A similar method is used in [7].
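The structured, one-fact-at-a-time verification prompt and its aggregation into a score might look as follows (again a sketch under the same ollama assumption; the YES/NO protocol and the scoring function are illustrative):

```python
import ollama  # assumes the same local Ollama setup as above

VERIFY_PROMPT = """Read the article, then answer with a single word, YES or NO.
Does the following fact really correspond to the content of the article?

Fact ({fact_name}): {fact_value}

Article:
{article}
"""

def verification_score(article: str, facts: dict[str, str]) -> float:
    """Fraction of facts the LLM judges to correspond to the article."""
    hits = 0
    for name, value in facts.items():
        answer = ollama.chat(
            model="llama3.1:8b",
            messages=[{"role": "user", "content": VERIFY_PROMPT.format(
                fact_name=name, fact_value=value, article=article)}],
        )["message"]["content"]
        hits += answer.strip().upper().startswith("YES")
    return hits / len(facts)
```

Running the score once with the original facts and once with the manipulated ones indicates whether the intended changes were actually carried into the generated article.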
4 Experimentation and Results

4.1 Experimental settings
We selected the 426 articles labeled as authentic news from the FA-KES dataset. Facts were then extracted and transformed, as described in the previous section. At first, two basic approaches were used to randomly choose 70 news articles and transform them. Afterward, we used the labeling procedure to compare performance, resulting in Table 1. Based on the results, we then composed the final algorithm, which was manually evaluated.

4.2 Evaluation
For every experiment, we first manually checked a minimum of 10 percent of random examples to get an overview of how well the LLM was able to do the job. It is quite useful to print text that represents the decision-making procedure the LLM undertakes when challenged with the task. It was even helpful to see the LLM's generated thinking procedure, as this gives valuable insight into what is going on "under the hood". We believe that manual fact-checking is the first and most crucial step in generating good prompts. Based on the observed fallacies, one can then adjust the prompt content. To shed some light on this procedure, we give the following overview.

Table 1: Comparison of fake synthetic data.
Type of data   | Number of facts manipulated | Precision | Recall | F1   | Accuracy
Summarization  | 2/7                         | 0.74      | 0.63   | 0.68 | 0.71
Detailed facts | 2/7                         | 0.70      | 0.80   | 0.75 | 0.73

4.3 Fact Extraction Results
LLMs are capable of recognizing different topics and extracting words that correspond to a given topic, and also of noting when a fact is not mentioned. At first, we extracted short phrases, as represented in Figure 2.

Name of casualty or group: Members of Nusra Front
Gender or age group: Adults (no specific age mentioned)
Cause of death: Explosion at a mosque
Type: Non-civilian (militants)
Actor: Unknown (no group claimed responsibility, but supporters blamed ISIS)
Place of death: Ariha, Idlib province, Syria
Date of death: Not specified in the article

Figure 2: Example of fact extraction.

The issue begins with nuances. For example, in many articles the Actor is only suspected but not known. In some cases, actor and causality are not precisely distinguished. This usually leaves the LLM to some kind of arbitrariness. For this purpose, we also added a longer description that better captures the nuanced subtleties related to the facts. This can also be seen in Table 1, which shows the results for short (normal) and detailed extracted facts. The recall is far worse in the case of short prompts. This likely means that there is an abundance of false negatives, which results from the fact that the labeling does not manage to match true articles to their corresponding short facts.

The shorter extracted facts are often not comprehensive. For example, under the label Type (which classifies civilian or non-civilian casualties) the model writes only "civilians", even though contextual understanding also includes some non-civilian casualties.
Overall, the most important insight remains: fact extraction has better quality than article generation.

4.4 Quality and Coherence of Synthetically Generated Fake News
The LLM can detect (for example) the Actor of some attack in the news, and it is then mostly able to change every occurrence of this Actor to another Actor. But if we would like to preserve the full coherence of the article, much more would need to be done. News usually contains background information that provides context for the incident. Our algorithms failed to properly adjust this context, leaving it unchanged in most cases. Our fake news thus fails to preserve enough coherence to be trusted by a skeptical reader who tries to connect the background material to the event in the article.

Generating false text while maintaining coherence is challenging for an LLM. In this task, we have changed one fact: for example, the Place of death may be changed to another city or neighborhood. This fact must then be changed throughout the article while maintaining the other factual information. The main issues are:
• In the beginning, some facts did not get changed, or the facts were altogether removed from the article. We managed to reduce this error by adjusting the prompt. It is difficult to adjust all occurrences of a fact, especially if it is only implied and not explicitly stated. We managed to minimize this problem with a method shown in Section 4.5.
• What remains is the problem of wider context. Suppose we change the town of the incident; then we must change the name of the neighborhood accordingly. The LLM usually fails at this, leaving the article inconsistent, which is a widespread problem.
• The LLM sometimes refuses to output the content because it judges it harmful, or does not want to produce articles that could be used with illegal intent. This was quite a common problem, which is also reasonable given the violent content of the articles and the possible abuse of LLM-generated content. The best way to prevent this error is to use an uncensored LLM; in other cases, one can adjust the prompts by removing suspicious words like "fake news".
• The generated article was shorter, skipping original text that was not linked to the extracted facts. This problem was reduced but still exists in long articles.
• If a fact is not present in the article, it is hard for the LLM to incorporate a new fictitious fact into the text. Mainly it just adds the information in separate sentences.
• When we change facts, traces of the old facts still persist. This is especially common in complicated articles with diverse structures.
• Sometimes the change does not bring about any additional meaning. For example, the LLM might change previously unknown casualties and designate them as civilians. They were implied to be civilians all along, so this makes only a minor change and is not really fake.

4.5 Fact Verification with LLMs
Recall that in this task, the prompt asks: does this fact "really correspond" to the content of the article? Performance largely depends on how the model interprets the phrase "really correspond". Words have many nuances: different words can have different meanings, which can complicate labeling. To simplify: we can be stricter, in the sense that words must be the same in the literal sense, or we can rely on similarity of meaning [6]. Based on our goal of creating fake news, it is best to focus on meaning and not on concrete words.
Some common problems are:
• Sometimes the fact is changed, but the LLM skeptically assumes that the two names refer to the same group.
• In longer articles with many events, the names get changed only in some events (usually at the beginning of the article). In this case, the LLM can make unwanted predictions, labeling the fact as true rather than false.

Manual checking shows that labeling is more accurate than the generation of fake news. This leads us to use labeling as a means to improve article generation.

Table 1 was used to compare different ways of generating fake news. It shows the two best datasets, which contain true articles and their false twins, generated in two ways:
(1) fake news generated by "standard" fact extraction, with additional summarization;
(2) fake news generated by "detailed" fact extraction, with additional paraphrasing of the article.

In this experiment, instead of merely categorizing the articles as true or false, the results shown in Table 1 reflect how well the generation process aligns with fact verification.

Low precision in the row with detailed facts led us to detect articles that were not changed. We implemented a strategy where labeling was applied after generating the fake articles to assess the quality of the generation. LLMs often provide incomplete responses and struggle to correct them directly. By introducing an additional verification step, we were able to enhance the overall accuracy of the results.

4.6 Final Dataset Description
In the end, we constructed 100 fake news articles based on the prior experiments; they can be found on GitHub (https://github.com/golobluka/Fake-news-generation-from-FA-KES-dataset). In every article we randomly chose three facts and changed them. Afterward, we carefully went through 10 examples, which are also present on GitHub, while here we present only the main points:
• Fact verification improved quality by making sure that the synthetic fake article really incorporated the new information. More than 90% of new facts really got incorporated into the articles. Sometimes new information is only added as additional text (and does not seriously change the main topic).
• A fact is not always incorporated in all places where it is referenced, which leads to inconsistencies. The new article is then a blend of old and new information.
• There are problems with "detailed" prompts. Containing more information results in contradictions, as we change only one fact at a time.

5 Conclusion
In this article, we focused on exploring the potential of LLMs in fact extraction and the generation of fake news. Our motivation was primarily to understand how accurate LLMs are in fact extraction and how reliably LLMs generate synthetic news by altering facts. As a result of our experiments, we have generated 100 synthetic news articles by randomly transforming three out of seven facts, and we have performed a manual evaluation to observe the quality of the generated news dataset.

5.1 Problems, Capabilities and Possible Improvements
• At this stage, LLMs like Llama3.1:8B are not able to coherently change certain facts of news articles. Changing facts can distort the article content, which appears to be extremely hard to manage. This normally does not happen for manageable data such as dates (changing the time of some event), but it does for much more involved facts such as the actors of the attack in the article. Even so, the synthetic fake news provides valuable information.
• We did not use a model that has additional information about the news content. Providing additional context would likely have a beneficial effect on all the processes.
• In our case, facts were largely dependent on each other. For example, Gender or age group is an extraction of Name of casualty or group. We think it is best if such dependencies are removed, because they lead to inconsistencies when changing facts. An additional solution would be to change Gender or age group whenever Name of casualty or group is changed.
• Fact extraction is close to human-like quality. The issue is that, besides manual checking, it is hard to find a good measure of the quality of extracted facts.
• Detection of changed facts is similar in quality to extraction of facts (this is not surprising, since they are based on the same skill). Because of the diversity of meanings in language, it is hard to specify the exact reasoning procedure of LLMs, and many mistakes come from this kind of miscommunication.

6 Acknowledgments
This work was supported by the European Union through the AI4Gov (101094905) and TWON (101095095) EU HE projects and the Slovenian national grant (CRP V2-2272).

References
[1] Nicola Capuano, Giuseppe Fenza, Vincenzo Loia, and Francesco David Nota. 2023. Content-based fake news detection with machine and deep learning: a systematic review. Neurocomputing, 530, 91–103. doi: 10.1016/j.neucom.2023.02.005.
[2] Fredrik Jurgell and Theodor Borgman. 2024. Fake news detection: using a large language model for accessible solutions.
[3] Ye Liu, Jiajun Zhu, Kai Zhang, Haoyu Tang, Yanghai Zhang, Xukai Liu, Qi Liu, and Enhong Chen. 2024. Detect, investigate, judge and determine: a novel LLM-based framework for few-shot fake news detection. arXiv: 2407.08952 [cs.CL]. https://arxiv.org/abs/2407.08952
[4] Taichi Murayama. 2021. Dataset of fake news detection and fact verification: a survey. arXiv: 2111.03299 [cs.LG]. https://arxiv.org/abs/2111.03299
[5] Fatima K. Abu Salem, Roaa Al Feel, Shady Elbassuoni, Mohamad Jaber, and May Farah. 2019. FA-KES: a fake news dataset around the Syrian war. In Proceedings of the International AAAI Conference on Web and Social Media. Vol. 13, 573–582.
[6] Abdul Sittar, Dunja Mladenic, and Tomaž Erjavec. 2020. A dataset for information spreading over the news. In Proceedings of the 23rd International Multiconference Information Society SiKDD. Vol. 100, 5–8.
[7] Yanshen Sun, Jianfeng He, Limeng Cui, Shuo Lei, and Chang-Tien Lu. 2024. Exploring the deceptive power of LLM-generated fake news: a study of real-world detection challenges. arXiv: 2403.18249 [cs.CL]. https://arxiv.org/abs/2403.18249
[8] Lionel Z. Wang, Yiming Ma, Renfei Gao, Beichen Guo, Zhuoran Li, Han Zhu, Wenqi Fan, Zexin Lu, and Ka Chung Ng. 2024. MegaFake: a theory-driven dataset of fake news generated by large language models. arXiv: 2408.11871 [cs.CL]. https://arxiv.org/abs/2408.11871

Borrowing Words: Transfer Learning for Reported Speech Detection in Slovenian News Texts

Zoran Fijavž, Jožef Stefan International Postgraduate School; Peace Institute, Ljubljana, Slovenia, zoran.fijavz@mirovni-institut.si

Abstract
This paper describes the development of a reported speech classifier for Slovenian news texts using transfer learning. Due to a lack of Slovenian training data, multilingual models were trained on English and German reported speech datasets, reaching an F-score of 66.8 on a small manually annotated Slovenian news dataset, and a manual error analysis was performed. While the developed model captures many aspects of reported speech, further refinement and annotated data would be needed to reliably predict less frequent instances, such as indirect speech and nominalizations.

Keywords
reported speech, natural language processing, transfer learning, news analysis

1 Introduction
Reported speech, ubiquitous in literary and news texts, has clear lexical and syntactic patterns which may be reliably modeled via natural language processing (NLP) and may be useful for downstream tasks by drawing a distinction between source and background information. This paper applies transfer learning to extend reported speech classification to Slovenian news texts and provides a provisional classification model. A manual error analysis reveals the model's strengths and weaknesses, highlighting possible steps for further improvements.

2 Related Work

2.1 Role of Reported Speech
Reported speech is common in news texts, generally expressed as direct or indirect speech, with the former repeating the original utterance verbatim and the latter embedding it in a that-clause [18] (e.g., Jimmy said: "Another systematic review would be great!" versus Jimmy said that another systematic review would be great.). More complex forms include mixed speech (City officials rebuffed the accusations as "groundless and blatantly false".) and reportative nominalizations with a function analogous to reported speech (The speaker particularly emphasized the pressures on the media) [7]. Around 50% of sentences in newspaper corpora may be attributed to a source in the text, predominantly through direct and indirect speech [17]. Verbs cue 96% of reported speech, followed by prepositional phrases (3%) [13]. Reported speech lends objectivity to statements [9], summarizes source statements [16], and is used in discourse analysis and communication studies to explore speaker representation by gender [1], institutional affiliations [8], and topic stances [15], or to distinguish between journalists' and sources' voices [11].

2.2 Existing Datasets and Modelling Approaches
Datasets with reported speech annotations mostly contain literary or news texts. Key corpora include RiQuA [12], SLäNDa 2.0 [19], Redewiedergabe [3], QUAC [14], PolNeAR [10], Quotebank [21], and STOP [22]. RiQuA and Redewiedergabe are the largest annotated corpora, covering English and German 19th-century texts. QUAC contains 212 annotated articles from the Portuguese newspaper Público, while Quotebank spans 162 million news articles with automatic annotations. PolNeAR, consisting of 1,028 news articles, includes attribution annotations, which include and exceed the definition of reported speech. A summary of the datasets is provided in Table 1.

The corpora differ in annotation complexity and size. They are mostly monolingual, which warrants the cross-lingual transfer learning used here for low-resource languages, employing multilingual models such as mBERT [6] and XLM-R [4]. Narrower multilingual models, such as CroSloEngual BERT, often outperform broader ones [20]. Reported speech modeling may be operationalized as a speaker or quotation detection task [23, 17]. Simplifying the task to sentence-level classification is warranted by the fact that news texts (unlike literary ones) rarely mix statements by sources and authors in the same sentence; it can improve classification reliability at the expense of detailed aspects of reported speech [17] and simplifies the annotation structure.
Missing fine-grained outputs, such as speakers and the boundaries of reported and reporting clauses, may thus be an acceptable trade-off for NLP-based content analysis of news texts. A systematic review of such approaches points to the limits resulting from a low number of features with no guarantee of reliable (joint) prediction, which preclude drawing the rich conclusions expected from the method's manual counterpart [2].

3 Experimental Setting

3.1 Task Overview
We treated reported speech detection as a sentence-level classification task. Sentence splitters were applied to the existing datasets, and binary labels were assigned by matching annotated spans with the split sentences. Reported speech sub-types were unified under a single label, joining the annotation schemes of the individual datasets. A Slovenian dataset of 10 news texts was manually annotated at the sentence level. The datasets were split into training, evaluation, and test sets to train multilingual pretrained models. For CroSloEngual BERT, preprocessing also involved machine translating the German training data into English. The model outputs were binary labels indicating reported speech, used to calculate F-scores on the test data. A manual error analysis was performed on the best model's outputs for Slovenian. The preprocessing, training, and evaluation steps are visualized in Figure 1.
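A minimal sketch of this label construction step, assuming character-offset span annotations and NLTK's sentence splitter (the splitter choice and function name are ours; any sentence splitter would do):

```python
import nltk  # nltk.download("punkt") may be required on first use

def sentence_labels(text: str, spans: list[tuple[int, int]]):
    """Assign 1 to every sentence overlapping an annotated reported-speech span."""
    labels = []
    offset = 0
    for sent in nltk.sent_tokenize(text):
        start = text.index(sent, offset)
        end = start + len(sent)
        offset = end
        # Positive if any annotated (start, end) span overlaps this sentence.
        overlaps = any(s < end and e > start for s, e in spans)
        labels.append((sent, int(overlaps)))
    return labels
```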
Table 1: Summary of Datasets' Characteristics.
Corpus | Type | Annotations | Language | Sentence No. | Role | Positive Class
RiQuA | fiction | direct and indirect speech, cues, speakers, addressees | English | 38,610 | 72% train, 18% development, 10% test | 48%
Redewiedergabe | fiction, news | direct, indirect, free indirect and reported speech, speaker, cues | German | 24,033 | 76% train, 16% development, 9% test | 33%
Quotebank (manual) | news | speaker, direct speech | English | 9,071 | test | 30%
QUAC | news | speaker, direct speech | Portuguese | 11,007 | test | 11%
PolNeAR | news | speaker, cues, attributions | English | 34,153 | test | 59%
Slovenian parliamentary news | news | sentence-level binary labels | Slovenian | 744 | test | 43%

Figure 1: Flowchart of Data Preprocessing, Model Training and Evaluation Processes for Sentence-Level Reported Speech Classification.

3.2 Training and Test Data
Our experiments were based on existing annotated reported speech datasets and a small Slovenian dataset. The training data included sections from RiQuA and Redewiedergabe, both large datasets with labels for direct and indirect speech. For CroSloEngual BERT training, the Redewiedergabe data was machine translated into English. Testing was conducted on the test sections of RiQuA and Redewiedergabe, the entire Portuguese corpus QUAC, and the manually annotated portion of the English Quotebank corpus. Additionally, we manually annotated 10 Slovenian news articles from RTV Slovenia. The datasets are summarized in Table 1.

The Slovenian dataset comprised 10 parliamentary news texts covering various reporting strategies. The retrieved articles were split into sentences and annotated. Sentences were considered reported speech if they included direct or indirect speech cued by a reporting clause or prepositional phrase. We excluded nominalizations and phrasal quotes (e.g., They emphasized the pressures on the media and the "illegal non-funding of the Press Agency.") as well as implied quotes (e.g., There will be more than 300,000 recipients, he emphasized. 169 million euros will have to be paid out.).

3.3 Evaluation Procedure
The models' performance on the test datasets was measured with the F-score. A baseline of assigning a positive label to all examples was calculated for all test datasets. The models' results on the test datasets were compared with a Friedman test, as suggested in the literature [5].

The best Slovenian model's predictions were reviewed through close reading. The error typology consisted of direct speech, indirect speech, speech fragments, annotation errors, and unrelated and other tags. Direct speech fragments were sentences that are part of multi-sentence direct speech quotations. Annotation errors were examples with annotations inconsistent with the definition described in Section 3.2. For unrelated examples, close reading revealed no clear misclassification cause. Other was used for examples that did not fit any of the mentioned categories.
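The comparison can be computed with scipy's friedmanchisquare. The sketch below compares three of the models across the six test datasets, using the scores later reported in Table 2; it illustrates the mechanics only and is not the exact grouping behind the statistic reported in Section 4.1:

```python
from scipy.stats import friedmanchisquare

# F-scores across the six test datasets (Redewiedergabe, RiQuA, PolNeAR,
# QUAC, Quotebank, Slovenian), one sequence per model, taken from Table 2.
mbert_both = [77.5, 77.4, 73.1, 40.5, 53.5, 63.2]
xlmr_both  = [80.5, 77.6, 70.0, 38.8, 57.7, 63.2]
cse_both   = [54.0, 76.6, 73.0, 24.0, 52.5, 66.8]

# Each argument is one model; entries are matched by test dataset.
stat, p = friedmanchisquare(mbert_both, xlmr_both, cse_both)
print(f"Friedman chi-squared = {stat:.2f}, p = {p:.3f}")
```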
3.4 Training Settings
XLM-R and mBERT were used as base models with the default training settings from the transformers library, with the exception of using 16 gradient accumulation steps and freezing the bottom 8 layers of all models. The latter reduces the training time without significant performance drops (Kovaleva et al., 2019; Merchant et al., 2020). Additionally, a Slovenian-Croatian-English BERT model (CroSloEngual BERT) was trained on the English machine-translated data from Redewiedergabe.
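A sketch of this training configuration for one of the base models; the model name follows the standard transformers naming, while train_ds and dev_ds stand for already tokenized datasets (our assumptions):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)  # binary label: reported speech or not

# Freeze the bottom 8 encoder layers, as described above.
for layer in model.roberta.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

args = TrainingArguments(
    output_dir="reported-speech-classifier",
    gradient_accumulation_steps=16,  # the other stated deviation from defaults
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=dev_ds)
trainer.train()
```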
4.2 Error Analysis Results
The results from CroSloEngual BERT on Slovenian data were analyzed further. False positives were more common than false negatives, representing 23.4% and 9.8% of all examples (n = 744), respectively. Close reading of a sample of 100 false positives did not show a definite pattern for most (72.9%) of them. These examples were clearly unrelated to reported speech, although some did include words lexically related to reporting verbs (e.g., The proposed law is still under discussion). The second category of false positives were nominalizations of reported statements (13.1%), not included in our annotation schema. The final source of false positives were annotation errors consisting of wrongly unmarked examples of direct or indirect speech (9.1%). The distribution of categories identified in the sample of false positives is illustrated in Figure 2.

The most common errors among the 73 false negative examples were instances of indirect speech (34.2% of false negatives) and prepositional cueing of reported speech (27.4%). The remainder were instances of direct speech, direct speech fragments, and annotation errors, representing 11%, 8.2%, and 9.6% of the false negatives, respectively. The annotation errors included nominalizations and statements reported as adjective complements (The speaker was happy that the provisions were accepted), not included in our annotation schema. Figure 3 summarizes the identified false negative categories.

Figure 2: False Positives from the CroSloEngual BERT Classifier.

Figure 3: False Negatives from the CroSloEngual BERT Classifier.

5 Discussion
This paper presents the development of a reported speech classifier, tested on a small annotated Slovenian dataset with a manual error analysis. Cross-lingual transfer learning from the annotated RiQuA and Redewiedergabe datasets achieved an F-score of 66.8 on a small manually annotated dataset of Slovenian news on parliamentary sessions, using the base CroSloEngual model with RiQuA and English machine-translated Redewiedergabe training data¹. These results corroborate the observation that language models trained on a limited number of languages may outperform less specialized ones such as mBERT and XLM-R [20].

The major source of errors were false positives (23.4% of all sentences), for which no systematic pattern was discernible in the majority (72.9%) of examples. Instances of indirect speech and prepositional cueing of statements were overrepresented among the false negatives, accounting for 61.6% of them. Although rare, nominalizations were present in both false positives and false negatives and should be considered in future annotation guidelines. These observations indicate that reported speech classifiers may benefit from approaches for addressing imbalanced classes.

¹ The fine-tuned CSE model is available on the Hugging Face Hub under the name zo-fi/rep-sp-CSE-rwg-riq.

6 Conclusion
This study developed a sentence-level reported speech classifier for Slovenian news texts using cross-lingual transfer learning. By leveraging existing multilingual models (mBERT, XLM-R, and CroSloEngual BERT) with the English and German datasets RiQuA and Redewiedergabe, we demonstrated that sentence-level classification can detect some aspects of reported speech in Slovenian. However, the performance estimates are limited due to the small size of the Slovenian testing set and the limited definition used for the annotations. Future research should focus on developing a Slovenian annotated dataset, refining the annotation schema for multiple use cases, and exploring additional modeling features such as encoding broader sentence contexts. This work contributes a provisional tool for computational discourse analysis of Slovenian media texts. Further development is necessary for its application in more nuanced tasks.

Acknowledgements
This work was supported by the Slovenian Research Agency grants via the core research programs Equality and Human Rights in the Times of Global Governance (P5-0413) and Hate Speech in Contemporary Conceptualizations of Nationalism, Racism, Gender and Migration (J5-3102).
References
[1] Fatemeh Torabi Asr, Mohammad Mazraeh, Alexandre Lopes, Vasundhara Gautam, Junette Gonzales, Prashanth Rao, and Maite Taboada. 2021. The Gender Gap Tracker: Using Natural Language Processing to measure gender bias in media. PLOS ONE, 16, 1, e0245533. doi: 10.1371/journal.pone.0245533.
[2] Christian Baden, Christian Pipal, Martijn Schoonvelde, and Mariken A. C. G. van der Velden. 2022. Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda. Communication Methods and Measures, 16, 1, 1-18. doi: 10.1080/19312458.2021.2015574.
[3] Annelen Brunner, Stefan Engelberg, Fotis Jannidis, Ngoc Duyen Tanja Tu, and Lukas Weimer. 2020. Corpus REDEWIEDERGABE. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC 2020). European Language Resources Association, 803-812. https://aclanthology.org/2020.lrec-1.100.
[4] Alexis Conneau et al. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). Association for Computational Linguistics, 8440-8451. doi: 10.18653/v1/2020.acl-main.747.
[5] Janez Demšar. 2006. Statistical Comparisons of Classifiers over Multiple Data Sets. The Journal of Machine Learning Research, 7, 1-30.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4171-4186. doi: 10.18653/v1/N19-1423.
[7] Gabriel Dvoskin. 2020. Reported speech and ideological positions: the social distribution of knowledge and power in media discourse. Bakhtiniana: Revista de Estudos do Discurso, 15, 193-213.
[8] Zoran Fijavž and Darja Fišer. 2021. Citatnost in reprezentacija v spletnem migracijskem diskurzu. In Sociolingvistično iskrenje. Maja Bitenc, Marko Stabej, and Andrejka Žejn, editors. Založba Univerze v Ljubljani. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/259/370/6011.
[9] Elizabeth Holt. 1996. Reporting on Talk: The Use of Direct Reported Speech in Conversation. Research on Language and Social Interaction, 29, 3, 219-245. doi: 10.1207/s15327973rlsi2903_2.
[10] Edward Newell, Drew Margolin, and Derek Ruths. 2018. An Attribution Relations Corpus for Political News. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association. https://aclanthology.org/L18-1524.
[11] Mojca Pajnik and Marko Ribać. 2021. Medijski populizem in afektivno novinarstvo: časopisni komentar o »begunski krizi«. Javnost - The Public. https://www.tandfonline.com/doi/abs/10.1080/13183222.2021.2012943.
[12] Sean Papay and Sebastian Padó. 2020. RiQuA: A Corpus of Rich Quotation Annotation for English Literary Text. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC 2020). European Language Resources Association, 835-841. https://aclanthology.org/2020.lrec-1.104.
[13] Silvia Pareti, Tim O'Keefe, Ioannis Konstas, James R. Curran, and Irena Koprinska. 2013. Automatically Detecting and Attributing Indirect Quotations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). Association for Computational Linguistics, 989-999. https://aclanthology.org/D13-1101.
[14] Marta Ercília Mota Pereira Quintão. 2014. Quotation Attribution for Portuguese News Corpora. https://www.semanticscholar.org/paper/Quotation-Attribution-for-Portuguese-News-Corpora-Quint%C3%A3o/69fea7d030d5e71b973ec67aa897a7c9aadadac2.
[15] Masaki Shibata. 2023. Dialogic Positioning on Pro-Whaling Stance: A Case Study of Reported Speech in Japanese Whaling News. Japanese Studies, 43, 1, 71-90. doi: 10.1080/10371397.2023.2191839.
[16] Michael Short. 1988. Speech presentation, the novel and the press. In The Taming of the Text. Willie Van Peer, editor. Routledge.
[17] Alexander Spangher, Nanyun Peng, Jonathan May, and Emilio Ferrara. 2023. Identifying Informational Sources in News Articles. doi: 10.48550/ARXIV.2305.14904.
[18] Stef Spronck and Daniela Casartelli. 2021. In a manner of speaking: how reported speech may have shaped grammar. Frontiers in Communication, 6, 624486.
[19] Sara Stymne and Carin Östman. 2022. SLäNDa version 2.0: Improved and Extended Annotation of Narrative and Dialogue in Swedish Literature. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022). European Language Resources Association, 5324-5333. https://aclanthology.org/2022.lrec-1.570.
[20] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT. In Text, Speech, and Dialogue (Lecture Notes in Computer Science). Springer International Publishing, Cham, 104-111. doi: 10.1007/978-3-030-58323-1_11.
[21] Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West. 2021. Quotebank: A Corpus of Quotations from a Decade of News. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM '21). ACM, 328-336. doi: 10.1145/3437963.3441760.
[22] M. Wynne. 1996. Speech, Thought and Writing Presentation Corpus. https://ora.ox.ac.uk/objects/uuid:6caa73c1-d283-4d51-a78f-55df69bae986.
[23] Dian Yu, Ben Zhou, and Dong Yu. 2022. End-to-End Chinese Speaker Identification. In Proceedings of NAACL-HLT 2022. Association for Computational Linguistics, 2274-2285. doi: 10.18653/v1/2022.naacl-main.165.

What kind of ESG is profitable? Connecting company performance to ESG terms in financial reports

Luka Andrenšek, Jožef Stefan Institute, Ljubljana, Slovenia (trovato@corporation.com)
Katarina Sitar Šuštar, University of Ljubljana, Ljubljana, Slovenia (katarina.sitar@ef.uni-lj.si)
Senja Pollak, Jožef Stefan Institute, Ljubljana, Slovenia (senja.pollak@ijs.si)
Matthew Purver, Jožef Stefan Institute, Ljubljana, Slovenia (matthew.purver@ijs.si)

ABSTRACT
In this paper, we examine the relationship between the discussion of Environmental, Social and Governance (ESG) in companies' annual financial reports and their financial performance. Specifically, we analyse the companies' use of specific ESG terms alongside the performance metric, sector-normalized Return on Assets (ROA). Our motivation is to determine whether companies frequently mentioning terms such as "gender", "equality", "talent", and "innovation" in their reports demonstrate a higher annual ROA compared to those that rarely use these terms. To explore this, we used existing datasets with reports and performance metrics from 348 companies, covering the years from 2009 to 2021. In order to better examine differences, we then selected companies whose ROA significantly differed from the average (either higher or lower), allowing for a more pronounced examination of the impact of ESG term usage on financial performance. The filtered dataset consisted of 107 companies, with a total of 427 reports, split into two sections representing higher and lower performing companies. We then used an existing list of ESG terms derived from a range of separate data sources, and applied a basic statistical n-gram language model to extract the probabilities of each ESG term's occurrence in each of the higher- and lower-performing dataset sections. Results show that while certain sets of ESG concepts correlate with higher financial performance, others do the opposite; we give some initial interpretation of the light this sheds on company reporting behaviour.

KEYWORDS
financial report analysis, language modelling, environmental, social and governance reporting
1 INTRODUCTION & RELATED WORK
There is increasing interest in the behaviour of companies in the area of Environmental, Social and Governance (ESG) criteria, including a company's environmental impact (Environmental), relationships with the community including employees, suppliers and customers (Social), and leadership structures including executive pay and shareholder rights (Governance). Although until recently ESG analyses were almost entirely performed manually by experts (see e.g. [10]), there has been a large amount of work in the last few years on applying computational machine learning and statistical methods to ESG analysis (see e.g. the recent review by Lim [9]).

However, much of this analysis examines numerical company performance data and categorical metadata; our interest is in developing and applying natural language processing (NLP) technologies to not only help automate analyses, but allow understanding of how human actors discuss and understand the importance and meaning of ESG aspects.

Application of NLP in finance is not new: for example, topic modelling has been used to predict company performance and investigate strategies [14, 7]. Recent work also includes application to ESG aspects: Nugent et al. [12] automatically extract news about ESG controversies, and Lee et al. [8] analyse sentiment on ESG issues. Closer to our interests, Purver et al. [13] investigated how the use of ESG terms by companies has changed over time. By analysing and annotating a set of existing resources, they defined a set of 93 ESG terms categorised into 5 core ESG areas. They then showed how these terms can be used to analyse changes in reporting, by analysing a collection of company annual reports collated over a period of 8 years, using language modelling and distributional methods to reveal changes in the frequency and usage of the ESG terms.

Here, we are interested not in changes in ESG discussion over time, but in whether and how the reporting of ESG aspects is connected to financial performance. We take Purver et al. [13]'s resources and methods as a starting point, but augment the financial report text data with available metadata on financial performance, allowing us to compare how ESG reporting varies between more and less well-performing companies.

2 DATA AND METHODS

2.1 Hypotheses
In general, we expect an increased probability of appearance of ESG terms in the annual reports of the more profitable firms, based on a number of factors. Overall, high ESG performing companies exhibit high financial performance [1, 5], although we note that the link between high ESG score performance and mention of ESG terms is not guaranteed to be straightforward. More specifically, during the period between 2010-2020 analysed here, there was a growing emphasis on corporate social responsibility (CSR) and sustainability. Investors, consumers, and other stakeholders increasingly prioritised companies that demonstrated a commitment to innovation, diversity, and environmental sustainability [11, 2]. Busru and Shanmugasundaram [3] find that firms closely engaging in fostering innovation, attracting top talent, and promoting gender and diversity initiatives could confer a competitive advantage over industry peers. Furthermore, some policy and regulatory changes (e.g. the 2018 UK Corporate Governance Code, the 2014 EU Directive on Non-Financial Reporting, and the Carbon Disclosure Project (CDP)) directly or indirectly encouraged companies to address issues related to diversity, gender equality, and environmental sustainability.
2.2 Data and pre-processing
To test this hypothesis, we build on the resources and methods of Purver et al. [13], who provide a dataset of annual reports from FTSE350 companies over the years 2012-2019, based on the FTSE350 list as of 25th April 2020 and obtained from the publicly accessible collection at www.annualreports.com. The reports are already converted to plain text, and we use their publicly available tools to tokenize the collection into words and build n-grams of length 1-4 padded with sentence start and end symbols; the dataset size is reported in Table 1 below (taken from [13]). We use their set of ESG terms, defined via a process of extracting candidate terms from a set of public ESG definitions and taxonomies, asking financial expert annotators to label them as to their representativeness as ESG terms and their ESG subcategory, and keeping the terms with high inter-annotator agreement (see [13] for details).

Table 1: Number of annual reports available by year.

Year | # Reports | # Words
2012 | 178 | 12.5M
2013 | 181 | 14.0M
2014 | 184 | 15.0M
2015 | 196 | 16.3M
2016 | 198 | 17.5M
2017 | 200 | 18.4M
2018 | 200 | 19.6M
2019 | 202 | 21.2M
total | 1539 | 134.6M

2.3 Financial performance analysis
The reports were then linked to financial indicators for the respective year and company. The data on company fundamentals was obtained from the Refinitiv EIKON Datastream¹. Each entry contained annual financial indicators, as well as the companies' industry and sector codes. The main variable of interest was normalized, averaged return on assets (ROA)², as defined below:

ROA = (NetIncome − BottomLine + (InterestExpenseOnDebt − InterestCapitalized) × (1 − TaxRate)) / AverageOfLastYear'sAndCurrentYear'sTotalAssets

After extracting financial reports with available ROA data, we categorized the financial reports into two groups, in order to examine differences in the associated reports' use of ESG terms. The distribution of ROA shows a heavy concentration around the mean, so in order to derive two distinctive groups we took the two extremes and excluded the central group around the mean. The 'negative' group comprised reports with a yearly ROA less than -0.2, indicating very poor performance. Conversely, the 'positive' group included reports with an ROA of at least 0.2, reflecting very good yearly performance.

Subsequently, we employed a statistical n-gram language model (using NLTK³) to analyze the occurrence of each ESG term. For each term, we calculated the probability of its occurrence in positive reports (p+) and in negative reports (p−), and the difference (p+ − p−). Terms with a large difference in these probabilities are more strongly associated with positive reports than with negative ones, and vice versa: terms with a large negative difference are common in negative reports but rare in positive ones. We conducted this analysis for both unigrams and bigrams.

¹ https://www.refinitiv.com
² We use this normalization and averaging to smooth and remove one-off effects.
³ https://www.nltk.org/
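To make the p+ − p− computation concrete, here is a minimal sketch using NLTK's language-model API, with toy token lists standing in for the tokenized positive and negative report groups (the paper's exact tokenization and padding setup may differ):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

def fit_lm(tokenized_reports, order=2):
    """Fit a plain MLE n-gram model on one group of tokenized reports."""
    train, vocab = padded_everygram_pipeline(order, tokenized_reports)
    lm = MLE(order)
    lm.fit(train, vocab)
    return lm

# Toy stand-ins for the positive- and negative-ROA report groups.
positive = [["innovation", "drives", "our", "growth"],
            ["we", "invest", "in", "talent", "and", "innovation"]]
negative = [["carbon", "emissions", "rose"],
            ["energy", "use", "remains", "high"]]

lm_pos, lm_neg = fit_lm(positive), fit_lm(negative)
# Unigram probability difference p+ - p-; for a bigram such as
# "renewable energy", use lm.score("energy", ["renewable"]) instead.
diff = lm_pos.score("innovation") - lm_neg.score("innovation")
print(f"p+ - p- for 'innovation': {diff:.4f}")
```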
3 RESULTS AND DISCUSSION
The results for 1- and 2-grams are shown in Figures 1 and 2 below⁴ (3- and 4-grams showed no clear interpretable associations). As hypothesized, many ESG terms show a strong association with positive performance, with many of these being core terms associated with human resources (innovation, talent), with social aspects (gender, diversity), environmental aspects (renewable, carbon footprint, environmental impact) and overall ESG descriptors (ethical). However, many terms are conversely (and contrary to our general hypothesis) associated with negative performance, including, again, terms across various ESG categories: environmental (carbon emissions, energy efficiency, greenhouse), human resources (mental health, wellbeing) and general ESG descriptors (governance).

Figure 1: Difference in probability between positive and negative reports, p+ − p−, for the most positive and negative unigram ESG terms.

Figure 2: Difference in probability between positive and negative reports, p+ − p−, for the most positive and negative bigram ESG terms.

However, by combining these terms with recent work in clustering and describing ESG terms [4], we can shed more light on which categories seem to be more positive and which more negative. Ferjancic et al. [4], using the same dataset and ESG term list [13], perform a further topic analysis using BERTopic [6], in which they derive 30 ESG-related topics and 6 higher-level clusters of ESG concepts; they then examine the correlations between these ESG topics and company ESG scores as obtained from external analysts. We align our ESG terms with Ferjancic et al. [4]'s 30 topics by matching against the words most associated with each topic (if a term appears in the top 10 words associated with a topic, we take the term and topic as aligned); we can then compare our positive/negative associations with Ferjancic et al. [4]'s correlations with company ESG scores. Table 2 shows this alignment for our most positive and negative bigram terms, with the topic labels and an indication of the strength and direction of correlation with overall company ESG scores, as given by [4].

Table 2: Selected ESG terms with their ROA correlation direction (+/-), topic according to [4], and topic/ESG score correlation strength (++ / + / = / - / --) as calculated by [4].

Term (2-gram), ROA corr. | Topic | Topic/ESG score correlation
Supply chain (+) | Human rights | ++
Business model (+) | Customer services, People and culture | +; -
Gender balance (+) | Diversity and inclusion | ++
Environmental impact (+) | General ESG | +
Carbon footprint (+) | Climate footprint and energy management | =
Gender pay (+) | Diversity and inclusion | ++
Climate change (+) | Climate risk and policy | ++
Human trafficking (+) | None directly related; in broader context in Human rights | ++
Working environment (+) | People and culture | -
Renewable energy (+) | Climate footprint and energy management | =
Waste management (-) | Waste management | --
Fossil fuels (-) | No explicit match; contextually appears in Climate footprint and energy management | =
Corporate responsibility (-) | Corporate governance | --
Carbon emissions (-) | Climate footprint and energy management | =
Mental health (-) | Health and safety | +
Energy use (-) | Climate footprint and energy management | -
Air quality (-) | No explicit match; contextually appears in Climate footprint and energy management | -
Energy efficiency (-) | Climate footprint and energy management | -
Product safety (-) | Health and safety | =

⁴ Note that these figures show differences in absolute probabilities: magnitudes are comparable within 1-grams, and within 2-grams, but not between 1- and 2-grams.
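The top-10-words matching rule used for Table 2 can be expressed in a few lines; the topic word lists below are hypothetical stand-ins for [4]'s BERTopic output:

```python
def align_terms(terms, topic_top_words, k=10):
    """Align each ESG term with every topic whose top-k words contain it."""
    return {term: [topic for topic, words in topic_top_words.items()
                   if term in words[:k]]
            for term in terms}

# Illustrative topic word lists (not the actual output of [4]).
topics = {"Diversity and inclusion": ["gender", "diversity", "inclusion", "pay"],
          "Waste management": ["waste", "recycling", "landfill"]}
print(align_terms(["gender", "waste"], topics))
# {'gender': ['Diversity and inclusion'], 'waste': ['Waste management']}
```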
Given this, we see some systematic groupings. Climate change, as part of the 'climate risk and policy' topic, as well as supply chain and human trafficking as part of the 'human rights' topic, represent the themes that appear to be, across different industries, related to high company ESG scores. A similar observation holds for gender balance, gender pay and environmental impact, which all fall into a group of topics that are strongly and significantly correlated with high ESG scores throughout different industries. Overall, high ESG performing companies exhibit high financial performance [1, 5]; therefore our results for terms such as climate change, supply chain and human trafficking are not surprising: as indicators of topics associated with high ESG, they are good terms for tracking the ESG aspects associated with high financial performance.

Looking at the terms with low values, which are associated with low RoA: waste management and corporate responsibility are associated with topics whose proportions correlate with ESG scores significantly positively in some industries and significantly negatively in others. Based on the overall correlation between ESG scores and topic proportions across different industries, these two topics are among the third of topics for which a negative correlation between topic proportion and ESG score prevails. Due to the aforementioned correlation between ESG and financial performance, it is therefore understandable that these terms are associated with mention in the annual reports of companies with low RoA. Overly extensive discussion of specific topics (such as 'waste management' and 'corporate responsibility') can negatively impact the ESG score (see [4]), which, by the analogy between ESG and financial performance [1, 5], can hold for companies with low RoA.

There is a surprising number of bigrams in both the high RoA and low RoA groups which seem to be associated with the same topic, namely 'climate footprint and energy management'. For companies with high RoA, these terms are carbon footprint and renewable energy; for companies with low RoA, the terms are fossil fuels, carbon emissions, energy use, air quality and energy efficiency. It seems that better performing companies use carbon footprint instead of carbon emissions, and discuss the use of renewable energy more than energy use, energy efficiency and/or fossil fuels. In future work, we plan to analyse the use of these terms in more depth, including analysis of the lexical and topical contexts in which they appear, and adding techniques such as sentiment and topic analysis to shed more light on these distinctions.

ACKNOWLEDGEMENTS
The authors thank the reviewers for helpful suggestions, and acknowledge financial support from the Slovenian Research Agency for research core funding (No. P2-0103), as well as for funding of the research project Quantitative and qualitative analysis of the unregulated corporate financial reporting (No. J5-2554).

REFERENCES
[1] Nisar Ahmad, Asma Mobarek, and Naheed Nawazesh Roni. 2021. Revisiting the impact of ESG on financial performance of FTSE350 UK firms: static and dynamic panel data analysis. Accounting, Corporate Governance & Business Ethics. doi: 10.1080/23311975.2021.1900500.
[2] A. C. Amason and H. J. Sapienza. 2012. The effects of top management team size and interaction norms on cognitive and affective conflict. Journal of Management, 23, 495-516.
[3] S. A. Busru and G. Shanmugasundaram. 2017. Effects of innovation investment on profitability and moderating role of corporate governance: empirical study of Indian listed firms. Indian Journal of Corporate Governance, 10, 2, 97-117. https://doi.org/10.1177/0974686217730938.
[4] Ursa Ferjancic et al. forthcoming. Textual analysis of corporate sustainability reporting and corporate ESG scores. Under review.
[5] Gunnar Friede, Timo Busch, and Alexander Bassen. 2015. ESG and financial performance: aggregated evidence from more than 2000 empirical studies. Journal of Sustainable Finance & Investment, 5, 4, 210-233. doi: 10.1080/20430795.2015.1118917.
[6] Maarten Grootendorst. 2022. BERTopic: neural topic modeling with a class-based TF-IDF procedure. arXiv: 2203.05794 [cs.CL]. https://arxiv.org/abs/2203.05794.
[7] M. Jagannathan, D. Roy, and V. S. K. Delhi. 2022. Application of NLP-based topic modeling to analyse unstructured text data in annual reports of construction contracting companies. CSI Transactions on ICT, 10, 2, 97-106.
[8] H. Lee, S. H. Lee, K. R. Lee, and J. H. Kim. 2023. ESG discourse analysis through BERTopic: comparing news articles and academic papers. Computers, Materials & Continua, 75, 3, 6023-6037.
[9] Tristan Lim. 2024. Environmental, social, and governance (ESG) and artificial intelligence in finance: state-of-the-art and research takeaways. Artificial Intelligence Review, 57, 76. doi: 10.1007/s10462-024-10708-3.
[10] Steve Lydenberg, Jean Rogers, and David Wood. 2010. From Transparency to Performance: Industry-Based Sustainability Reporting on Key Issues. Tech. rep. Hauser Center for Nonprofit Organizations at Harvard University. Available from https://iri.hks.harvard.edu/links/transparency-performance-industry-based-sustainability-reporting-key-issues.
[11] M. Marzook and B. Al Ahmady. 2022. Linking organisational performance and corporate social responsibility. European Journal of Business and Management Research, 7, 3, 335-343. https://doi.org/10.24018/ejbmr.2022.7.3.1466.
[12] T. Nugent, N. Stelea, and J. L. Leidner. 2020. Detecting ESG topics using domain-specific language models and data augmentation approaches. http://arxiv.org/abs/2010.08319.
[13] Matthew Purver, Matej Martinc, Riste Ichev, Igor Lončarski, Katarina Sitar Šuštar, Aljoša Valentinčič, and Senja Pollak. 2022. Tracking changes in ESG representation: initial investigations in UK annual reports. In Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference. Mingyu Wan and Chu-Ren Huang, editors. Marseille, France, 9-14. https://aclanthology.org/2022.csrnlp-1.2.
[14] W. Xu and K. Eguchi. 2021. Topic embedding regression model and its application to financial texts. In Proceedings of the Third Workshop on Financial Technology and Natural Language Processing, 15-21.
Classification of Patents Into Knowledge Fields: Using a Proposed Knowledge Mapping Taxonomy (KnowMap)

Elham Motamedi, University of Primorska, Koper, Slovenia (elham.motamedi@upr.si)
Inna Novalija, Jožef Stefan Institute, Ljubljana, Slovenia (inna.koval@ijs.si)
Luis Rei, Jožef Stefan Institute, Ljubljana, Slovenia (luis.rei@ijs.si)

Abstract
Various platforms, including patent systems and repositories like GitHub and arXiv, support knowledge dissemination across domains. As knowledge increasingly spans multiple disciplines, there is a need to track innovations that intersect various fields. Despite available data, a comprehensive knowledge taxonomy for effectively tracking innovations across domains is lacking. Developing such a taxonomy and employing automated classification methods will enhance the ability to track shared knowledge. In this work, we first developed a knowledge taxonomy based on the CPC schema. We formulated the classification of textual data into the defined knowledge fields as a multi-label problem. Then, we evaluated the effectiveness of the classification models by fine-tuning pre-trained transformer language models. The multi-label framework enables the tracking of knowledge trends at the intersection of various disciplines.

Keywords
Knowledge Taxonomy, Knowledge Tracking, Patent Classification, Hierarchical Classification, Multi-label Classification

Table 1: Example of a sequence of codes across different levels of the CPC hierarchy.

Level | CPC Code | Title
Section | H | Electricity
Class | H03 | Electronic circuitry
Subclass | H03C | Modulation
Group | H03C3/00 | Angle modulation
Subgroup | H03C3/005 | Circuits for asymmetric modulation
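As a simplified illustration of the code structure in Table 1 (not part of the paper's method), a CPC code string can be decomposed into its hierarchy levels as follows:

```python
def cpc_levels(code):
    """Decompose a CPC code such as 'H03C3/005' into its hierarchy levels."""
    group = code.split("/")[0]
    return {
        "section": code[0],      # 'H'    (Electricity)
        "class": code[:3],       # 'H03'  (Electronic circuitry)
        "subclass": code[:4],    # 'H03C' (Modulation)
        "group": f"{group}/00",  # 'H03C3/00' (Angle modulation)
        "subgroup": code,        # 'H03C3/005' (Circuits for asymmetric modulation)
    }

print(cpc_levels("H03C3/005"))
```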
1 Introduction
According to the World Intellectual Property Organisation (WIPO), a patent is an exclusive right granted for an invention, providing legal protection to the inventor while simultaneously benefiting society by making the invention publicly accessible¹. Each year, patent offices receive numerous patent applications that need to be processed [13]. To ensure the novelty of patent applications, inventors should also be able to search existing patents. Organising patents with unique codes in a hierarchical structure aids efficient retrieval and aligns with natural human navigation, starting from broad categories and narrowing down to specifics [21]. Among these hierarchical structures, the CPC system is widely recognised [6]. The CPC codes are organised as a taxonomy, meaning that each entity at a lower level is a detail group of its parent. A patent can be assigned one or more labels by the experts in patent offices [8, 18]. At the first level of the CPC hierarchy there are nine sections, which are divided into classes, subclasses, groups, and subgroups. Each level of this hierarchy can have several codes, ending in approximately 250,000 classification labels [11]. An example of the hierarchical structure of a CPC code is provided in Tab. 1. In this work, we focus on the CPC schema.

The CPC schema's top level has only nine sections, but the number of groups increases substantially at lower levels. In this study, we created a knowledge field taxonomy by merging CPC's detailed classes into a more abstract representation. This taxonomy not only serves as a framework for knowledge representation but also offers a benchmark for patent classification systems. While some studies address the issue of numerous class labels by excluding less-represented classes or truncating hierarchies [24], a consistent benchmark taxonomy has been lacking. Since our proposed knowledge taxonomy aligns with the CPC schema, it is able to provide a benchmark for future studies, facilitating the comparison of different models.

In summary, our paper's contribution is the proposal of a knowledge field taxonomy, KnowMap, which aligns with the widely used CPC schema. KnowMap merges several class labels within the CPC schema based on the scope of the knowledge field and the number of patents associated with each class. The KnowMap taxonomy is available online². In this study, we also performed a classification task to categorise patents into the fine-grained classes defined by our proposed taxonomy.

2 Related Work
Patent documents contain various types of information, including text, diagrams, plots, and references to other patents or scientific publications [20]. The textual content of a patent is divided into several sections, such as the title, abstract, claims, and description [11]. The title and abstract are shorter than the description but still provide relevant information for classification. Li et al. [15] evaluated various lengths of the abstract and title, finding that using the first 100 words of the title and abstract resulted in the best classification performance in their study.

Various classification systems exist for organising patents [6]. Hierarchical representations help organise patents and facilitate efficient searching. Kamateri et al. [11] discussed several potential challenges that artificial intelligence technologies face in patent classification. One such challenge is the extensive number of class labels. As an example, the IPC contains approximately 86,000 classes, while the CPC has around 250,000.

¹ https://www.wipo.int/portal/en/
² https://github.com/elmotamedi/KnowMap-Taxonomy
since every patent can belong to several knowledge fields [18, Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia © 2024 Copyright held by the owner/author(s). 2 https://doi.org/10.70314/is.2024.sikdd.19 https://github.com/elmotamedi/KnowMap- Taxonomy 59 Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia Motamedi et al. 10]. Given the large number of classes at the lowest level of the or higher were considered duplicates. To generate the hash sig-taxonomy tree, the performance of automatic models in predict- natures in MinHash, we used 128 permutations. For the n-gram ing such granular categories is limited. Various models have been representation, we used a range of 1 to 3, incorporating 1-grams, used to classify patents in a multi-label setting, ranging from clas- 2-grams, and 3-grams. sical machine learning models to deep learning models [15, 5, 8]. Several previous studies have focused on higher levels of the 3.2 Refining Hierarchical Structure Through hierarchy, limiting classification to broader categories such as Group Merging sections, classes, or subclasses within the taxonomy [3]. Bekamiri The hierarchical structure of the CPC groups was refined at each et al. [3] fine-tuned the SBERT model to predict labels at the sub-level of the tree. We started with nine sections at the top level (i.e., class level (i.e., 663 class labels) using a multi-label formulation. level 1), which were preserved. At subsequent levels (i.e., level 2 to They achieved F1-score of 66%, outperforming previous studies level 4), groups were merged by manual analysis based on shared that used the same datasets. Aroyehun et al. [1] similarly trun-knowledge and the number of documents. Groups with relatively cated the IPC hierarchy at the subclass level and predicted these few documents (i.e., groups with fewer than 40,000 for level 2, labels by transferring knowledge from two higher levels (section 20,000 for level 3, and 9,000 for level 4) were combined with other and class) to the lower level (subclass), achieving a precision groups at the same level that shared similar knowledge. As an ex- score of 0.53. While it remains valuable for patent office experts ample, at the subclass level of the CPC hierarchy, "A01B" (i.e., Soil to use an automatic model that can narrow down applications to working) and "A01C" (i.e., Planting, Sowing, Fertilising) represent higher levels of the taxonomy tree, this approach has limitations related steps in agricultural practices, as both are foundational and challenges. One such challenge is that the choice of target processes in land preparation and management. We merged them class labels does not depend on the scope of the knowledge area. into a single group labelled "Soil working and planting," resulting More established and expansive areas may benefit from directing in 162,567 patents in this category. The refinement continued experts to detailed groups, while less developed areas may be until the fine-grained classes contained at least 9,000 documents. adequately served by broader classifications. 3.3 Text Classification 3 Methods and Materials We formulated the classification problem as a multi-label problem, In this work, we developed a knowledge taxonomy and classi- in which each document can be assigned to multiple knowledge fied patents into fine-grained classes by fine-tuning pre-trained fields. In this study, we aimed to classify the patents into the fine- models. Below, we outline the methods and materials used. 
3.2 Refining the Hierarchical Structure Through Group Merging
The hierarchical structure of the CPC groups was refined at each level of the tree. We started with nine sections at the top level (i.e., level 1), which were preserved. At subsequent levels (i.e., levels 2 to 4), groups were merged by manual analysis based on shared knowledge and the number of documents. Groups with relatively few documents (i.e., groups with fewer than 40,000 for level 2, 20,000 for level 3, and 9,000 for level 4) were combined with other groups at the same level that shared similar knowledge. As an example, at the subclass level of the CPC hierarchy, "A01B" (i.e., Soil working) and "A01C" (i.e., Planting, Sowing, Fertilising) represent related steps in agricultural practices, as both are foundational processes in land preparation and management. We merged them into a single group labelled "Soil working and planting", resulting in 162,567 patents in this category. The refinement continued until the fine-grained classes contained at least 9,000 documents.

3.3 Text Classification
We formulated the classification problem as a multi-label problem, in which each document can be assigned to multiple knowledge fields. In this study, we aimed to classify the patents into the fine-grained classes at the lowest level of the proposed taxonomy (i.e., 83 classes). To balance performance and computational cost given the large size of the dataset, we used the pre-trained language models distilroberta-base, a distilled version of RoBERTa [16, 19], and all-MiniLM-L6-v2, a version of MiniLM fine-tuned for semantic similarity [22, 17]. The pre-trained models were fine-tuned for the downstream task by adding a classification head. The classification head takes the hidden state of the first token from the model and processes it through a fully connected dense linear layer, followed by a dropout layer for regularisation and a tanh activation function for non-linearity. Since our task is multi-label classification, the output logits for each class are converted into probabilities using a sigmoid function.

For model training, we used a learning rate of 4e-5 with a linear scheduler and a weight decay of 0.1. To prevent overfitting, the best checkpoint was selected based on evaluation metrics on the validation set. We trained the model for up to 5 epochs with early stopping criteria based on validation accuracy. The dataset, consisting of 1,092,991 samples randomly selected after deduplication, was split into training, validation, and test sets with ratios of 0.8, 0.1, and 0.1, respectively. To preserve the ratio of samples per class in the training, validation, and test sets, we used stratified splitting⁵.

3.4 Classification Evaluation
The F1-score is a common metric for classification tasks. We report both Micro-F1, averaged across all instances, and Macro-F1, averaged across all classes.

⁵ https://github.com/trent-b/iterative-stratification
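A minimal sketch of the multi-label setup described in Section 3.3, using the Hugging Face transformers library (whose standard RoBERTa classification head closely matches the dense/dropout/tanh head described above); the input text and the 0.5 decision threshold are assumptions for illustration, and the training loop is omitted:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base",
    num_labels=83,                               # KnowMap leaf classes
    problem_type="multi_label_classification",   # BCE-with-logits loss
)

text = "Title. Abstract. Description ..."        # concatenated patent text
batch = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
probs = torch.sigmoid(logits)                    # one probability per class
predicted = (probs > 0.5).nonzero(as_tuple=True)[1]  # several labels possible
```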
Then, we report the 3 5 https://github.com/google/patents- public- data https://github.com/trent- b/iterative- stratification?tab=readme- ov- file#multilab 4 https://worldwide.espacenet.com/ elstratifiedkfold 60 Classification of Patents Into Knowledge Fields: Using KnowMap Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia Table 3: Classification Results performance of classifiers in categorising patents into the fine- grained classes of this taxonomy. Metric RoBERTa SBERT 4.1 The Proposed Knowledge Mapping Micro-F1 (Val) 0.76 0.76 Taxonomy (KnowMap) Macro-F1 (Val) 0.86 0.86 The taxonomy, along with the associated CPC sections, classes, Micro-F1 (Test) 0.77 0.76 subclasses, groups, and subgroups are provided in the shared Macro-F1 (Test) 0.90 0.90 online source. An example of detailing the knowledge field of soil working and planting within the broader knowledge field of human necessities is illustrated in Fig. 1. 1.0 CPC All groups in A A01 0.8 A01B, A01C A01B, A01C 162,567 docs 162,567 docs KnowMap SOIL WORKING AND SOIL WORKING AND PLANTING PLANTING 0.6 alue HARVESTING AND 1,543,195 docs PRODUCE PROCESSING AGRICULTURE ANIMAL HUSBANDRY malized V AND CONTROL 0.4 30,813,838 docs FOODSTUFFS TOBACCO Nor HUMAN NECESSITIES DAIRY PRODUCTS PERSONAL OR DOMESTIC ARTICLES OPERATIONS AND TRANSPORTING 0.2 HEALTH AMUSEMENT ocs CHEMISTRY AND d METALLURGY 2 ,02 49 F1 Macro TEXTILES AND PAPER 0.0 Test Size 7,7 t 18 oo 0 12 20 41 62 82 FIXED CONSTRUCTIONS R Class Index MECHANICAL ENGINEERING Figure 2: Normalised test size along with F1 Macro scores PHYSICS for each class. The x-axis represents class indices. The y- ELECTRICITY axis shows normalised values for test size and F1 Macro scores (blue dots). NEW TECHNOLOGIES Level 1 Level 2 Level 3 Level4 We demonstrated the experimental results on the two classifi- Figure 1: An example of a branch extension in KnowMap cation models RoBERTa and SBERT in Tab. 3. from the root to the lowest level, showing the association As observed from the results, the Macro-F1 score is higher than of KnowMap classes with corresponding CPC classes at the Micro-F1 score, which may indicate that the model performs each level. better for minority classes compared to majority classes. To gain more insights into these results, we generated a plot (see Fig.2), showing the F1 scores along with the normalised number of documents for each class in the test set. We used normalised 4.2 Classification Results values to allow both F1 scores and class sizes to be displayed in a single figure, facilitating better comparison. The classification task in this study was to classify patents into The plot shows that the Macro-F1 score is higher for minority 83 fine-grained classes within our proposed KnowMap taxonomy. classes than for majority classes, also indicating that random The dataset comprised 1,092,991 documents, which were split sampling led to an unbalanced dataset. The imbalanced sample into the train, validation, and test sets with a ratio of 0.8, 0.1, likely caused the higher Macro-F1 score relative to Micro-F1, and 0.1 respectively. We preserved the ratio of samples per class reflecting poorer performance in the majority classes. Future in all three sets with stratified splitting. The average number work will focus on using balancing techniques when sampling of documents in the train set, validation set, and test sets are to address this issue and enhance model performance. presented in Tab. 2. 
When looking more closely at the lowest F1-Macro scores, we found that the bottom 10 classes were all leaves under the chemistry and metallurgy section. Moreover, the highest F1-Macro scores (0.996) were achieved by the two classes in the textiles and paper section, followed by all 17 leaves from the physics section. We suspect this performance difference may be due to greater variation in the textual data of chemistry and metallurgy compared to physics and textiles and paper, leading to more variation between the training and test sets. Analysing this variation in detail remains a task for future work. Additionally, we believe future work could benefit from adapting the classifier to a hierarchical structure, prioritising correct predictions at higher levels before refining predictions at the leaf level. In our current approach, the classifier does not account for the hierarchy and predicts all leaves directly.

5 Discussion and Conclusions
In this work, we proposed a knowledge field taxonomy, KnowMap, which aligns with the widely used CPC schema. The taxonomy consists of 83 groups at the lowest level, with fine-grained classes containing a minimum of 9,000 samples from the original Google Patents Public Dataset after preprocessing. KnowMap serves as a benchmark taxonomy, addressing a gap in the existing literature. From the preprocessed original dataset, we randomly selected 1,093,151 samples to fine-tune pre-trained RoBERTa and SBERT models for downstream tasks. However, the random sampling resulted in an unbalanced dataset, which contributed to higher Macro-F1 scores compared to Micro-F1 scores. To enhance classification results, we plan to create a balanced dataset from the original data. Additionally, we aim to use larger models than those used in this study to further improve the fine-tuning process.

6 Future Work
Several knowledge platforms, such as news sites and GitHub, host various types of information shared online. In future work, we aim to incorporate these sources to extend and enhance the knowledge taxonomy's coverage. For example, the All Science Journal Classification (ASJC), which organises research publications by subject area, can be used to identify alignments with the existing taxonomy. This taxonomy alignment can then be further analysed to determine whether to merge or split classes at various levels. Beyond patents, we plan to evaluate the classifier on other data, using domain adaptation methods to transfer knowledge from the labelled patent domain to domains with limited or no labels. Large language models (LLMs) could further aid in evaluating the classifier's performance across different domains. Recent research has shown the potential of LLMs to augment or even replace human-labeled training data with labels generated by these models [23].

Moreover, we plan to enhance the classification task by balancing the dataset using balancing techniques for multi-label problems and leveraging larger pre-trained models. We will also closely examine the different knowledge fields to better understand the variations in classifier performance across them.

Acknowledgements
This work was supported by the Slovenian Research and Innovation Agency under grant agreements CRP V2-2272, V5-2264, and CRP V2-2146, and by the European Union through the enrichMyData EU HORIZON-IA project under grant agreement No 101070284.

References
[1] Segun Taofeek Aroyehun, Jason Angel, Navonil Majumder, Alexander Gelbukh, and Amir Hussain. 2021. Leveraging label hierarchy using transfer and multi-task learning: A case study on patent classification. Neurocomputing, 464, 421-431. doi: 10.1016/j.neucom.2021.07.057.
[2] Mehmet Aydar and Serkan Ayvaz. 2019. An improved method of locality-sensitive hashing for scalable instance matching. Knowledge and Information Systems, 58, 2, 275-294. doi: 10.1007/s10115-018-1199-5.
[3] Hamid Bekamiri, Daniel S. Hain, and Roman Jurowetzki. 2024. PatentSBERTa: A deep NLP based hybrid model for patent distance and classification using augmented SBERT. Technological Forecasting and Social Change, 206, 123536. doi: 10.1016/j.techfore.2024.123536.
[4] Gianni Costa, Alfredo Cuzzocrea, Giuseppe Manco, and Riccardo Ortale. 2011. Data De-duplication: A Review. In Learning Structure and Schemas from Documents. doi: 10.1007/978-3-642-22913-8.
[5] C. J. Fall, A. Törcsvári, K. Benzineb, and G. Karetka. 2003. Automated categorization in the international patent classification. ACM SIGIR Forum, 37, 1, 10-25. doi: 10.1145/945546.945547.
[6] Juan Carlos Gomez and Marie Francine Moens. 2014. A survey of automated hierarchical classification of patents. Lecture Notes in Computer Science, 8830, 215-249. doi: 10.1007/978-3-319-12511-4_11.
[7] Bikash Gyawali, Lucas Anastasiou, and Petr Knoth. 2020. Deduplication of scholarly documents using locality sensitive hashing and word embeddings. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association, 894-903.
[8] Arousha Haghighian Roudsari, Jafar Afshar, Wookey Lee, and Suan Lee. 2022. PatentNet: multi-label classification of patent documents using deep learning based language understanding. Scientometrics, 127, 1, 207-231. doi: 10.1007/s11192-021-04179-4.
[9] Omid Jafari, Preeti Maurya, Parth Nagarkar, Khandker Mushfiqul Islam, and Chidambaram Crushev. 2021. A Survey on Locality Sensitive Hashing Algorithms and their Applications. ACM Computing Surveys. arXiv: 2102.08942.
[10] Guik Jung, Junghoon Shin, and Sangjun Lee. 2023. Impact of preprocessing and word embedding on extreme multi-label patent classification tasks. Applied Intelligence, 53, 4, 4047-4062. doi: 10.1007/s10489-022-03655-5.
[11] Eleni Kamateri, Michail Salampasis, and Eduardo Perez-Molina. 2024. Will AI solve the patent classification problem? World Patent Information, 78, 102294. doi: 10.1016/j.wpi.2024.102294.
[12] Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models. In International Conference on Machine Learning, vol. 162, 10697-10707.
[13] Jong Wook Lee, Won Kyung Lee, and So Young Sohn. 2021. Patenting trends in biometric technology of the Big Five patent offices. World Patent Information, 65, 102040. doi: 10.1016/j.wpi.2021.102040.
[14] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 8424-8445. doi: 10.18653/v1/2022.acl-long.577.
[15] Shaobo Li, Jie Hu, Yuxin Cui, and Jianjun Hu. 2018. DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics, 117, 2, 721-744. doi: 10.1007/s11192-018-2905-5.
[16] Yinhan Liu et al. 2019. RoBERTa: a robustly optimized BERT pretraining approach. arXiv: 1907.11692.
[17] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using siamese BERT-networks. In Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
[18] Arousha Haghighian Roudsari, Jafar Afshar, Charles Cheolgi Lee, and Wookey Lee. 2020. Multi-label patent classification using attention-aware deep learning model. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp 2020), 558-559. doi: 10.1109/BigComp48618.2020.000-2.
[19] Victor Sanh, L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
[20] Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, and Stuart M. Shieber. 2023. The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks, 1-39. arXiv: 2207.04043.
[21] Christoph Trattner, Philipp Singer, Denis Helic, and Markus Strohmaier. 2012. Exploring the differences and similarities between hierarchical decentralized search and human navigation in information networks. In ACM International Conference Proceeding Series. doi: 10.1145/2362456.2362474.
[22] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33, 5776-5788.
[23] Xinru Wang, Hannah Kim, Sajjadur Rahman, Kushan Mitra, and Zhengjie Miao. 2024. Human-LLM collaborative annotation through effective verification of LLM labels. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24), Article 303. ACM, Honolulu, HI, USA. doi: 10.1145/3613904.3641960.
[24] Junghwan Yun and Youngjung Geum. 2020. Automated classification of patents: A topic modeling approach. Computers and Industrial Engineering, 147, 106636. doi: 10.1016/j.cie.2020.106636.
Enhancing causal graphs with domain knowledge: matching ontology concepts between ontologies and raw text data

Jernej Stegnar, Jožef Stefan Institute, Ljubljana, Slovenia (jernej.stegnar@gmail.com)
Jože M. Rožanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia (joze.rozanec@ijs.si)
Gregor Leban, Event Registry d.o.o., Ljubljana, Slovenia (gregor@eventregistry.org)
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia (dunja.mladenic@ijs.si)

ABSTRACT
When building a causal graph from textual sources, such as media reports, a key task is to provide an accurate semantic understanding of the causal variables encoded as nodes and to link them to existing ontologies with at least two purposes: (i) expand the knowledge with the domain knowledge captured in such ontologies, and (ii) provide accurate and different levels of abstraction of the extracted causal variables. This article describes how we used OntoGPT, a tool for matching raw text to ontology concepts initially designed for the medical domain, to match concepts from media events to relevant ontologies. We build upon our previous work on extracting causal variables and enrich the extraction pipeline by matching causal variables to concepts from specific domain ontologies. In particular, we describe our work regarding the GEO ontology. Future work will focus on expanding OntoGPT's capabilities by utilizing a wider selection of ontologies. Addressing its limitations, such as dealing with multiple instances of the same class, will also be crucial for improving its utility. These improvements will allow the tool to better support strategic foresight applications by providing more detailed insights across a multitude of different sectors, further enriching causal graphs and facilitating more accurate predictive modeling.

KEYWORDS
strategic foresight, ontology matching, artificial intelligence
reports, a key task is to provide an accurate semantic understand- AI enhances strategic foresight by automating the analysis of ing of the causal variables encoded as nodes and to link them data and detecting patterns that may go unnoticed by human to existing ontologies with at least two purposes: (i) expand the experts [1]. Machine learning algorithms can continuously mon- knowledge with the domain knowledge captured in such ontolo- itor emerging trends, geopolitical shifts, and market fluctuations gies and (ii) provide accurate and different levels of abstraction in near-real time, offering dynamic insights into potential future of the extracted causal variables. This article describes how we scenarios. Natural language processing (NLP) enables AI to sift used OntoGPT, a tool for matching raw text to ontology concepts through massive amounts of text, extracting relevant informa- initially designed for the medical domain, to match concepts from tion from reports, news, and social media, thus accelerating the media events to relevant ontologies. We build upon our previous forecasting process. By integrating AI into strategic foresight, work on extracting causal variables and enrich the extraction organizations can adapt more swiftly and make more informed, pipeline by matching causal variables to concepts from specific data-driven decisions in the face of uncertainty. domain ontologies. In particular, we describe our work regard- Ontologies provide structured knowledge informing the rela- ing the GEO ontology. Future work will focus on expanding tionships between concepts within a specific domain. Further- OntoGPT’s capabilities by utilizing a wider selection of ontolo- more, they describe those concepts through properties and can gies. Addressing its limitations, such as dealing with multiple link such classes to specific instances observed in the real world. instances of the same class, will also be crucial for improving its As such, they are of key importance when building a causality utility. These improvements will allow the tool to better support graph, given they can augment our understanding of the causal strategic foresight applications by providing more detailed in- relationships between variables with a better understanding of sights across a multitude of different sectors, further enriching the context and the variable implications [3]. For example, if causal graphs and facilitating more accurate predictive modeling. the causal relationship reports about the ceasing of an armed conflict, knowing whether a causal variable relates to a coun- KEYWORDS try, the location of that country, the neighboring countries, and international organizations it is involved in would help to un- strategic foresight, ontology matching, artificial intelligence derstand the magnitude of that event and contextualize other likely outcomes (refugee repatriation, impacts on investments, 1 INTRODUCTION and others). Strategic foresight is a discipline concerned with anticipating In the scope of the graph massive project, ontology matching future trends, uncertainties, and disruptions to inform decision- is being used to link the extracted causal relationships from text making and enable the creation of resilient, long-term strategies. 
2 ENRICHING CAUSAL GRAPHS WITH DOMAIN KNOWLEDGE
We consider ontologies a framework (an organized and structured system for representing knowledge) used to represent knowledge within a specific domain by defining the relationships between concepts. They consist of classes (concepts), properties (attributes), and relationships that connect different concepts. This structure provides a standardized way to organize and interpret data, ensuring consistent understanding across systems. For example, in a medical ontology, concepts like "disease" might be linked to "symptoms," "treatments," and "causes," each with its own defined properties. By formalizing these relationships, ontologies allow AI systems to better interpret and reason about complex information, leading to more accurate data processing and decision-making.
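To make the medical example concrete, such a fragment could be written in OWL/Turtle roughly as follows. This is an illustrative sketch with made-up IRIs, not an excerpt from any actual medical ontology:

    @prefix :     <http://example.org/med#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Classes (concepts)
    :Disease   a owl:Class .
    :Symptom   a owl:Class .
    :Treatment a owl:Class .

    # Properties (relationships between concepts)
    :hasSymptom a owl:ObjectProperty ;
        rdfs:domain :Disease ;
        rdfs:range  :Symptom .
    :treatedBy a owl:ObjectProperty ;
        rdfs:domain :Disease ;
        rdfs:range  :Treatment .

    # Instances observed in the real world
    :Influenza a :Disease ;
        :hasSymptom :Fever .
    :Fever a :Symptom .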
Ontologies enhance causality graphs by providing domain-specific knowledge that improves the accuracy and depth of the relationships represented. When extracting causal relationships from large datasets, such as media reports, the data can often be ambiguous or incomplete. Ontologies address this by offering structured knowledge that defines concepts and their relationships within a specific domain, linking extracted causal relationships to well-defined entities in the ontology. This enriches the causality graph, uncovering implicit connections and non-obvious relationships that may otherwise be missed. In strategic foresight, for example, ontology-based enrichment helps capture a broader range of potential future scenarios by incorporating knowledge beyond the immediate dataset. This leads to more reliable predictions, especially when the training data is limited or domain-specific. Ultimately, ontologies are expected to enable the system to generalize better, predict outcomes with higher accuracy, and improve the overall reliability of causality graphs.

The causality graph pipeline in the Graph Massivizer strategic foresight project is designed to automate the extraction, organization, and analysis of causal relationships from large datasets, particularly news articles. Figure 1 showcases the structure of our causality graph's data pipeline. The process begins with extracting these relationships from news articles, which are then organized into a causality graph that maps the interactions between various factors and events. The goal is to develop link prediction models that estimate the likelihood of future events based on observed patterns. For instance, one use case involves predicting oil price trends by analyzing factors that influence pricing.

Figure 1: The figure showcases our pipeline for building a causality graph. Sub-figure B showcases how the process of ontology linking was executed as part of our pipeline.

Ontology matching is then integrated into the pipeline to link extracted causal relationships with concepts from structured ontologies. This enrichment adds layers of context and enables the discovery of connections that may not be evident from raw data alone. By incorporating ontologies, the pipeline transcends the limitations of its training data, identifying causal relationships that may be implied by broader knowledge contained in the ontologies. This not only enhances the accuracy of the graph but also allows it to capture more complex and non-direct relationships, improving its predictive capabilities.

As shown in Fig. 1B, the process of ontology linking in our pipeline consisted of creating ontology matching templates, then linking the concepts in the text to ontologies, and using that information to add additional data to existing causalities, all with the purpose of finding extra implicit connections based on the information provided by the ontologies. The main problem that needed solving for that purpose was how to link ontologies to raw text data.
In our case, that was done using OntoGPT [2], a tool for ontology linking. Another key challenge is inter-ontology matching, which involves linking multiple ontologies through shared concepts. This process expands the knowledge framework, making it even more valuable for our purposes. The challenge of inter-ontology matching has not been addressed yet and remains a matter of future work.

3 ONTOGPT: A BRIEF OVERVIEW
OntoGPT is an advanced tool that integrates large language models (LLMs) with ontologies to improve knowledge extraction and organization across various domains. Ontologies provide a consistent and accurate representation of complex information by defining structured relationships between concepts.

The primary purpose of OntoGPT is to enhance AI systems' understanding, processing, and categorization of data by linking extracted information to predefined concepts and relationships within an ontology. This structured approach ensures greater accuracy and reliability compared to traditional AI systems that rely on unstructured data.

OntoGPT works by connecting data from sources such as text or reports to specific concepts in an ontology, allowing for more informed and contextually accurate connections. For example, in healthcare, OntoGPT can link symptoms from patient records to diseases and treatments outlined in medical ontologies, helping to suggest possible diagnoses or treatment plans.

By combining the language-processing capabilities of LLMs with the structured knowledge available in ontologies, OntoGPT enables AI systems to go beyond keyword matching and consider the relationships between terms. This leads to more intelligent data interpretation and improved decision-making.

OntoGPT is widely used in fields where structured knowledge is critical for high accuracy, such as healthcare, biology, and pharmaceutical research. In medical research, for instance, OntoGPT links clinical trial data, medical records, and scientific literature to medical ontologies, supporting better analysis and decision-making.

The key advantage of OntoGPT lies in its ability to ground AI outputs in domain-specific, structured knowledge, reducing the likelihood of errors and improving the relevance of insights. This grounding ensures that AI responses are not based on patterns alone but also on well-defined concepts and their relationships.

In summary, OntoGPT bridges the gap between the raw data-processing power of LLMs and the structured knowledge in ontologies. By leveraging both, it provides a more accurate and reliable approach to extracting and linking data across various domains, particularly when working with large, complex datasets.

3.1 OntoGPT's role
At a lower level, OntoGPT operates using YAML templates that define how data should be extracted from text and linked to ontological concepts. These templates serve as blueprints, specifying which types of entities, relationships, and properties to look for in the input text. The templates guide the large language model by mapping textual data to predefined concepts and relationships from the ontology, ensuring that the extracted information is both relevant and structured. Figure 2 shows the process of ontology linking for an example of a simple sentence. Each YAML template contains detailed instructions on how to identify key terms, their corresponding ontology classes, and the relationships between them. This allows OntoGPT to recognize when a piece of text, such as a sentence from a media article, contains a concept that aligns with an entity or event in the ontology. Once identified, the tool links the extracted data to these ontology entries, enabling richer and more meaningful connections in the data, as it is now grounded in an established knowledge framework.

Figure 2: A showcase of the function of OntoGPT.

The approach described in this article uses an ontology file as input to create such templates for data extraction and linking. This enables a broader range of ontology linking, as the templates can be created on demand.

4 TEMPLATES AND PYTHON CODE GENERATION
The approach works by using the information defined inside the ontology to generate the YAML templates. Figure 3 showcases how this is done.

Figure 3: The process of template generation.

First, the class information for each class inside the ontology is extracted. This is done by using the "owlready2" Python library to parse the ontology into an object and then extracting the relevant information from that object. Every class inside the ontology is used to create a corresponding template class, which is optimal, as it covers all parts of the ontology that could potentially be linked. A small portion of the data extraction process is ontology-specific and was custom-tailored to the individual ontology, as some information (like class descriptions) is saved in different parts.

Secondly, the data extracted from the ontology is processed and used to create custom YAML templates. This is done by using the extracted information to fill in a "general template" we used for generation; specifically, the class names and descriptions are used. This gives OntoGPT the names of the classes inside the ontology that we are trying to link the text data to, together with their descriptions, which assists OntoGPT in more accurately identifying these classes inside the text. The YAML file also contains the "annotators" information, which tells OntoGPT which ontology to ground the responses to. The generated YAML templates are saved into a separate file after generation, which makes them ready for use. The Python code used by OntoGPT in the process of ontology linking is similarly generated by filling in the "general template" with the extracted information, and is then saved to a separate file.
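As an illustration of this generation step, the following minimal Python sketch parses an ontology with owlready2 and fills a simplified "general template" with class names and descriptions. It is a sketch under stated assumptions, not the project's actual code: the file names, the template fields, and the annotator value are illustrative, loosely modeled on OntoGPT's LinkML-style templates.

    # A minimal sketch of template generation (illustrative, not the actual pipeline code).
    from owlready2 import get_ontology

    onto = get_ontology("file://geo.owl").load()  # hypothetical ontology file

    # Simplified per-class block of a "general template"; the annotator value
    # ("sqlite:obo:geo") is an assumption about how grounding is configured.
    GENERAL_TEMPLATE = """\
      {name}:
        is_a: NamedEntity
        description: {description}
        annotations:
          annotators: sqlite:obo:geo
    """

    blocks = []
    for cls in onto.classes():
        # Class descriptions are often stored in rdfs:comment; where an ontology
        # keeps them elsewhere, this lookup must be custom-tailored (see text).
        description = cls.comment[0] if cls.comment else "No description available."
        # Real code would also need to escape/clean the description for YAML.
        blocks.append(GENERAL_TEMPLATE.format(name=cls.name, description=description))

    with open("geo_template.yaml", "w") as f:
        f.write("classes:\n" + "".join(blocks))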
5 LIMITATIONS
5.1 Multiple Same-Class Concepts
OntoGPT has problems linking two or more concepts to a place in the ontology if the concepts are of the same class. This happens because both concepts suit the description and similar criteria on which OntoGPT bases its extraction. This causes OntoGPT to merge both concepts into a single string and then try to locate that string inside the ontology, which fails because there is no individual inside the ontology class with such a name. An example of such a response is shown in Listing 1:

Listing 1: Example of a bad response
    extracted_object:
      continent: AUTO:Europe%2C%20Africa
    named_entities:
      - id: AUTO:Europe%2C%20Africa
        label: Europe, Africa

If OntoGPT manages to locate a concept from the text inside the ontology, it returns its id (an example of this is "sea: GEO:000055471" and "id: GEO:000055471 : White Sea"). If the concept suits the class criteria but cannot be located inside the ontology, it is returned as an "AUTO" detection. For the purpose of ontology linking this is not optimal, as it does not give us access to the additional information stored in the ontology's individual information. The ontology's individual information is a set of predefined relationships and properties that an individual concept has. For example, if the individual "Africa" is defined inside the ontology, the individual's data would include its size, the countries on the continent, its population, and its climates, among others. This gives us reliable information about a certain concept, allowing for more contextual understanding.

To solve this problem, we took the approach of creating "buffer" classes, where a given ontology class is used to generate three classes describing the different occurrences of that class, each with a description that provides sufficient context for OntoGPT to separate same-class concepts into different entities. The corrected response is showcased in Listing 2:

Listing 2: Example of a corrected response
    extracted_object:
      continent: GEO:000000340
      continent_2: GEO:000000342
    named_entities:
      - id: GEO:000000340
        label: Africa
      - id: GEO:000000342
        label: Europe
While this approach deals with a high percentage of problems of this type, it does not cover the cases where more than three same-class concepts appear in the piece of text being analyzed.

6 CONCLUSIONS
Using OntoGPT in the Graph Massivizer strategic foresight project will prove valuable for enriching causal graphs with linked ontology data, aiming to improve accuracy in predicting future events. Despite OntoGPT's initial focus on medical data, some custom adaptations were successfully implemented to suit a portion of different domains. However, limitations persist in distinguishing between multiple instances of the same concept class. These challenges highlight the need for further development to enhance the tool's versatility across a broader array of applications and ontologies.

ACKNOWLEDGMENTS
The Slovenian Research Agency supported this work. This research was developed as part of the Graph-Massivizer project, funded under the Horizon Europe research and innovation program of the European Union under grant agreement 101093202.

REFERENCES
[1] Patrick Brandtner and Marius Mates. 2021. Artificial intelligence in strategic foresight: current practices and future application potentials. In Proceedings of the 2021 12th International Conference on E-business, Management and Economics, 75–81.
[2] J. Harry Caufield, Harshad Hegde, Vincent Emonet, Nomi L. Harris, Marcin P. Joachimiak, Nicolas Matentzoglu, HyeongSik Kim, Sierra Moxon, Justin T. Reese, Melissa A. Haendel, et al. 2024. Structured prompt interrogation and recursive extraction of semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 40, 3 (2024), btae104.
[3] Fatma Özcan, Chuan Lei, Abdul Quamar, and Vasilis Efthymiou. 2021. Semantic enrichment of data for AI applications. In Proceedings of the Fifth Workshop on Data Management for End-To-End Machine Learning, 1–7.
[4] David Sarpong and Nicholas O'Regan. 2014. The organizing dimensions of strategic foresight in high-velocity environments. Strategic Change 23, 3-4 (2014), 125–132.

Measuring and Modeling CO2 Emissions in Machine Learning Processes

Ivo Hrib, Jožef Stefan Institute, Ljubljana, Slovenia, ivo.hrib@gmail.com
Oleksandra Topal, Jožef Stefan Institute, Ljubljana, Slovenia, Oleksandra.Topal@ijs.si
Jan Šturm, Jožef Stefan Institute, Ljubljana, Slovenia, jan.sturm@ijs.si
Maja Škrjanc, Jožef Stefan Institute, Ljubljana, Slovenia, maja.skrjanc@ijs.si

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.23

Abstract
With the rapid expansion of the computing industry, efficient energy utilization and reduction of CO2 emissions are critically important. This research develops analytical tools to predict CO2 emissions from various machine learning processes. We present a novel methodology for data acquisition and analysis of CO2 emissions during model training and testing. Our results demonstrate the environmental impact of different algorithms and provide insights into optimizing energy consumption in artificial intelligence applications.

Keywords
CO2 Emissions, Machine Learning, Energy Consumption, Environmental Impact, AI Model Optimization, Green AI, Sustainable Computing, Carbon Footprint
1 Introduction
The global computing industry significantly contributes to CO2 emissions, with data centers accounting for 2.5 to 3.7 percent of global greenhouse gas emissions [1]. These emissions exceed those of the aviation industry due to continuous operations and heavy reliance on fossil fuels [11]. Given the growing demand for artificial intelligence (AI) applications, there is an urgent need for CO2-conscious solutions.

This research aims to develop tools for predicting the CO2 emissions associated with machine learning processes, thus enabling the reduction of the environmental impact of AI models. In collaboration with Eviden (Spain) and under the FAME EU project, we have developed a CO2 emissions analysis system using tools like CodeCarbon [2] and eco2AI [3].

1.1 Research Goals
The primary goal of this research is to develop a service that predicts the CO2 emissions and power consumption of different machine learning models during both training and evaluation phases, with emphasis on hyperparameter dependency. The CO2 emissions are measured in kilograms per second (kg/s), while the power consumption is measured in kilowatt-hours (kWh).

While existing services, such as CodeCarbon [2] or eco2AI [3], provide real-time measurement of emissions, they do not offer insights into a model's emissions before its construction or use. The service we aim to provide addresses this gap by offering an estimation of emissions and power consumption for different models before they are selected for specific use cases. This forward-looking approach allows for more informed decisions when choosing models, potentially reducing their environmental footprint.

2 Related Work
The environmental impact of machine learning models has been a growing concern in recent years. Several studies have focused on quantifying and reducing the carbon footprint of artificial intelligence (AI) processes. For instance, [12] highlighted the energy consumption of training large neural models and suggested methods for minimizing emissions. Similarly, tools like CodeCarbon [2] and eco2AI [3] have emerged to measure real-time CO2 emissions from computational tasks. However, these tools often lack predictive capabilities for assessing emissions before model selection. Our work builds on these existing methodologies, specifically on the work of eco2AI [3], by providing a forward-looking approach that estimates emissions during the model selection phase, thus complementing real-time monitoring tools. This is achieved through heavy dependency on eco2AI's [3] measuring systems for data collection, later used for modeling based on the collected data and the registered hyperparameters.

2.1 Research Gap and Contribution
Despite the growing availability of tools like CodeCarbon [2] and eco2AI [3], a significant gap remains in the preemptive evaluation of environmental impact during the machine learning (ML) model selection phase. The mentioned tools are valuable for post hoc analyses but do not assist ML practitioners in making informed decisions upfront, before model development, on the environmental footprint of different model architectures or hyperparameters.

This gap is crucial, as the model selection phase often involves trial-and-error across multiple models and configurations, potentially leading to unnecessary resource consumption. Without predictive capabilities, practitioners have limited insight into which models will have the lowest environmental impact before engaging in resource-intensive training.

Our research aims to fill this gap by introducing a predictive service that estimates the environmental footprint of different ML models before they are trained or used. This service leverages the data collected from existing tools like eco2AI [3], incorporating key features such as hyperparameters and model architecture into predictive models. By doing so, we enable developers to make more sustainable choices at the model selection stage, reducing carbon emissions from the start of the ML lifecycle.

Table 1 below presents a feature matrix comparing our proposed service with current tools, showing how our approach addresses unmet needs.

Table 1: Feature comparison of existing tools and the proposed service

Tool | Platform | Model coverage | Metric granularity | Carbon metrics | Energy metrics | Additional features | Real-time measurement | Forward-looking prediction
CodeCarbon | Cloud, on-premise | All ML models | Per training session | CO2 emissions (kg) | Energy consumption (kWh) | Dashboard visualization | Yes | No
eco2AI | Cloud, on-premise | All ML models | Per training session | CO2 emissions (kg) | Energy consumption (kWh) | Not RAPL-based | Yes | No
Proposed service | On-premise | Specific models (listed below) | Per model, per selection phase | CO2 emissions (kg/s) | Energy consumption (kWh) | Predictive modeling | No | Yes
3 Methodology
Due to the lack of suitable data on the CO2 emissions of machine learning models, we began by developing an infrastructure for data collection. This infrastructure is composed of the following steps:

• Dataset Generation: Creating synthetic datasets using random data generation methods.
• Data Preprocessing: Cleaning and preparing the data for analysis.
• CO2 Emission Measurement: Recording CO2 emissions during both training and testing phases using different machine learning algorithms.
• Feature Extraction: Extracting relevant features such as project ID, experiment details, epoch duration, power consumption, and hardware configurations.
• Adding Hyperparameters to Final Dataset: Documenting the hyperparameters used in each experiment to assess their impact on emissions.
• Containerization: Utilizing Docker for containerization to ensure reproducibility and scalability of the experiments.
• Data Storage: Storing all datasets, features, and emission records systematically in a database for further analysis.
• Modeling: Developing and training machine learning models to predict CO2 emissions and power consumption.

The software implementation uses Python, with dependencies including pandas [7], scikit-learn [10], matplotlib [5], eco2AI [3], TensorFlow, Keras, and Docker for containerization.

3.1 Dataset Generation
In this step, we created a synthetic dataset by generating random data points using tools like sklearn.datasets.make_regression or make_classification. The primary objective here is not to reflect real-world data scenarios but to produce a controlled environment where the focus is on measuring CO2 emissions and power consumption during model training and evaluation. The generated datasets vary in size from 250 to 15000 samples and from 5 to 2000 features. In classification cases, the number of classes additionally ranges from 2 to 50. These parameter ranges were selected to mitigate the risk of computational overload, ensuring that the experiments remain feasible within the available computational resources while maintaining the integrity of the analysis.

3.2 Data Preprocessing
Before analysis, the dataset must be cleaned and prepared. This includes handling missing values, normalizing or standardizing data, encoding categorical variables, and splitting the data into training and testing sets. Proper preprocessing ensures that the data is in the optimal format for the models to learn from and minimizes biases that may affect model performance and emission measurements.

3.3 CO2 Emission Measurement
We measure the CO2 emissions produced during both the training and testing phases of the machine learning models. This involves using tools like eco2AI [3] to track energy consumption and convert it into equivalent CO2 emissions. The measurements are taken for various models, such as Decision Trees, Random Forests, Logistic Regression, and Neural Networks, to assess their environmental impact under different computational loads.
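To make the dataset generation and measurement steps concrete, the following minimal sketch runs one synthetic experiment under eco2AI's Tracker (eco2ai.Tracker with start()/stop()). The dataset sizes follow the ranges reported above, while the project name, experiment description, and output file name are illustrative assumptions rather than the project's actual configuration:

    # A minimal sketch of a single synthetic measurement run (illustrative).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    import eco2ai

    # Synthetic data within the reported ranges (250-15000 samples, 5-2000 features, 2-50 classes).
    X, y = make_classification(n_samples=5000, n_features=200, n_classes=10,
                               n_informative=50, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    tracker = eco2ai.Tracker(project_name="co2_experiments",
                             experiment_description="RFC, 5000x200, 10 classes",
                             file_name="emission.csv")
    tracker.start()                    # measure the training phase
    RandomForestClassifier().fit(X_train, y_train)
    tracker.stop()                     # appends CO2 (kg) and energy (kWh) records to emission.csv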
3.4 Feature Extraction
To gain deeper insights, we extract various features that could impact CO2 emissions and energy consumption. These features include project identifiers, detailed descriptions of each experiment, the duration of each training epoch, power consumption metrics, hardware configurations (such as the type of CPU/GPU used), and hyperparameters. The project identifiers are unique alphanumeric codes assigned to each machine learning experiment upon execution. These identifiers help differentiate between various model configurations and experimental setups. They are generated and stored automatically by our system during the dataset generation process to ensure traceability and reproducibility of the experiments.

3.5 Adding Hyperparameters to Final Dataset
We document the hyperparameters used in each machine learning experiment, such as learning rates, batch sizes, and the number of layers in neural networks. This allows us to evaluate how these hyperparameters influence CO2 emissions and energy consumption.

3.6 Containerization
To ensure reproducibility and scalability of our experiments, we employ Docker for containerization. This approach encapsulates the code, dependencies, and environment settings, allowing the experiments to be easily replicated and deployed across different platforms.

3.7 Data Storage
All datasets, extracted features, hyperparameter configurations, and CO2 emission records are systematically stored in a database. This central repository facilitates efficient querying, retrieval, and analysis of data to support ongoing and future research.

3.8 Modeling
In this step, we develop and train machine learning models to predict CO2 emissions and power consumption based on various features, such as the type of algorithm used, the hardware configuration, and the model parameters. This modeling allows us to estimate emissions for different machine learning workflows before their actual deployment. The models help identify the most efficient algorithms and configurations, thus guiding the selection of environmentally friendly AI solutions.

The general pipeline for the previously mentioned steps is shown in Figure 1, and a more detailed view of a single measurement run is shown in Figure 2.

Figure 1: General Measurement Pipeline
Figure 2: Single Model Measurement Pipeline
4 Model Architecture
In this section, we explain the architecture of the model used for predicting CO2 emissions and power consumption based on various features such as CPU type, GPU type, region, and other experiment-specific details. The model implementation is encapsulated within a Python class named MultiModel, which is responsible for managing the entire process from data preprocessing to training and prediction.

The model employs two separate neural networks, one for predicting CO2 emissions and one for power consumption. The architecture of each neural network is as follows:

• Input Layer: Receives the scaled and encoded features.
• Hidden Layers: Consist of multiple Dense layers with ReLU activation functions. The CO2 emissions model includes three hidden layers with 128, 64, and 128 neurons, respectively, while the power consumption model has three hidden layers with 64, 64, and 128 neurons.
• Output Layer: A single neuron that outputs the predicted value for either CO2 emissions or power consumption.

4.1 Model Training
The model is compiled using the Adam optimizer [6] and the Mean Squared Error (MSE) loss function. Since we were unable to gather adequate real-time data on environmental factors that may influence our predictions (e.g., the distribution of energy sources or the real-time CO2 per kWh), our model relies on static yearly averages of these values [8][9]. The model uses the aforementioned features for regression, with the goal of predicting the power consumption and CO2 emissions gathered by the previously mentioned random tests. Each model is trained for 25 epochs using the preprocessed data. After training, the models, along with their respective scalers and encoders, are saved to disk for later use.

4.2 Prediction
Once trained, the model can predict CO2 emissions and power consumption for new data points by loading the appropriate model, scaler, and one-hot encoder. The input data is preprocessed in the same manner as during training, and the predictions are obtained by applying the trained models. This modular approach allows for easy extension to additional models or data sources and provides a scalable solution for analyzing the environmental impact of machine learning processes.
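A minimal sketch of the CO2 emissions network described above is shown below, assuming standard Keras APIs. The placeholder data, the preprocessing, and the persistence details are simplifications, not the actual MultiModel implementation:

    # A minimal sketch of the CO2 emissions regressor (illustrative placeholder data).
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from tensorflow import keras

    def build_co2_model(n_features):
        model = keras.Sequential([
            keras.layers.Input(shape=(n_features,)),     # scaled and encoded features
            keras.layers.Dense(128, activation="relu"),  # hidden layers: 128, 64, 128
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(1),                       # predicted CO2 emissions (kg/s)
        ])
        model.compile(optimizer="adam", loss="mse")      # Adam + MSE, as in Section 4.1
        return model

    # Illustrative training run on random placeholder data.
    X = np.random.rand(1000, 20).astype("float32")
    y = np.random.rand(1000).astype("float32")
    scaler = StandardScaler()
    model = build_co2_model(X.shape[1])
    model.fit(scaler.fit_transform(X), y, epochs=25, verbose=0)
    model.save("co2_model.keras")  # the scaler/encoder would be saved alongside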
5 Web Application Interface for CO2 Emissions and Power Consumption Prediction
In addition to the backend model developed for predicting the CO2 emissions and power consumption of various AI models, a web application was created to provide a user-friendly interface for real-time predictions. The web app, as shown in Figure 3, allows users to select different machine learning models and configure parameters to estimate the associated environmental impacts.

Figure 3: Web App Interface

5.1 Key Features of the Web Application
The web application interface is designed with simplicity and functionality in mind. It includes several key components:

• Model Selection: Users can choose the type of machine learning model they are interested in evaluating: Logistic Regression (LogR), Decision Tree Classifier (DTC), Decision Tree Regression (DTR), Neural Network Classifier (NNC), Neural Network Regression (NNR), Linear Regression (LinR), Random Forest Classifier (RFC), or Random Forest Regression (RFR). The dropdown menu in the upper-left corner of the interface provides a list of available models.
• Model Parameters Configuration: A section labeled "Model Parameters" allows users to specify various inputs:
  – Train or Evaluate: Users can choose whether to estimate emissions for the training or evaluation phase of the model.
  – Dataset Samples and Features: Input fields are provided for users to define the size of the dataset in terms of the number of samples and features.
  – CPU and GPU Specifications: The app allows the selection of the CPU and GPU type, reflecting different hardware configurations, such as "Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz/1 device(s), TDP:205.0" or "AMD Ryzen 7 4800H with Radeon Graphics/1 device(s), TDP:45.0".
  – Region/Country Selection: A dropdown to select the geographic location where the model is being executed, which influences the CO2 emissions based on local energy sources.
• Real-Time Predictions: Once all parameters are configured, the application dynamically calculates and displays:
  – CO2 Emissions: The predicted emissions are shown in kilograms per second (kg/s).
  – Power Consumption: The power consumption is provided in kilowatt-hours (kWh).
• Electricity Source Distribution: A graphical representation is provided for the distribution of electricity sources, such as coal, gas, and oil, in the selected region. This information is crucial for understanding the environmental impact of power consumption based on the local energy mix.

5.2 User Experience and Accessibility
The web application is developed with accessibility in mind, ensuring that users, regardless of technical background, can interact with the model's predictive capabilities. By offering a clear and intuitive interface, it aims to make the process of estimating CO2 emissions and power consumption transparent and straightforward. Figure 3 illustrates the application's main screen, where the model type, parameters, and results are all visible at a glance. This real-time feedback loop allows users to make informed decisions based on the predicted environmental impact.
6 Results
6.1 Model Error
To evaluate the performance and accuracy of the models, we conducted a 10-fold cross-validation to estimate the errors in predicting CO2 emissions and power consumption. The results are presented in Table 2. The errors for both CO2 emissions and power consumption were computed for both the training and evaluation phases of each model type.

Note: In this context, "Train." refers not to the error on the training set, but to the error made by our model in predicting the CO2 emissions / power consumption during the training phase of the listed model. Similarly, "Eval." refers not to the error on the evaluation set, but to the error made by our model in predicting the CO2 emissions / power consumption when the listed model makes predictions. This distinction is crucial to understanding the results accurately.

Table 2: Model Scaled Error Estimates from 10-Fold Cross-Validation

Model | Phase | CO2 Error | Power Error
DTC | Eval. | 0.0036 | 0.0043
DTC | Train. | 0.0631 | 0.0649
DTR | Eval. | 0.0032 | 0.0034
DTR | Train. | 0.0133 | 0.0517
RFC | Eval. | 0.0094 | 0.0098
RFC | Train. | 0.3242 | 0.3582
RFR | Eval. | 0.0087 | 0.0081
RFR | Train. | 0.2565 | 0.2779
LogR | Eval. | 0.0063 | 0.0057
LogR | Train. | 0.0055 | 0.0043
LinR | Eval. | 0.0099 | 0.0105
LinR | Train. | 0.0104 | 0.0095
NNC | Eval. | 0.0018 | 0.0030
NNC | Train. | 0.1083 | 0.1216
NNR | Eval. | 0.0045 | 0.0112
NNR | Train. | 0.1051 | 0.1008

Based on the results obtained through the 10-fold cross-validation, it is evident that model performance varies significantly across different algorithms and phases. One notable observation is that the errors in predicting CO2 emissions and power consumption are relatively higher during the training phases, particularly for more complex models like Neural Networks and Random Forests [4].

This discrepancy in model performance can be attributed to the sparsity of the data collected during the measurement phase. The limited data points lead to substantial gaps in the attribute space covered by the models, resulting in erratic behavior when predicting outside these ranges. Consequently, the models show diminished accuracy and reliability when confronted with input configurations that fall beyond the scope of the original data.

Future research should focus on enhancing the robustness of these models by expanding the dataset to include a broader range of scenarios and conditions. This would help mitigate the effects of sparsity and improve the models' generalizability, ensuring more reliable predictions across diverse settings.
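For illustration, the following sketch shows one way such 10-fold cross-validated error estimates can be computed with scikit-learn. The data and regressor here are placeholders, not the study's actual measurement dataset or MultiModel networks:

    # A minimal sketch of a 10-fold CV error estimate (illustrative placeholders).
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    X = np.random.rand(500, 10)   # placeholder measurement features
    y = np.random.rand(500)      # placeholder CO2 emission targets
    scores = cross_val_score(MLPRegressor(max_iter=500), X, y,
                             scoring="neg_mean_squared_error", cv=10)
    print("CO2 error estimate:", -scores.mean())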
6.2 CO2 Emission Analysis Across Different Models
Figure 4 provides a comparative analysis of the mean CO2 emissions generated by different machine learning models during their operation, represented on a logarithmic scale to accommodate the wide range of emission values.

Figure 4: Logarithmically scaled mean emissions across different models

The chart highlights significant variations in CO2 emissions among models, with the Neural Network Classifier and Neural Network Regressor exhibiting the highest emissions by a considerable margin. This is expected due to the intensive computational requirements and numerous parameters these models necessitate, resulting in elevated power consumption and consequently higher CO2 output.

In contrast, simpler models like Logistic Regression, Linear Regression, and Decision Tree models show substantially lower CO2 emissions, reflecting their reduced computational complexity and lower resource demand.

Interestingly, the Random Forest models, particularly the Regressor, present moderate emissions, illustrating that even ensemble methods, which typically involve training multiple decision trees, can maintain reasonable emission levels depending on their configuration.

This analysis underscores the importance of model selection not only for performance but also for minimizing environmental impact, particularly when scaling up operations or deploying in resource-constrained settings.

7 Discussion
The results highlight the significant environmental impact of training complex AI models, particularly neural networks. The variability in emissions suggests that optimizing model hyperparameters and selecting appropriate hardware configurations can reduce CO2 output. Future research should focus on model improvement for better and more accurate prediction, on expanding the range of algorithms studied, and on intensive data collection to fill the gaps in the training data.

8 Limitations
This study presents several limitations, particularly regarding the data, the model evaluation, and the hardware configurations, which must be considered when interpreting the results.

8.1 Training Duration and Model Learning
The models were trained for a fixed number of epochs (e.g., 10 or 20), prioritizing computational cost over learning performance. The focus was on estimating CO2 emissions rather than model accuracy or convergence, meaning the models may not have fully captured patterns in the data. As such, the reported emissions reflect standardized training durations (with an upper limit for computational efficiency), not optimized learning outcomes.

8.2 Lack of Meaningful Learning Objective
The use of randomly generated data limits the evaluation of model learning. Since the data lacked inherent structure, the models' ability to learn was not assessed. Instead, the models were primarily evaluated on their resource consumption during training, reducing the focus on generalization or predictive power.

8.3 Hardware and Software Considerations
The experiments were conducted on specific hardware (e.g., GPU/CPU configurations), and variations in hardware were not examined. Different hardware setups, especially energy-efficient systems, could significantly impact CO2 emissions and energy consumption. Therefore, the findings may not generalize across all hardware environments. However, we would like to point out that this was due to a lack of infrastructure for broader experimentation.

9 Future Work
Future research should incorporate real-world datasets, optimize hyperparameters, and evaluate diverse hardware configurations to extend these findings to broader machine learning scenarios. The exploration of more complex architectures and learning objectives will provide a deeper understanding of the trade-offs between performance and environmental impact.
10 Conclusion
Our study presents a methodology for monitoring and analyzing CO2 emissions during machine learning processes. The findings demonstrate that different machine learning models exhibit significant variability in their energy consumption and CO2 emissions, with complex models like neural networks having a higher environmental impact. By providing predictive insights into these emissions, our approach enables more informed decision-making during model selection, thus contributing to the broader goal of reducing the carbon footprint of AI applications.

Future work will focus on expanding the dataset to include more diverse models and configurations. Additionally, we plan to integrate real-time monitoring tools to compare predictions with actual emissions, further refining our predictive capabilities. Moreover, optimizing model hyperparameters and exploring alternative, more sustainable hardware configurations will be key areas of investigation for minimizing the environmental impact of machine learning workflows.

Acknowledgements
This work was supported by the FAME project, funded by the European Union's Horizon 2023 Research and Innovation Programme under grant agreement No 101092639.

References
[1] Climatiq. 2023. Climatiq: Emissions intelligence platform. Provides data on the carbon emissions of various activities, including computing. https://www.climatiq.io/.
[2] CodeCarbon Development Team. 2023. CodeCarbon: An open source tool for tracking the carbon emissions of machine learning experiments. https://github.com/mlco2/codecarbon.
[3] Eco2AI Development Team. 2023. Eco2AI: Real-time CO2 emission tracking for machine learning. https://github.com/sb-ai-lab/Eco2AI.
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[5] John D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9, 3, 90–95.
[6] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[7] Wes McKinney. 2010. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference. Vol. 445, 51–56.
[8] Our World in Data. [n. d.]. https://ourworldindata.org/grapher/carbon-intensity-electricity?tab=table.
[9] Our World in Data. [n. d.]. https://ourworldindata.org/electricity-mix.
[10] Fabian Pedregosa et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[11] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. arXiv preprint arXiv:1907.10597. doi: 10.48550/arXiv.1907.10597.
[12] Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 3645–3650.

Enhancing Ontology Engineering with LLMs: From Search to Active Learning Extensions

Ganna Kholmska, Jožef Stefan Institute, Ljubljana, Slovenia, anna.kholmska@gmail.com
Klemen Kenda, Jožef Stefan Institute, Ljubljana, Slovenia, klemen.kenda@ijs.si
Joze Rozanec, Jožef Stefan Institute, Ljubljana, Slovenia, joze.rozanec@ijs.si

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.28

Abstract
This paper explores the use of LLMs in ontology engineering within the HumAIne project, focusing on the discovery, analysis, and extension of ontologies in Data Mining, Machine Learning, and manufacturing. The methodology leverages fine-tuned prompts and combines LLMs with traditional tools like Protege for validation. A multi-LLM approach improved domain-specific concept coverage and reduced errors, though challenges remain in addressing deep domain-specific gaps and ensuring logical consistency.

Keywords
LLMs, Ontology Engineering, Active Learning, Data Mining, Machine Learning, Ontology Selection, Ontology Extension

1 Introduction
The HumAIne project, funded by the European Commission under the Horizon Europe program, aims to develop a platform integrating advanced AI paradigms such as Active Learning (AL), Neuro-Symbolic AI, Swarm Learning, and Explainable AI. This platform is designed to enhance human-AI collaboration in dynamic, unstructured environments, with applications spanning healthcare, manufacturing, finance, energy grids, and smart cities. Its primary goal is to support decision-making by combining human expertise with AI capabilities.

One of the project's key challenges is developing multiple ontologies that provide a structured framework for integrating domain-specific knowledge. This framework is essential for enhancing the clarity and reliability of AI-driven decisions, while ensuring adaptability across diverse applications. To construct these ontologies, we first explored publicly available ontologies relevant to the project's scope, then extended selected ones with concepts from HumAIne's AI paradigms, starting with Active Learning.

However, manual ontology construction is a complex, resource-intensive process that requires expertise across multiple domains and collaboration among stakeholders. Ensuring modularity, reusability, and scalability adds to this complexity.
Recent studies show that leveraging Large Language Models (LLMs) can streamline ontology construction by reducing manual effort and improving consistency and quality. For instance, [1] demonstrates semi-automatic knowledge graph construction using open-source LLMs, while [2] proposes methods for automatic concept hierarchy generation through LLM queries. Building on this research, this paper contributes a methodology that integrates LLMs with traditional tools like Protege to streamline the discovery, analysis, and extension of ontologies. By employing a multi-LLM approach, we address challenges in domain-specific concept identification and ensure more consistent, accurate results in ontology development for fields like Data Mining, Machine Learning, and manufacturing.

2 LLM-Assisted Search and Analysis of Domain Ontologies
Our experimentation with methodologies and tools for efficient web search and ontology analysis in the Data Mining (DM), Machine Learning (ML), and manufacturing domains led to the development of the LLM-leveraging algorithm shown in Fig. 1. This algorithm uses carefully crafted prompts to guide LLMs in generating accurate, targeted queries. Before each step, the initial prompt is optimized through several iterations in a dialogue with the LLM to improve accuracy and relevance. Further details on the iterative query refinement process are provided in the Discussion section.

Step 1: Define the Search Objective. At this stage, LLMs like Bing Chat, Google's Bard, or ChatGPT with Web Browsing are employed to iteratively refine the search objectives initially formulated by the researcher, along with relevant keywords, phrases, and terms describing the ontologies or concepts of interest. For instance, our initial search objective for DM and ML ontologies was to "Find ontologies that offer up-to-date, detailed descriptions of the DM and ML domains, following best practices in ontology engineering." Keywords included "Active Learning" and "CRISP-DM standard."

Step 2: Formulate Search Queries Using LLMs. Based on the refined search objectives and keywords, and using a carefully crafted prompt, LLMs generate targeted search queries (an API-based sketch of this step appears at the end of this section). These queries are fine-tuned through feedback or early search results to maximize relevance and accuracy. For example, for a DM ontology, the LLM generated queries such as "Data Mining ontology for semi-supervised machine learning," which were further refined before finalizing the query.
Step 3: Conduct Web Search. This step involves real-time browsing tools like Copilot in Microsoft Edge (GPT-4) and Perplexity AI to execute searches and identify relevant sources. Our study prioritized high-quality sources like ontology repositories (e.g., BioPortal, OBO Foundry) and academic databases (Google Scholar, IEEE Xplore, ACM Digital Library). It is important to acknowledge that LLM-driven web searches are generally confined to public repositories and a limited range of academic databases. As a result, proprietary or lesser-indexed ontologies may require manual exploration to ensure a more thorough search.

Step 4: Retrieve and Summarize Information. LLMs (Google Bard, Copilot (GPT-4), Perplexity AI) were employed to extract and summarize key information from ontology descriptions found in publications, technical papers, and repository documentation identified during the search. Using a specifically tuned prompt, LLMs extracted 11 characteristics for each of the 34 identified DM and ML ontologies. These characteristics included purpose, availability, ontology metrics, reused ontologies, software editors, representation language, and evaluation methodologies. This structured data, organized in table format, provided valuable insights into each ontology's scope, quality, and reusability. From these results, we selected 6 ontologies for further exploration, prioritizing comprehensive coverage of DM and ML concepts, adherence to ontology engineering best practices, and alignment with established standards in these domains.

Step 5: Analyze and Evaluate Ontologies. LLMs were further utilized to assess the relevance, content, and structure of the selected ontologies. In our study of DM and ML ontologies, LLMs such as GPT-4, which can process, explain, and generate OWL and RDF code, were used alongside ontology tools like Protege. This combination ensured that the ontologies addressed relevant concepts and aligned with frameworks like CRISP-DM. GPT-4 helped significantly in bridging the gap between textual descriptions and formal ontology representations.

Step 6: Cross-Reference and Compare Findings. LLMs with contextual understanding were employed to integrate and refine information from multiple sources. For this task, ChatGPT (GPT-4) categorized 65 manufacturing ontologies, assessing them for relevance to process planning, standardization, industry adoption, interoperability, and support for advanced manufacturing concepts. Further exploration of the top 8 LLM-scored ontologies showed strong alignment with expert evaluations, but domain-specific tasks required carefully crafted prompts and human oversight for effectiveness.

Step 7: Provide Recommendations for Further Exploration. LLMs generated recommendations for the most suitable ontologies or areas for additional research based on the previous step's results. This includes identifying underexplored concepts and areas needing further investigation.

Step 8: Validate and Document Findings. The findings were manually validated for accuracy and relevance, then systematically documented. ChatGPT (GPT-4) was used to summarize and structure the documentation.

Step 9: Iterate and Refine Search (if needed). When results were too broad or irrelevant (e.g., Active Learning misinterpreted as an educational method), we refined the search prompt by adding more context.

By using this LLM-based algorithm, we conducted comprehensive web searches and extracted relevant information to identify the most suitable ontologies for the HumAIne project. In the DM and ML domains, we selected the OntoDM suite (OntoDM-Core, OntoDM-KDD, and OntoDT). For the manufacturing domain, we identified the Industrial Ontologies Foundry Core (IOF Core) as the best fit.

Figure 1: Key steps of LLM-leveraging algorithm
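The authors worked through chat interfaces rather than APIs, but an API-based equivalent of Step 2 could look like the following sketch; the client, model name, and prompt wording are illustrative assumptions, not the setup used in the study:

    # A hypothetical API-based sketch of Step 2 (query generation).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    objective = ("Find ontologies that offer up-to-date, detailed descriptions "
                 "of the DM and ML domains, following best practices in "
                 "ontology engineering.")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Generate five targeted web-search queries for "
                              f"this objective: {objective} "
                              f"Keywords: Active Learning, CRISP-DM standard."}],
    )
    print(response.choices[0].message.content)  # queries, refined further before use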
3 LLM-Assisted Ontology Extension with Active Learning Concepts
Integrating Active Learning (AL) into an ontology requires extending it with new classes, properties, and relationships representing key AL concepts. While traditional methods of building and extending ontologies are well documented, we leveraged GPT-4 for this task using iteratively refined prompts (see the Discussion section). This section outlines how LLMs, particularly GPT-4, were used to extend the IOF Core ontology with AL concepts.

Step 1: Define the Problem and Objectives. Through iteratively refined prompts, LLMs formulated clear objectives, specifying the domain (e.g., manufacturing) and key concepts (e.g., Active Learning). These outputs were used to guide further steps, with LLMs leveraging contextual understanding, knowledge synthesis, and language generation to suggest relevant AL applications such as adaptive scheduling. Queries like "How can Active Learning improve adaptive scheduling in manufacturing?" generated valuable insights into potential use cases where AL would be most beneficial.

Step 2: Analyze the Ontology to be Extended. By combining Protege's visualization and navigation tools with GPT-4's ability to process textual and machine-readable data (e.g., OWL/RDF), we thoroughly examined the IOF Core ontology structure and identified areas for introducing AL concepts. For example, GPT-4 helped uncover key classes like "Process," "Resource," and "PerformanceMetric" within IOF Core, highlighting relevant properties for AL integration. Queries such as "What aspects of IOF Core can benefit from AL integration?" and "What key concepts are missing from the IOF Core ontology for integrating Active Learning in manufacturing?" guided us in identifying areas for improvement, including handling uncertainty and adjusting dynamic processes.

Step 3: Identify Active Learning Concepts. The main tasks of this step and the role of LLMs in supporting each task are summarized in Table 1:
Table 1: LLM applications for identifying AL concepts

Task | LLM Application | Example Output
1. Identify fundamental AL concepts | Use LLMs to generate a list of core AL strategies and techniques | Concepts like "Uncertainty sampling" and "Query-by-committee"
2. Extract domain-specific AL concepts | Query LLMs about AL in specific industrial contexts | Concepts like "Query Efficiency" in decision-making for manufacturing
3. Mine AL concepts from literature | Process academic papers and reports to extract relevant AL terms | Concepts like "Stream-based selective sampling" from papers on AL in manufacturing
4. Assign properties to new classes | Generate properties for AL ontology classes | QueryStrategy class properties: "hasUncertaintySampling", "queryByCommittee"
5. Refine and validate | Ensure definitions, resolve overlaps based on standards | Refined and validated domain-specific terminology

By prompting, LLMs generated nearly 200 fundamental AL concepts, structuring them into a hierarchy by leveraging their vast training data. Additionally, LLMs helped generate explicit definitions, assisting in verifying and refining concepts. However, after a point, LLMs began repeating concepts or producing less relevant terms. LLMs were also effective in generating domain-specific concepts through targeted queries. For instance, querying AL in manufacturing led to concepts like "uncertainty management" and "query efficiency." More specialized concepts required extraction from academic papers, which were cross-referenced with existing standards in DM, ML, and manufacturing (e.g., CRISP-DM, IEEE 7000 Series, ISA-95, ISO 15531). Ontology learning tools like Text2Onto and OntoLearn were combined with LLMs for cross-verification.

Step 4: Develop Ontology Extensions. LLMs helped create AL-related classes, properties, and relationships based on the identified concepts, using OWL-compliant syntax (see Fig. 2). By combining GPT-4's knowledge synthesis with Protege's structural reasoning and consistency checking, we improved the efficiency and accuracy of reviewing, debugging, and validating OWL code.

Figure 2: Screenshot of LLM-generated code defining the "LearningAlgorithm" class with properties "trainingData" and "validationData"

Step 5: Ensure Semantic Consistency. LLMs, such as GPT-4, assisted in ensuring semantic consistency by reviewing new and existing ontology elements and suggesting how new concepts could align with the existing framework. For example, an LLM suggested how an AL "QueryStrategy" class fits within the IOF Core ontology.

Example Prompt: "Review the new QueryStrategy class and suggest how it can align with the existing classes in IOF Core."

LLM Output: The QueryStrategy class aligns with decision-making aspects of the Process concept. Strategies such as "UncertaintySampling," "QueryByCommittee," "ExpectedModelChange," and "ExpectedErrorReduction" can be viewed as specialized decision-making processes within the broader process framework of IOF Core.

However, LLMs cannot guarantee logical consistency and face limitations in handling complex relationships, making it necessary to use ontology reasoners such as HermiT (run, for example, from Protege) to perform consistency checks.

Step 6: Map to Existing Ontologies. LLMs, such as GPT-4, assist in generating initial mapping suggestions by analyzing similarities in definitions, relationships, and properties between new and existing concepts. This involves creating relationships like "owl:sameAs", "owl:equivalentClass", and "owl:equivalentProperty".

Example LLM Output:
    :FeedbackMechanism a owl:Class ;
        owl:equivalentClass :ControlSystem ;
        rdfs:label "Feedback Mechanism" ;
        rdfs:comment "Mechanisms that provide feedback in Active Learning to control systems." .

While LLMs are effective in identifying high-level similarities, they may face challenges with complex or domain-specific relationships, requiring further refinement. Although we did not encounter these issues during our initial work extending IOF Core with AL concepts, we used Protege's alignment plug-ins to refine LLM-generated mappings. For more complex mappings, tools like AgreementMaker or COMA can further refine the suggestions.
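Pulling Steps 5 and 6 together, the suggested alignment could be serialized in Turtle roughly as follows. This is a sketch only: the iof: namespace and all IRIs are placeholders, not the actual IOF Core identifiers or the project's extension file:

    @prefix :     <http://example.org/al#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix iof:  <http://example.org/iof-core#> .   # placeholder namespace

    :QueryStrategy a owl:Class ;
        rdfs:subClassOf iof:Process ;   # aligned with the Process concept, per the LLM output
        rdfs:comment "Decision-making process that selects which instances to query." .

    :UncertaintySampling    a owl:Class ; rdfs:subClassOf :QueryStrategy .
    :QueryByCommittee       a owl:Class ; rdfs:subClassOf :QueryStrategy .
    :ExpectedModelChange    a owl:Class ; rdfs:subClassOf :QueryStrategy .
    :ExpectedErrorReduction a owl:Class ; rdfs:subClassOf :QueryStrategy .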
Step 5: Ensure Semantic Consistency. LLMs, such as GPT-4, assisted in ensuring semantic consistency by reviewing new and existing ontology elements and suggesting how new concepts could align with the existing framework. For example, an LLM suggested how an AL "QueryStrategy" class fits within the IOF Core ontology.

Example prompt: "Review the new QueryStrategy class and suggest how it can align with the existing classes in IOF Core."

LLM output: The QueryStrategy class aligns with decision-making aspects of the Process concept. Strategies such as "UncertaintySampling," "QueryByCommittee," "ExpectedModelChange," and "ExpectedErrorReduction" can be viewed as specialized decision-making processes within the broader process framework of IOF Core.

However, LLMs cannot guarantee logical consistency and face limitations in handling complex relationships, making it necessary to run ontology reasoners such as HermiT (e.g., from within Protege) to perform consistency checks.

Step 6: Map to Existing Ontologies. LLMs, such as GPT-4, assist in generating initial mapping suggestions by analyzing similarities in definitions, relationships, and properties between new and existing concepts. This involves creating relationships like "owl:sameAs," "owl:equivalentClass," and "owl:equivalentProperty".

Example LLM output:

    :FeedbackMechanism a owl:Class ;
        owl:equivalentClass :ControlSystem ;
        rdfs:label "Feedback Mechanism" ;
        rdfs:comment "Mechanisms that provide feedback in Active Learning to control systems." .

While LLMs are effective in identifying high-level similarities, they may face challenges with complex or domain-specific relationships, requiring further refinement. Although we did not encounter these issues during our initial work extending IOF Core with AL concepts, we used Protege's alignment plug-ins to refine LLM-generated mappings. For more complex mappings, tools like AgreementMaker or COMA can further refine the suggestions.

Step 7: Prototype and Test. LLMs, such as GPT-4, were prompted to generate validation scenarios, competency questions, and SPARQL queries based on the integrated AL concepts. For instance, a prompt like "Suggest validation scenarios for adaptive scheduling with Active Learning" helped us produce realistic test cases, including prototype code, descriptions of the initial setup, process flows, validation steps, and queries based on the newly integrated concepts. SPARQL queries generated by LLMs were executed in Protege with SPARQL plug-ins to assess the ontology's ability to retrieve relevant information and answer competency questions. However, some LLM-generated scenarios revealed limitations in domain-specific knowledge, resulting in generic outputs that required refinement. Additionally, LLMs struggled with modeling intricate relationships or complex data retrieval conditions, making human oversight essential for ensuring accuracy and thorough testing.
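As an illustration of how such a competency question can also be checked outside Protege, the sketch below runs an LLM-style SPARQL query over the extended ontology with rdflib. The file name, namespace, and query are hypothetical stand-ins for the paper's actual artifacts.

    # Illustrative sketch: executing a competency-question query with rdflib.
    # "iof_core_al.owl" and the namespace are placeholder assumptions.
    from rdflib import Graph

    g = Graph()
    g.parse("iof_core_al.owl")  # the extended IOF Core ontology (hypothetical file)

    # Competency question: which query strategies are defined as
    # specializations of QueryStrategy?
    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX al:   <http://example.org/iof-core-al#>
        SELECT ?strategy ?label WHERE {
            ?strategy rdfs:subClassOf al:QueryStrategy ;
                      rdfs:label ?label .
        }
    """
    for row in g.query(query):
        print(row.strategy, row.label)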
Step 8: Iterative Refinement. Following initial prototyping and testing, we gathered feedback from domain experts and users to further refine the ontology. Validation reports were uploaded to AskPDF Research Assistant (GPT-4), where LLMs reviewed the reports, extracted key improvement suggestions, and refined task lists. The LLM provided insights into areas where relationships or properties required adjustment and identified additional concepts that might have been overlooked.

Step 9: Document and Disseminate. LLMs like ChatGPT or Bard were instrumental in generating comprehensive documentation, including details on the ontology extensions. Additionally, LLMs contributed to drafting technical reports and research papers.

Using this methodology, we successfully extended the IOF Core ontology with Active Learning (AL) concepts. Future stages of the HumAIne project will focus on further validation and refinement, particularly during pilot case implementations.

4 Discussion

This study highlights LLMs' potential in ontology engineering by reducing manual effort and increasing efficiency. LLMs rapidly identified key ontologies like OntoDM and IOF Core and generated structured classes, properties, and relationships, reducing the need for manual OWL/RDF code generation and concept mapping. However, LLMs face challenges in domain-specific precision, requiring human oversight to refine outputs and address nuances in specialized fields. While tools like Protege excel at ensuring logical consistency, LLMs offer dynamic capabilities for generating new concepts and relationships. Despite these advantages, traditional tools like AgreementMaker and COMA are still necessary to refine and validate LLM-generated mappings.

One strategy to mitigate LLM limitations was iterative prompt engineering. We refined prompts for ontology search and extension tasks through multiple cycles of improvement. These cycles, with LLMs like GPT-4, involved clarifying questions, refining queries, and generating more focused outputs. An initial prompt for starting the cycle can be the following:

"Your role is my Prompt Creator. Your goal is to craft the best possible prompt for my needs. The prompt will be used by you, [LLM's name]. I want to write about: [keyword/topic]. Based on my input, you will now generate 3 sections: a) Revised prompt (clear, concise, and easily understood by you), b) Suggestions (on what details to include in the prompt to improve it), and c) Questions (ask any relevant questions to improve the prompt). We will continue this iterative process, with me providing additional information to you and you updating the prompt until it is complete."

After 4-5 cycles, the prompts were highly optimized, ensuring relevant outputs. This refinement process reduced inconsistencies and improved LLM-generated content across both the search and extension phases.

We integrated multiple LLMs, including Bing Chat (GPT-4), Google's Bard, and Perplexity AI, to cross-validate outputs, reducing errors and refining results. This ensured consistency in LLM-generated ontologies and mappings. To evaluate this multi-LLM approach, we propose the following metrics: Inter-Model Consistency (measures alignment between LLM outputs), Error Rate Reduction (tracks how often one LLM corrects another's errors), and Coverage of Relevant Concepts (assesses LLMs' ability to capture domain-specific concepts). Although these metrics provide a framework, formal measurements are yet to be implemented. Future stages will involve applying these metrics to validate ontology outputs and testing the extended ontologies in real-world applications. This hybrid method combines LLMs and traditional tools, ensuring both efficiency and accuracy in scalable ontology development.

5 Conclusions

This study demonstrates how LLMs can streamline ontology engineering by automating the search, analysis, and extension of domain-specific ontologies. Leveraging multiple LLMs, we successfully identified and extended key ontologies, including OntoDM and IOF Core, for the HumAIne project, improving efficiency in generating classes, properties, and relationships. While LLMs significantly enhance the process, they face challenges in domain-specific precision and require human oversight, particularly for complex relationships. Traditional tools like Protege and ontology reasoners remain critical for ensuring logical consistency and validation. Future work will focus on refining these extended ontologies through real-world pilot tests and applying evaluation metrics to LLM-generated outputs. This hybrid approach, combining LLM automation with traditional validation tools, offers a scalable solution that balances efficiency with the need for human expertise.

Acknowledgments

This work was supported by the European Commission under the Horizon Europe project HumAIne, Grant Agreement No. 101120218.

References

[1] Kommineni, Vamsi Krishna, Birgitta König-Ries and Sheeba Samuel. "From human experts to machines: An LLM supported approach to ontology and knowledge graph construction." ArXiv abs/2403.08345 (2024). DOI: https://doi.org/10.48550/arXiv.2403.08345
[2] Funk, Maurice, Simon Hosemann, Jean Christoph Jung and Carsten Lutz. "Towards Ontology Construction with Language Models." ArXiv abs/2309.09898 (2023). DOI: https://doi.org/10.48550/arXiv.2309.09898
On the Brazilian Observatory for Artificial Intelligence

Rafael Meira Silva, Luiz Costa, Alexandre Barbosa, Joao Paulo Candia Vieira, Joao Pita Costa, Cristina Godoy Oliveira
CETIC, OBIA, São Paulo, Brazil; CIAAM, C4AI, Univ. of São Paulo, São Paulo, Brazil; IRCAI, Quintelligence, Ljubljana, Slovenia
rafael@meirasilva.com.br, tuca@nic.br, alexandre@nic.br, candia@usp.br, Joao.pitacosta@quintelligence.com, cristinagodoy@usp.br

ABSTRACT
Artificial Intelligence (AI) is rapidly transforming industries and economies worldwide, with Brazil and South America emerging as significant players in this global shift. The fundamental need to monitor the impact of AI across verticals (for sustainable development, government engagement, investment, and society at large) motivated the Brazilian Artificial Intelligence Observatory (OBIA). OBIA is an integral part of the Brazilian Artificial Intelligence Plan (PBIA) and a former objective of the Brazilian Strategy of AI; it aims to become the leading platform for monitoring the uses of AI in the country. OBIA belongs to Axis 5 of the PBIA, focused on supporting the regulatory and governance process of AI. This paper explores the current state, challenges, and potential of AI development in the region, examining how technological advancements are influencing economic growth, societal change, and policy-making across South America, with a particular focus on Brazil as a leading hub of innovation. It also investigates common aspects of the research agendas shared with IRCAI's SDG Observatory, particularly regarding machine learning workflows and approaches complementing traditional and crowdsourced heterogeneous data collection and analysis.

KEYWORDS
Artificial Intelligence, Observatory, Survey Data Analysis, Complex Data Visualization, Multidisciplinary Collaboration.

1 Introduction

AI is increasingly shaping the economic landscape and societal dynamics across Brazil and South America, positioning the region as a growing hub for technological innovation. Despite challenges such as uneven infrastructure and regulatory hurdles, Brazil is making significant strides in AI research and development, contributing to the regulation and a better understanding of the impact of AI in South America. OBIA [5] answers this need, serving as a platform to support the strategy and other government actions with data on the uses and impacts of AI (see Figure 1).

Figure 1: Screenshot of OBIA showing some results on the preparedness of Brazilian industry to adopt AI workflows.

The objectives of OBIA include compiling, recording, and providing information related to Artificial Intelligence in Brazil, enabling analyses of its adoption and its main impacts on society. OBIA also has the mission of consolidating and disseminating knowledge about the repercussions of this technology, providing support to guide policies, strategies, and actions promoting the development and responsible use of AI. The observatory gathers Brazilian data on the use and adoption of Artificial Intelligence by different sectors, such as education, business, government, and health (see Figure 2). The currently available indicators rely mostly on traditional data sources, such as surveys and data sets made available to the team. The first product of OBIA is the book "Artificial Intelligence in Healthcare - Potentialities, Risks and Perspectives", published in July 2024. In a second line of action, OBIA functions as a repository of guiding documents in the area, originating from all parts of the world. In a third line, it acts as an "information exchange point" between the AI centers operating in Brazil: the IAX. All indicators collected will be public and can be accessed on the OBIA portal [4].
The Center for Artificial Intelligence (C4AI) at the University of São Paulo, funded by FAPESP (the public agency for research funding in the State of São Paulo) and IBM, participates in OBIA through its Humanities area. C4AI contributes qualitative research in the horizontal axes of "Legislation, Regulation, and Ethical Use" and "AI Governance," while also conducting studies across the various vertical axes to be monitored. The research group dedicated to this effort comprises scholars from the fields of law, computer science, electrical engineering, sociology, and political science, allowing for an interdisciplinary analysis of the key topics monitored by OBIA. This interdisciplinary approach will provide a comprehensive view of the current state of AI development and implementation in Brazil. Various reports, articles, and data will be provided to support OBIA in fulfilling its mission.

In addition to the participation of professionals from various departments, the Observatory has a network of external partners, including the Center for Management and Strategic Studies (CGEE), the São Paulo State System Data Analysis Foundation (SEADE), C4AI, CIAAM (Center of Artificial Intelligence and Machine Learning), and others. The following sections explore how C4AI contributes to OBIA through a complementary approach, focusing on the qualitative analysis of decisions by the São Paulo Court of Justice related to AI.

2 Data and Methodology

2.1 Legislation, Regulation, and Ethical Use: A Qualitative Analysis

The research presented in this paper is the basis of an action contributing to the implementation of the PBIA strategy [7], responsible for monitoring AI regulation and legislation. The research is divided into three main areas, the Executive, the Judiciary, and the Legislative branch, combining traditional and modern data collection methods. Regarding the Executive branch, monitoring is conducted through data scraping of government transparency websites based on a curated and continuously updated list of AI-related terms developed by the group. This monitoring aims to understand which AI systems are being purchased or contracted by public authorities. For the Judiciary, we have been analyzing court decisions from the São Paulo Court of Appeal (TJSP) related to AI, to understand judicial interpretations and rulings in the absence of specific AI legislation [2]. As of the latest data scraping in August 2024, more than 13,000 relevant decisions have been identified. Lastly, in relation to the Legislative branch, the group is closely following the progress of discussions on Bill 2338/2023, which focuses on AI regulation, by participating in public hearings and issuing technical notes to guide legislators. The goal is to expand this research to monitor AI-related legislation at the state and municipal levels, as many municipalities are legislating on the matter to prepare their cities to assume the role of "smart cities".

2.2 Monitoring and exploring the local data

To effectively monitor developments in AI, it is essential to establish a comprehensive list of AI-related terms that can guide data collection efforts. This list is derived from multiple sources, including scientific articles, standards like [3], and reports such as the OECD's [1]. The monitoring process involves monthly web scraping, on the 15th of each month, of court rulings from the TJSP (Judiciary) based on the AI-related terms list, and of data from the Brazilian Transparency Portal (Executive). For the Judiciary, the scrapes and data treatment are performed with scripts developed in the Python and R programming languages, based on the TJSP API by Jesus Filho (github.com/jjesusfilho/tjsp). For the Executive, a script was developed to scrape data from the Data Download section of the Brazilian Transparency Portal (portaldatransparencia.gov.br/download-de-dados); a sketch of this term-based filtering step is given below.
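The sketch below illustrates the kind of term-based filtering such a script performs: downloaded Transparency Portal records are kept only when they match an entry from the curated AI term list. It is a simplified stand-in, not the group's actual script; the file name, column name, and the three sample terms are hypothetical.

    # Illustrative sketch of term-based filtering over a downloaded CSV.
    # "contratos.csv", the "Objeto" column, and the sample terms are
    # placeholder assumptions, not the project's actual artifacts.
    import csv
    import unicodedata

    AI_TERMS = ["inteligencia artificial", "aprendizado de maquina", "reconhecimento facial"]

    def normalize(text: str) -> str:
        # Lowercase and strip accents so term matching is robust.
        stripped = unicodedata.normalize("NFKD", text.lower())
        return "".join(c for c in stripped if not unicodedata.combining(c))

    def matching_rows(path: str):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter=";"):
                description = normalize(row.get("Objeto", ""))
                if any(term in description for term in AI_TERMS):
                    yield row

    for hit in matching_rows("contratos.csv"):
        print(hit["Objeto"])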
Currently, we are developing an automation tool, based on NLP techniques, to enhance the qualitative analysis of these court rulings, allowing for more efficient identification and categorization of data relevant to AI research. The first approach for this automation tool uses a NER (Named Entity Recognition) model to automate the identification of relevant entities, including litigants and court judgments. The next step would be to apply a classification model, yet to be chosen, to filter out noisy data.
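As a concrete illustration of the planned NER step, the sketch below runs an off-the-shelf Portuguese spaCy model over a ruling excerpt to surface candidate entities such as parties and organizations. The model choice and the example sentence are our assumptions; the paper leaves the final NER and classification models open.

    # Illustrative sketch of the NER step, assuming spaCy's public
    # Portuguese model (python -m spacy download pt_core_news_lg).
    # The example text is invented; the project's model choice is open.
    import spacy

    nlp = spacy.load("pt_core_news_lg")

    excerpt = (
        "Apelacao interposta por Banco Exemplo S.A. contra sentenca que "
        "declarou nulo contrato de emprestimo firmado por biometria facial."
    )

    doc = nlp(excerpt)
    for ent in doc.ents:
        # Labels such as PER/ORG/LOC; litigants typically surface as PER or ORG.
        print(ent.text, ent.label_)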
The process of constructing the terms for web scraping is a critical step to ensure the relevance and accuracy of the data collected for AI research. This process begins with the development of a comprehensive list of AI-related terms, built from multiple authoritative sources. One primary source is the OECD report "Identifying and Measuring Developments in Artificial Intelligence," which offers a foundation of 226 AI-related terms identified through extensive analysis of scientific articles, open-source systems, and patents. Another source is the ISO/IEC 22989:2022 standard [3], which provides a framework for AI concepts and terminology. These terms are carefully selected, refined, and translated into Portuguese by experts working within the Brazilian Technical Standards Association (ABNT), to ensure that only terms that are highly relevant and specific to AI are included. Terms that are too general or contextually irrelevant, such as "transparency" (which could result in unrelated hits concerning Brazil's Access to Information Law), are excluded to avoid false positives in the scraping process. The final list, consisting of 103 terms in both English and Portuguese, is used to guide the web scraping data collection, allowing a focused and efficient retrieval of information that aligns with the specific research objectives.

Figure 2: Current dimensions of OBIA's monitoring topics

2.3 How to implement and classify repositories with reference documents and statistics?

As part of the data collection and structuring process for qualitative analysis, we are implementing and classifying repositories containing reference documents and statistics. These repositories will focus on key thematic areas, such as "Legislation, Regulation, and Ethical Use" and "AI Governance," and will be populated with data from sources like the TJSP, the Transparency Portal, and other relevant databases. By combining different methods, data retrieval becomes more efficient and targeted, ensuring the collection of relevant information. Web scraping supplements this process by capturing data unavailable through APIs, ensuring comprehensive coverage. The data is regularly updated, with documents classified by relevance to AI terms, creating a dynamic and organized repository (see Figure 3), as described in [6].

Figure 3: OBIA's guiding principles and expected results [6]

2.4 How to establish and maintain cooperation networks?

Establishing and maintaining cooperation networks requires fostering collaboration among interdisciplinary researchers from fields such as law, computer science, engineering, sociology, and political science. These networks are essential for sharing insights and methodologies related to AI monitoring. Using APIs and web scraping tools enables access to current data, supporting continuous knowledge exchange. Regular workshops, webinars, and joint research projects help keep participants engaged. Publishing reports, articles, and datasets strengthens the network and supports OBIA's mission to monitor AI developments comprehensively.

3 Discussion of initial results

As of June 28, 2024, a total of 13,064 decisions had been scraped from the São Paulo State Court of Justice based on AI-related terms. Out of the 103 terms searched, 45 returned at least one result. Figure 4 shows the monthly distribution of all results, while Figure 5 (logarithmic scale) displays the distribution of results by AI term. Both Portuguese and English terms were used for scraping. The top 15 terms with the most occurrences were analyzed over time, and Figure 6 presents the temporal evolution of these results by publication date.

Figure 4: Number of decisions per month, from January 2018 to June 2024.
Figure 5: Number of results per AI term (logarithmic scale).
Figure 6: Evolution of results by year for the top 15 terms.
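The aggregations behind Figures 4 and 5 can be reproduced with a few lines of pandas once the scraped rulings are in tabular form. The sketch below is illustrative; the file and column names are hypothetical assumptions about the data layout.

    # Illustrative sketch of the aggregations behind Figures 4 and 5.
    # "rulings.csv" with columns "publication_date" and "term" is a
    # placeholder assumption about the scraped data layout.
    import pandas as pd

    df = pd.read_csv("rulings.csv", parse_dates=["publication_date"])

    # Figure 4: number of decisions per month.
    per_month = df.set_index("publication_date").resample("MS").size()

    # Figure 5: number of results per AI term (plotted on a log scale
    # because counts range from 1 to several hundred).
    per_term = df["term"].value_counts()

    print(per_month.tail())
    print(per_term.head(15))  # the top 15 terms tracked over time in Figure 6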
The analysis Learning" in commercial disputes and credit issues rather than used analytical, comparative, and monographic methods, with solely technological matters. The rulings analyzed represent 79 decisions, rendered by collegiate bodies composed of multiple study when, e.g., capturing the attention of media on the terms magistrates. Each ruling follows a structured format: “criminal law” and “AI” in “Brazil” in the past 12 months, where Description and Qualification, covering aspects such as appeal, 1.4% exhibits discussions on Human Rights, and terms like case number, judicial district, presiding judge, and parties “democracy” and “discrimination” are within the top 30. When involved; Summary of the ruling; Report, offering a brief performing sentiment analysis over these results we can see description of the facts; Majority Opinion; and Dissenting large variations after the summer of 2022 with a Opinion (if applicable). The analysis was conducted with each predominantly negative sentiment regarding this search topic. of the 14 subcategories corresponding to columns in a single row: case number; type of appeal; reporting judge; district; judicial body; subject matter; judgment date; publication date; summary; parties; reasoning; final decision; context of term usage in the full text; and relevant jurisprudence. While the first nine categories were predefined based on the complete jurisprudence search, the remaining five were more subjective, created to enhance the understanding of the rulings' content and improve data visualization. Significant findings were noted in cases involving "Artificial Intelligence" and "Machine Learning," where the terms were often associated with commercial disputes, service contracts, or credit-related issues rather than purely technological applications. A recurrent theme in cases involving "Facial Biometrics" was the legality and validity of loan contracts signed through biometric recognition. The majority of decisions upheld the legality of such contracts, highlighting issues of consent and the technical reliability of biometric systems [1]. However, inconsistencies in judicial reasoning were identified, where similar cases had varying outcomes depending on the presiding judge. Overall, Figure 7: Significance of criminal law and AI in the news. the analysis highlighted several gaps and challenges in the legal treatment of AI-related technologies, particularly concerning ACKNOWLEDGMENTS transparency, fairness, and consumer protection. The study We would like to express our sincere gratitude to the Center for underlined the need for more consistent legal standards and Artificial Intelligence (C4AI) at the University of São Paulo better understanding among judges of the technical nuances (USP), supported by FAPESP and IBM, for their invaluable involved in AI applications to ensure fair and equitable rulings. support to the AI Observatory team. We thank to the CIAAM for their continued collaboration and contributions to this 4 Conclusions and further work research. We thank the support of the European Commission The qualitative research findings from the analysis of court project ELIAS - Lighthouse of AI for Sustainability (10080425). decisions related to AI reveal several key conclusions. AI- related terms such as "Facial Recognition," "Voice Recognition," REFERENCES and "Autonomous Systems" are frequently used in judicial [1] Baruffaldi, Stefano, et al. 
4 Conclusions and further work

The qualitative research findings from the analysis of court decisions related to AI reveal several key conclusions. AI-related terms such as "Facial Recognition," "Voice Recognition," and "Autonomous Systems" are frequently used in judicial contexts that extend beyond their traditional technological meanings, intersecting with areas like consumer protection, contract law, and fraud. The inconsistency in judicial reasoning and the varying outcomes in similar cases highlight the need for clearer legal frameworks and a deeper understanding of AI's technological implications among judges. Moving forward, the incorporation of NLP techniques into the analysis will help extract key arguments from judicial decisions, providing deeper insights into the legal discourse on AI. This will enhance the robustness of future research on AI regulation and its implications for public policy.

Furthermore, a preliminary analysis of news using the NLP capabilities of the Eventregistry.org system (see Figure 7) shows how this source can provide results complementary to the study: when capturing media attention on the terms "criminal law" and "AI" in Brazil over the past 12 months, 1.4% of the coverage exhibits discussions of human rights, and terms like "democracy" and "discrimination" appear within the top 30. Sentiment analysis over these results shows large variations after the summer of 2022, with a predominantly negative sentiment on this search topic.

Figure 7: Significance of criminal law and AI in the news.

ACKNOWLEDGMENTS

We would like to express our sincere gratitude to the Center for Artificial Intelligence (C4AI) at the University of São Paulo (USP), supported by FAPESP and IBM, for their invaluable support to the AI Observatory team. We thank CIAAM for their continued collaboration and contributions to this research. We also thank the support of the European Commission project ELIAS - Lighthouse of AI for Sustainability (10080425).

REFERENCES

[1] Baruffaldi, Stefano, et al. (2020) Identifying and measuring developments in artificial intelligence: Making the impossible possible. OECD.
[2] Cristina Godoy B. de Oliveira, Otávio de Paula Albuquerque, Emily Liene Belotti, Isabella Ferreira Lopes, Rodrigo Brandão de A. Silva, Glauco Arbix. Intelligent Systems: 12th Brazilian Conference, BRACIS 2023, Belo Horizonte, Brazil, September 25-29, 2023, Proceedings, Part I, pp. 18-32.
[3] ISO (2022) Information technology - Artificial intelligence - Artificial intelligence concepts and terminology. ISO/IEC 22989:2022. Available: https://www.iso.org/standard/74296.html [27 Aug 2024]
[4] Luiz Costa et al. (2024) The Brazilian Artificial Intelligence Observatory (OBIA). Available: https://www.obia.nic.br/ [27 Aug 2024]
[5] MCTI (2021). Brazilian Strategy of Artificial Intelligence. Available: ebia-documento_referencia_4-979_2021.pdf (www.gov.br) [7 Sep 2024]
[6] MCTI (2023). OBIA: Observatório Brasileiro de Inteligência Artificial. Available: https://www.gov.br/mcti/pt-br/acompanhe-o-mcti/transformacaodigital/arquivosinteligenciaartificial/1_ebia-reuniao-ro_7_24_05_2023_anexo_2_eixo2-pdf.pdf [27 Aug 2024]
[7] PBIA (2024). Brazilian Artificial Intelligence Plan. Available: https://www.gov.br/mcti/pt-br/acompanhe-o-mcti/noticias/2024/07/plano-brasileiro-de-ia-tera-supercomputador-e-investimento-de-r-23-bilhoes-em-quatro-anos/ia_para_o_bem_de_todos.pdf/view [7 Sep 2024]

The Occurrence of Incidents in the Use of Artificial Intelligence (Pojavljanje incidentov ob uporabi Umetne Inteligence)

Marko Grobelnik (marko.grobelnik@ijs.si), Besher M. Massri (m.besher.massri@gmail.com), Alenka Guček (alenka.gucek@ijs.si), Dunja Mladenić (dunja.mladenic@ijs.si)
Department for Artificial Intelligence, Jozef Stefan Institute, Ljubljana, Slovenia

Abstract
This paper presents the first results of using a system designed and developed in collaboration with the OECD for monitoring AI-related incidents. The main motivation behind these efforts is to support AI-related legislation and effective policymaking, as the system provides insights based on the collected data. The OECD AI Incidents Monitor documents AI incidents and hazards to help policymakers, AI practitioners, and all stakeholders worldwide gain valuable insights into the risks and harms of AI systems. The idea is that over time the system will help raise public awareness and establish a collective understanding of AI incidents and hazards, thus contributing to trustworthy AI.

Keywords
Artificial intelligence, data analysis, policy making, AI incidents

1 Introduction

With the ever wider use of artificial intelligence (AI), incidents related to its use are also occurring. Monitoring these incidents is essential for ensuring transparency and oversight and for developing policies that can prevent, or at least reduce, such incidents. The presented system acts as a tool that helps users track actual AI-related incidents in real time and provides an evidence base for shaping an incident-reporting framework and the related AI policy debates. By collecting detailed insights into each incident, it enables learning from past mistakes and promotes a safer and more responsible development and use of AI. It benefits the AI community by highlighting trends and areas that need attention or regulatory intervention.

An advantage of the system is that its data collection is automated, in contrast with similar repositories that are curated manually, such as the AIAAIC Repository [2]. The repository is freely accessible and intended for policymakers as well as AI developers, researchers, lawyers, and public organizations.

In the following, we present the methodology for monitoring incidents, demonstrate the operation of the system on a few real examples, present the stakeholders, and give some conclusions.
2 Methodology

The OECD methodology for monitoring AI incidents focuses on the identification and classification of incidents, providing insight into real-world developments and supporting the development of an incident-reporting framework. The starting point is the identification and classification of incidents reported in reputable international media, with the help of machine learning models, which enables building a reliable database (incidents are captured from 2014 onward). Despite these efforts, the captured incidents represent only a subset of all AI incidents globally. Incidents are classified by severity, industry, related AI principles (the OECD AI Principles [3]), types of harm, and affected stakeholders. The analysis is based on the titles, summaries, and first paragraphs of news articles, and the extracted data is used to build a reliable, objective, and high-quality database of AI-related incidents. The Event Registry system [4] serves as the news source.

The development of the system, to which we contributed, builds on the work of an international group of experts (the OECD Expert Group), which is developing a theoretical framework for incident reporting, defining the notion of an AI incident, and shaping the related terminology, such as an AI hazard and its potential consequences. The detailed methodology and definitions are explained on the OECD website: https://oecd.ai/en/incidents-methodology.
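To make the ingestion-and-classification idea concrete, the sketch below pulls candidate news articles through the Event Registry Python client and assigns tentative severity labels with an off-the-shelf zero-shot classifier. This only illustrates the pipeline shape under our own model choices; it is not the OECD monitor's actual implementation, and the API key is a placeholder.

    # Illustrative pipeline sketch, not the OECD monitor's implementation.
    # Assumes the eventregistry client and a generic zero-shot model; the
    # API key and label set are placeholders.
    from eventregistry import EventRegistry, QueryArticlesIter
    from transformers import pipeline

    er = EventRegistry(apiKey="YOUR_API_KEY")
    query = QueryArticlesIter(keywords="artificial intelligence incident", lang="eng")

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    severities = ["death", "injury", "hazard", "non-physical hazard"]

    for article in query.execQuery(er, maxItems=20):
        # Classify on the title plus opening text, mirroring the methodology's
        # use of titles, summaries, and first paragraphs.
        text = article["title"] + ". " + (article.get("body") or "")[:500]
        result = classifier(text, candidate_labels=severities)
        print(article["title"], "->", result["labels"][0])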
Figure 1: The landing page of the OECD AI Incidents Monitor (https://oecd.ai/en/incidents), showing the concept search interface, a visualization of incidents over time (bottom left; y-axis: number of incidents; x-axis: time, 2014 to today), and a statistical summary of incidents for the selected area (12,883 incidents and hazards reported on by 70,612 news articles).

3 AI Incidents Monitor

By the end of August 2024, the AI Incidents Monitor had detected over 12,000 AI-related incidents and hazards, as shown in Figure 1. The system is fully automatic: it detects incidents by scanning large amounts of data published in the news and then uses AI to determine what is labeled as an incident or a hazard. The landing page (Figure 1) shows a line chart of the growth of incidents over time (left) and the corresponding statistics (right). The user can choose between an absolute view of incidents (as in Figure 1) or select sub-areas in the corresponding menu. Looking more closely at Figure 1, the cumulative incidents (purple) and their three-month average (blue) are marked in different colors. The statistics on the right show the absolute number of incidents, the statistics for the last month, and the months with the highest values (February 2024). From the statistics on month-over-month, quarter-over-quarter, and year-over-year changes, we can see a drop in the number of incidents and hazards reported by the media in the last month compared to the previous month and the previous quarter.

3.1 An example analysis of incident occurrence

The system enables advanced filtering of AI incidents by the following categories: time, country, industry, AI principle, severity, harm type, affected stakeholders, and type of content search (see Figure 1). For example, the possible values for severity are: death, injury, hazard, and non-physical hazard, while the possible harm types are: physical, psychological, economic, reputational, public interest, human rights, and unknown.
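A minimal sketch of what such categorical filtering amounts to is given below, over toy in-memory incident records; the field names and sample records are invented for illustration and do not reflect the monitor's internal data model.

    # Minimal filtering sketch over invented incident records; field names
    # and values are illustrative, not the monitor's internal schema.
    from dataclasses import dataclass

    @dataclass
    class Incident:
        title: str
        country: str
        severity: str       # e.g. "death", "injury", "hazard", "non-physical hazard"
        harm_types: tuple   # e.g. ("economic", "reputational")

    incidents = [
        Incident("Chatbot leaks user data", "Slovenia", "non-physical hazard", ("economic",)),
        Incident("Autonomous forklift injury", "Germany", "injury", ("physical",)),
    ]

    def advanced_search(records, country=None, severity=None, harm=None):
        for r in records:
            if country and r.country != country:
                continue
            if severity and r.severity != severity:
                continue
            if harm and harm not in r.harm_types:
                continue
            yield r

    for hit in advanced_search(incidents, country="Slovenia"):
        print(hit.title)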
The system supports advanced concept search: for generative AI, for example, it reports statistics showing 2,302 incidents and hazards. One of the detected incidents concerns Apple and the development of an "AI personality" intended to replace Apple's existing Siri.

Beyond concepts, the user can further use the advanced search to pinpoint a desired subset of incidents, for instance by selecting the country associated with the reporting of AI incidents and hazards. Figure 2 shows an example search by the country category, for Slovenia. The system finds two incidents connected to Slovenia. The first concerns Microsoft's increased contribution to CO2 emissions. At first glance the connection to Slovenia is not obvious, but looking at the related news we find a mention of Slovenia: "…But the tech giant's electricity consumption last year rivaled that of a small European country—beating Slovenia easily." [6].

Each case is also semantically annotated. In Figure 2, the first case is marked as related to the AI principles of efficiency and sustainable development. Microsoft's actions can affect several stakeholders: the general public, businesses, workers, and governments (Affected Stakeholders, Figure 2). In addition, they pose a hazard to the environment, public interests, and human rights (Harm Type, Figure 2), and the case is classified as a non-physical hazard (Severity, Figure 2).

Detailed analyses collected in the recent report "Observatory of the social and ethical impact of artificial intelligence" [5] show that most incidents (96%) fall into the non-physical hazard category, yet they can have very serious psychological and financial consequences, including harassment, addiction, and reputational damage to individuals as well as institutions.

4 Stakeholders

The OECD AI Incidents Monitor (AIM) is a valuable tool designed for the various stakeholders involved in the development, regulation, and use of artificial intelligence. Potential users of the tool include policymakers, AI developers, researchers, legal experts, and public organizations.

Policymakers can use AIM to track and analyze real-time data on AI-related incidents worldwide, helping them design informed, evidence-based regulation.
The tool's ability to categorize incidents by severity, industry, and harm type is key to understanding the broader consequences of AI technologies and to designing policies that reduce risks.

AI developers and researchers can benefit from AIM by identifying common problems associated with AI systems. By studying the incidents recorded in AIM, they can improve their models to avoid similar problems and increase the safety and reliability of AI applications.

Legal experts can use AIM to gain insight into the evolving landscape of AI-related risks, which could be useful in legal cases or compliance assessments. Understanding past incidents and their legal consequences can guide the development of robust AI governance frameworks.

Finally, public organizations and advocacy groups can use AIM to monitor the societal impacts of artificial intelligence, ensuring that the public interest is protected. This can include analyzing patterns of AI incidents to advocate for better consumer protection and ethical standards in AI deployment.

5 Discussion

In this paper we presented the OECD AI Incidents Monitor, in whose development we participated. The system serves as a good resource for a wide range of users who want to understand and manage the risks associated with AI technologies. The system is being extended with additional data sources.

In the future, an open data-submission process is planned, which will complement the incident information obtained from the current sources. Further work also includes automatic analysis of the incident data for more comprehensive insight, including the automatic discovery of patterns such as chain reactions or effects across several industries at once. To verify the veracity of reported incidents, the system could combine information from several independent sources and use fake-news detection algorithms, as well as manual verification.

Acknowledgements

The described work was supported by the OECD and many of its international experts, the Slovenian Ministry of Digital Transformation, and the Slovenian Research and Innovation Agency under CRP V2-2272 and V5-2264.

References

[1] OECD AI Incidents Monitor (AIM). https://oecd.ai/en/incidents. August 2024.
[2] AIAAIC Repository. https://www.aiaaic.org/aiaaic-repository. August 2024.
[3] OECD AI Principles for trustworthy AI. https://oecd.ai/en/ai-principles. August 2024.
[4] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107-110.
[5] Richard Benjamins. Another Inconvenient Truth: The Societal Emergency of AI Incidents - We Should Do Something About It. https://www.odiseia.org/post/another-inconvenient-truth-the-societal-emergency-of-ai-incidents-we-should-do-something-about-it
[6] Microsoft's AI Push Imperils Climate Goal as Carbon Emissions Jump 30%. https://tanaka-preciousmetals.com/en/elements/news-cred-20240821/

Figure 2: Advanced search on the OECD AI Incidents Monitor (https://oecd.ai/en/incidents), filtered by country for Slovenia. Statistics are given for two incidents reported on by 25 news articles, and both incidents are shown below the statistics.

Perception of AI in Slovenia

Abdul Sittar (abdul.sittar@ijs.si), Alenka Guček (alenka.gucek@ijs.si), Dunja Mladenić (dunja.mladenic@ijs.si)
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia (Dunja Mladenić also Jožef Stefan International Postgraduate School)

Abstract
This paper introduces the AI News Monitor system, developed for real-time monitoring and analysis of artificial intelligence (AI) perception in global and local news media. Leveraging data from the Event Registry platform, the AI News Monitor tracks AI-related news articles across multiple dimensions, providing insights through three key views: a global historical overview, current global trends, and local trends specific to Slovenian media. The system facilitates both passive observation of AI discourse and active exploration of specific AI-related events. Our illustrative analysis reveals significant global trends, including heightened media focus on deep learning, generative AI, and robotics, and examines the implications of these trends for public trust in AI. Additionally, the paper discusses the practical applications of the AI News Monitor for stakeholders such as policymakers, journalists, business leaders, and researchers. We conclude with a discussion of the impact of media coverage on public perception of AI and propose possible future enhancements of the system, including broader language and source coverage.

Keywords
datasets, artificial intelligence, media monitoring, perception
1 Introduction

Artificial Intelligence (AI) is increasingly becoming an integral part of society, influencing various aspects of daily life and industries [4]. As AI continues to evolve, so does its portrayal in the media, which plays a critical role in shaping public perception and trust. Understanding how AI is perceived globally and locally is essential for policymakers, businesses, and researchers to ensure that AI technologies are developed and deployed in ways that are socially acceptable and trustworthy [3, 4].

In response to this need, we have developed the AI News Monitor, a system designed for real-time monitoring and exploratory analysis of AI-related news coverage. The AI News Monitor offers a comprehensive view of how AI is discussed in the media, capturing data from the Event Registry platform on a monthly basis [7]. The system is structured around three main views: a global overview that presents historical data from the past year, global trends that highlight recent AI-related events, and local trends focusing on mentions of AI by Slovenian news sources. These views allow users either to passively monitor ongoing developments in AI or to actively explore specific events and trends that may influence public opinion.

The main scientific contributions of this paper are the following:
(1) We present a methodology for understanding public perception of AI in the news.
(2) We analyse some trends in the perception of AI.

The remainder of the paper is structured as follows. Section 2 describes the methodology for collecting historical data and AI news categories and for gaining insight into public perception of AI in the news. Section 3 presents the analysis of trends in AI's perception. We present different user scenarios and possible applications of AI news monitoring in Section 4 and a discussion in Section 5. Section 6 concludes the paper and outlines possible areas of future work.

2 Methodology

The proposed approach to creating a web service for analyzing public perception involves two key steps: 1) identifying AI-related categories and gathering news within these categories, and 2) developing a web service that displays trends across these categories and news publishers and highlights current trends among both global and local (Slovenian) news sources (see Figure 1).

Figure 1: Architecture for real-time AI news monitoring and visualization, based on Event Registry and implemented using Flask and Plotly (front-end, back-end, and database components; BERT topic modeling; Plotly graphs serving the global overview, global trends, and local trends views).
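A minimal sketch of the Figure 1 architecture is shown below: a Flask back end that serves a Plotly figure as JSON for the front end to render. The route name and the in-memory sample data are our own placeholders, not the deployed service.

    # Minimal Flask + Plotly sketch of the Figure 1 architecture.
    # The route and sample data are placeholders, not the deployed service.
    from flask import Flask
    import plotly.express as px

    app = Flask(__name__)

    @app.route("/api/global-overview")
    def global_overview():
        # In the real system these counts would come from the database of
        # Event Registry articles; here they are hard-coded for illustration.
        months = ["2024-01", "2024-02", "2024-03"]
        counts = [950, 1200, 1800]
        fig = px.line(x=months, y=counts,
                      labels={"x": "month", "y": "AI-related articles"})
        return app.response_class(fig.to_json(), mimetype="application/json")

    if __name__ == "__main__":
        app.run(debug=True)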
Firstly, we selected AI-related categories based on the Slovenian AI observatory (http://siai.ijs.si/dashboards/Main/SlovenianObservatoryIntro?globalCountry=SVN) and Wikipedia (http://country-dashboards.ijs.si/dashboards/Main/Index?). The key categories associated with Artificial Intelligence include 'Generative AI', 'Artificial Intelligence', 'NLP', 'Chat-GPT', 'Deep Learning', 'Robotics', 'Computer Vision', 'Neural Networks', 'Graph Neural Networks', 'Self-supervised Learning', and 'Zero-shot Learning'.

Next, we collected news articles from the last year related to these categories. These articles were classified into the appropriate categories based on Wikipedia concepts, and we also obtained sentiment data from Event Registry. The portrayal of AI-related news significantly impacts public perception: the emphasis on risks, benefits, or ethical concerns shapes public opinion and drives narratives that can either build trust or instill fear [8], [12], [1].

To understand global trends, we retrieved news events published globally in the last month. For local trends, we focused on news articles published by the top 50 Slovenian news publishers. Finally, we employed topic models to analyze the corpus of news articles and extract the underlying themes [9], [2].
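The topic-extraction step can be reproduced in outline with the BERTopic library that the architecture diagram refers to. The sketch below is illustrative, assuming the articles are already available as a list of strings; the toy corpus and model settings are our defaults, not the system's.

    # Illustrative BERTopic sketch for the topic-extraction step; the
    # sample documents and settings are our own assumptions.
    from bertopic import BERTopic

    base = [
        "Runway launches Gen-2, a generative video model for short clips.",
        "Researchers use generative AI to build more versatile robots.",
        "Deepfakes raise concerns about elections and voter trust.",
    ]
    articles = base * 20  # toy corpus; in practice, the monthly article texts

    topic_model = BERTopic(language="english", min_topic_size=2)
    topics, probabilities = topic_model.fit_transform(articles)

    # Inspect the extracted themes and their top keywords.
    print(topic_model.get_topic_info())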
Figure 2: Time series of the number of news articles by specific areas (in colors, at the top), detailed view upon precise exploration (middle), and corresponding sentiment of news from specific areas (at the bottom).

3 Analysis of trends in AI's perception

3.1 Global Overview

The global overview provides a historical review of global AI-related news (see Figure 2). Users can explore the number of news articles across 13 AI fields (Generative AI, Chat-GPT, Deep Learning, Robotics, Computer Vision, Neural Networks, Graph Neural Networks, Artificial Intelligence, Federated Learning, Few-shot Learning, Meta Learning, Self-supervised Learning, and Zero-shot Learning) or by news provider, and get an overview of the sentiment of the news. Global trends allow for the review and exploration of global AI-related trends based on events captured in the last month. Figure 3 shows a detailed view of the global trends: a written report of the number of news articles and events, a histogram of the number of AI-related news articles over time, and the ability to explore the last 10 events in a selected field.

Figure 3: A detailed view of Global Trends, showing the option to select news events based on chosen AI fields.

3.2 Local Trends

Local trends allow for the review of news from Slovenian news providers for the last month. The local trends view shows a written report of the number of news articles and events, a histogram of the number of AI-related news articles over time, and the ability to explore further (see Figure 4).

Figure 4: A detailed view of Local Trends, showing the option to select news events based on chosen AI fields.

3.3 Examples of trends

3.3.1 Global Overview. In the historical overview of AI trends in March 2024 (Figure 2), there was a significant increase in the number of news articles and in interest in deep learning, generative AI, and robotics. Specifically, on March 18th there were 1,800 news articles about generative AI, 970 about robotics, and 274 about deep learning. This spike highlights several key events: one of the standout stories was the launch of Gen-2 by Runway, a generative video model capable of creating high-quality short clips. Another important topic was the use of AI in political campaigns, particularly the creation of deepfakes and misinformation, which raised concerns about AI's impact on elections and voter trust. In the field of robotics, researchers were inspired by advancements in generative AI to develop more versatile robots: these new robots can perform various tasks using a single, comprehensive model, demonstrating significant progress in robotic capabilities. Overall, the sentiment in March 2024 was positive (as seen from the sentiment analysis), reflecting enthusiasm and optimism regarding this technological progress. The increased media attention highlights the rapid development and growing importance of AI in various fields.
3.3.2 Global Trends. In our examination of global trends, we selected the news story "AI and heat waves pose dual threats to the power grid" and found that two specific newspapers published more articles on this topic than the others. The sentiment of these articles, as shown in the middle graph (Figure 4), fluctuates between positive and neutral. Delving into the content of these publications, we found that Forbes focused on the issue of fake news generated by AI, while Lexology explored future AI applications in various fields.

3.3.3 Local Trends. In the last month (at the time of writing, June 2024), there was an increase in AI-related news from Slovenian news providers, particularly from Delo.si and Sta.si (Figure 5). When analyzing the sentiment of these articles, most were neutral, with a few expressing positive opinions about AI. Delo.si focused on the growing adoption of AI by companies in Slovenia, highlighting discussions on the potential of quantum computing and recent advancements in AI technology; this coverage indicates a balanced view of AI's impact and potential. Sta.si reported on the construction of a state-of-the-art data center in Maribor, which will also house a supercomputer, a major development in Slovenia's technological infrastructure. Additionally, Sta.si wrote about AI trends that benefit semiconductor manufacturers, reflecting a positive outlook on the economic impact.

4 User Scenarios and Applications

The AI News Monitor can cater to a range of stakeholders with varying use case objectives [10], [6], [5]. Policymakers can use the system to track global and local trends in AI-related topics, enabling them to craft data-driven policies that balance innovation with societal concerns. Journalists can leverage the system to gather comprehensive insights into public sentiment and media coverage, enriching their reporting with accurate and timely information [11]. Other potential stakeholders are business executives, NGOs, researchers, and educators. Detailed scenarios for both policymakers and journalists are given below, illustrating how the AI News Monitor can support their specific goals.

Policy makers. Scenario: a policymaker uses the AI News Monitor to track trends in robotics.
Background: Jure, a decision-maker at a government agency for technology and innovation, is tasked with drafting new guidelines for the development and implementation of robotics in Slovenia. To understand the broader context and local trends, he needs to explore the global perception of robotics and compare it with local perspectives.
Steps: Step 1: Searching for a global overview. Jure logs into the AI News Monitor and searches for "robotics" under the global overview section. The system displays a line chart showing how robotics has been mentioned over time, along with a sentiment graph for the past year. He finds that robotics is globally discussed with mostly positive sentiment, particularly in Asia and North America. Step 2: Global trends. Jure selects "robotics" among the topics and reviews recent events on this subject. He chooses an event focusing on robotics in the EU and examines the sentiment of the publications and the main themes. In his browser, he looks at the specific articles and discovers that discussions predominantly revolve around automation and industrial robotics. Step 3: Local trends in Slovenia. Next, Jure is interested in a review for Slovenia, to understand how robotics is perceived at the local level. The dashboard for the selected topic displays an analysis of recent articles from Slovenian media. Using the browser, he discovers that discussions mainly focus on the impact of robotics on employment and the potential use of robots in healthcare. Jure finds that local concerns are more focused on social and economic impacts, and he includes these insights in his preparatory documents for the new guidelines. Step 4: Compiling the report and recommendations. Finally, Jure exports key data, including sentiment graphs and media summaries, from the AI News Monitor. He compiles a report that summarizes global trends and local concerns and proposes balanced guidelines that promote innovation in robotics while addressing social impacts.

Journalists. Scenario: a journalist uses the AI News Monitor to track trends in generative AI.
Background: Ana, a journalist at a technology magazine, is tasked with writing an article on the growing trend of using generative AI to create videos. She needs to explore both global trends and local perspectives in Slovenia to provide a comprehensive overview.
Steps: Step 1: Searching for a global overview. Ana searches for "generative AI" under the global overview section. The system displays a line chart showing that this topic is on the rise, identifies the media outlets reporting on generative AI, and provides a sentiment graph for the past year. Step 2: Global trends. Ana selects "generative AI" and reviews recent events on this topic. She focuses on deepfake video generation, checking who has written about it and what the main themes are, and then looks up these articles in her browser. Step 3: Local trends in Slovenia. Ana shifts her focus to Slovenia to understand local views. The dashboard reveals that Slovenian media coverage is largely positive, particularly for certain providers. However, Ana realizes the need to include concerns about authenticity and misinformation to provide a balanced perspective. Step 4: Compiling and writing. Ana exports key data, including sentiment graphs and media summaries, from the AI News Monitor. She drafts her article, starting with global trends and then delving into specific concerns in Slovenia, enriched with visual data.
Figure 5: Time series of the number of news articles by news provider in Slovenia (at the top), sentiment analysis (in the middle), and frequency of topics for this period (at the bottom).

5 Discussion

Services like the AI News Monitor can play a role in fostering greater transparency around AI by offering detailed insights into how AI is being discussed across various media platforms. By tracking public sentiment and highlighting both positive and negative trends, it helps ensure that the development and deployment of AI technologies are aligned with public concerns and expectations.

While the AI News Monitor offers valuable insights, it has limitations, such as its reliance on media reporting, which may not capture the full spectrum of public opinion. Additionally, potential biases in the media sources or in the algorithms used for sentiment analysis could skew the results, presenting challenges in ensuring a fully accurate and balanced representation of public perception.

6 Conclusions

The AI News Monitor was developed to understand and track public sentiment around AI, offering policymakers, journalists, and other stakeholders the insights needed to make informed decisions. AI perceptions can be monitored globally and locally, here in the context of Slovenia. However, there are opportunities for future work to enhance its capabilities: expanding its coverage to include more languages and diverse sources would provide a more global perspective, while refining the sentiment analysis techniques could improve accuracy and reduce potential biases.
7 Acknowledgments

This work was supported by the European Union through the AI4Gov (101094905) and TWON (101095095) EU Horizon Europe projects, the Ministry of Digital Transformation, and the Slovenian Research and Innovation Agency under CRP V2-2272.

References
[1] Iyad AlAgha. 2021. Topic modeling and sentiment analysis of Twitter discussions on COVID-19 from spatial and temporal perspectives. Journal of Information Science Theory and Practice, 9, 1, 35–53.
[2] David Alvarez-Melis and Martin Saveski. 2016. Topic modeling in Twitter: aggregating tweets by conversations. In Tenth International AAAI Conference on Web and Social Media.
[3] Stephen Cave, Kate Coughlan, and Kanta Dihal. 2019. "Scary robots": examining public responses to AI. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 331–337.
[4] Ethan Fast and Eric Horvitz. 2017. Long-term trends in the public perception of artificial intelligence. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, No. 1.
[5] Fabian Gilson, Matthias Galster, and François Georis. 2020. Generating use case scenarios from user stories. In Proceedings of the International Conference on Software and System Processes, 31–40.
[6] Debasish Kundu and Debasis Samanta. 2007. A novel approach of prioritizing use case scenarios. In 14th Asia-Pacific Software Engineering Conference (APSEC'07). IEEE, 542–549.
[7] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107–110.
[8] Kalle Lyytinen, Heikki Topi, and Jing Tang. 2021. Information systems curriculum analysis for the MaCuDE project. Communications of the Association for Information Systems, 49, 1, 38.
[9] Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. 2013. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 889–892.
[10] Frank Moisiadis. 2000. Prioritising use cases and scenarios. In Proceedings of the 37th International Conference on Technology of Object-Oriented Languages and Systems (TOOLS-Pacific 2000). IEEE, 108–119.
[11] Abdul Sittar, Daniela Major, Caio Mello, Dunja Mladenić, and Marko Grobelnik. 2022. Political and economic patterns in COVID-19 news: from lockdown to vaccination. IEEE Access, 10, 40036–40050.
[12] Abdul Sittar, Dunja Mladenić, and Marko Grobelnik. 2022. Analysis of information cascading and propagation barriers across distinctive news events. Journal of Intelligent Information Systems, 58, 1, 119–152.

What will happen tomorrow? Predicting future event types for businesses

Tesia Šker, Jožef Stefan Institute, Ljubljana, Slovenia, tesia.sker@gmail.com
Jože M. Rožanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, joze.rozanec@ijs.si
Gregor Leban, Event Registry d.o.o., Ljubljana, Slovenia, gregor@eventregistry.org
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si

Jože M. Rožanec and Tesia Šker are co-first authors with equal contribution and importance. Corresponding author: Jože M. Rožanec, joze.rozanec@ijs.si.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia
© 2024 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2024.sikdd.24

ABSTRACT

Strategic foresight helps organizations anticipate future challenges and opportunities, allowing them to handle uncertainty better. While strategic foresight is becoming more widely adopted across organizations, the process still heavily relies on expert knowledge, and little of it has been automated through artificial intelligence. In this research, we explore how media news events can be analyzed to forecast event types that will take place in the near future. In particular, we consider it a supervised machine learning problem with a well-defined set of event types and leverage a graph representation of the media news events to create graph embeddings, train a classifier, and predict event types that will likely occur one day ahead. We validated our approach on a real-world dataset of an American multinational conglomerate operating in industry, worker safety, healthcare, and consumer goods.

KEYWORDS

strategic foresight, event prediction, machine learning, graphs

1 INTRODUCTION

Strategic foresight helps organizations anticipate future challenges and opportunities, allowing them to handle uncertainty better [9]. Therefore, predicting future event types as part of strategic foresight has become necessary for businesses to manage their operations without significant losses. Various events on a major scale, such as floods, earthquakes, internet failures, or pandemics, as we have witnessed recently, or on a minor scale, such as road closures due to sports events or promotions at fairs, can have a major impact on business operations. By predicting the next event type, businesses can adjust prices, reschedule staff, manage stocks, reroute transportation to avoid delays, and more, and thus reduce losses or increase their sales and profits.

There is currently a massive number of articles written on future event prediction. Based on Zhao [11], event prediction methods can be classified in terms of goals into time prediction, location prediction, semantics prediction, and a combination of these. Each goal is divided into subgoals for which various techniques can be applied. According to the classification provided by Zhao, our technique can be classified as semantic prediction.

In this research, we explore how graphs can be used to model media news events and to forecast event types in the near future. By doing so, we provide a valuable tool for decision-makers, offering them a clearer view of potential outcomes. Specifically, our research focuses on using a JSON dataset containing a variety of articles about a particular business company. We create a graph representation of the articles and use Graph2Vec to create embeddings that can be used downstream to fit other machine-learning models. Using this information, we apply a Random Forest classifier to predict the categories of articles about the company for the following day. In particular, we expect this to be useful for giving organizations a competitive advantage in fast-changing markets [5]. While human expertise is valuable, it varies from person to person, leading to inconsistent predictions. Manually analyzing large datasets is also time-consuming and prone to errors. AI, however, can process vast amounts of data, spot patterns, and predict future event types more accurately.

This work is structured as follows. Section 2 presents related work relevant to this paper. Section 3 describes the data in the dataset and the data extraction process. Section 4 introduces a new approach to predicting future event types.
Section 5 presents the results of this research. Section 6 concludes this work and proposes future improvements.

2 RELATED WORK

In recent decades there has been an increasing interest in strategic foresight in the academic field. According to Fergnani (2020) [2], this is because by "using corporate foresight, organisations can reconfigure their strategy based on the analysis of business opportunities suggested by future possibilities". Even in academia, "one of the domains heavily impacted by Artificial Intelligence is innovation management and in this context especially the area of Strategic Foresight (SF)", as per Brandtner et al. (2021) [1]. However, it seems that strategic foresight methods related to AI only end up being used by bigger companies with a larger number of resources. As noted by Kim and Seo (2023) [6], "except for AI start-ups and players in the consumer electronics and information and communication industry, small- and medium-sized enterprises (hereafter SMEs) in other industries do not demonstrate competence in AI." Therefore, effective implementation of AI solutions for strategic foresight in small and medium-sized companies would be one of the topics to be explored in future research.

In this research, however, we focus on the general implementation of strategic foresight by means of next event prediction. Exploring similar fields, we found existing research on event prediction that, rather than focusing on businesses, focused on other domains. In the field of sequential event prediction, several researchers are exploring diverse methods. Although these methods share some conceptual similarities with our research, they differ significantly in methodology and focus.
Letham, Rudin, and Madigan (2013) [7] developed a model that predicts the next event using an ERM-based approach with logistic regression, focusing on the presence of events rather than their order. In contrast, our work uses labeled article databases and considers the sequence of past events, using techniques like graph construction, random walks, and random forests. Yeon, Kim, and Jang (2015) [10] focus on predicting event flow through visual analytics, using LDA for topic extraction and emphasizing specific keywords, while our approach is entirely text-based and relies on graphs. Hu et al. (2017) [4], on the other hand, use LSTM networks for predicting future subevents, which offers an alternative to our non-LSTM-based text analysis.

Although these studies provide useful insights and have offered significant improvements in sequential event prediction, they face certain challenges. For instance, Letham, Rudin, and Madigan (2013) [7] emphasize event presence over sequence, potentially missing key temporal relationships, while Yeon, Kim, and Jang (2015) [10] depend heavily on keywords, overlooking broader context. Additionally, LSTM-based models like those used by Hu et al. (2017) [4] are powerful but require significant computational power. In contrast, our work addresses these limitations by employing a graph-based approach that prioritizes event sequences and leverages standardized data from sources like DMOZ and Wikipedia. This enables us to make more accurate and efficient predictions, offering a practical and scalable solution.

3 DATASET

3.1 Data Extraction Pipeline

The event detection pipeline processes about 300,000 English news articles per day. Each news article is first annotated using tools like entity linking, topic classification, and sentiment detection. Each article is then split into sentences, where each sentence retains its annotations and other metadata. For each pair of entities in a sentence, an event classifier then determines whether a particular relation of interest is expressed in the sentence between the two entities. The predefined taxonomy currently includes 133 event types of interest, ranging from security, environment, natural disasters, accidents, and politics to other areas. To classify the events, a neural network transformer architecture with a pretrained encoder is used. The entire network, including the encoder, is trained on our supervised dataset using best practices like online hard example mining, class balancing, dropout, and consistency regularization. The sentences in which the classifier finds a relation of interest are then stored in a database, together with the pair of associated entities and other available metadata.

Figure 1: Sample of relevant data considered when parsing an event type to build the dataset.

3.2 Data Description

For our research, we used a dataset of events provided by Event Registry, with media events encoded in JSON format. Specifically, we analyzed 4,216 events related to the company 3M, recorded between June 23, 2021, and July 23, 2024. We used a URI to classify each event, drawing from DMOZ and Wikipedia categories (Fig. 1). These were selected because they provide standardized descriptions of the events being reported, which makes the data consistent and reliable. The events are categorized into 94 distinct types, which are further grouped into three primary domains: business, environment, and society. The business domain makes up the largest proportion of events, accounting for 65 types (69% of the total), while the environment and society domains contain 13 types (14%) and 16 types (16%), respectively. Within these domains, the event types are further divided into smaller subdomains, which can be aggregated into larger subdomain units, as demonstrated in the event type taxonomy (Fig. 2).

Figure 2: Event Type Taxonomy
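To make this dataset description concrete, the following minimal sketch tallies event types and their top-level domains from such a JSON export. The file name, the "eventType" field, and the domain-prefix convention are illustrative assumptions, since the exact Event Registry schema is not reproduced in the paper.

```python
import json
from collections import Counter

# Load a JSON export of media events and tally event types per domain.
# "events_3m.json" and the "eventType" field are hypothetical names.
with open("events_3m.json", encoding="utf-8") as f:
    events = json.load(f)

type_counts = Counter(e["eventType"] for e in events)
# Assume type URIs carry their domain as a prefix, e.g. "business/...".
domain_counts = Counter(e["eventType"].split("/")[0] for e in events)

print(f"{len(type_counts)} distinct event types")  # 94 in the paper's dataset
for domain, n in domain_counts.most_common():
    print(f"{domain}: {n} events ({100 * n / len(events):.1f}%)")
```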
4 METHODOLOGY

This study uses graph-based techniques to predict future event types from news articles about a specific company. The process starts by building a graph that maps relationships between event types and concepts from Wikipedia and DMOZ. Random walks are then performed on this graph to extract key information such as URIs, dates, and event types, which are then transformed into embeddings using Graph2Vec [8]. Next, the event types are encoded and adjusted through a process called target shifting. This step aligns the features to better forecast future outcomes based on previous data. The predictions are made using a Random Forest classifier, which is then validated through stratified k-fold cross-validation for higher accuracy. The following sections present each step of this process in more detail (see Fig. 4).

4.1 Graph Construction

For each article in the JSON dataset, a detailed graph G is generated using the NetworkX library [3]. The graph construction process starts by extracting key information such as the article's URI (unique identifier), as well as the date associated with the article and the event types, which are represented by specific URIs. In addition to these elements, each article also includes two important lists: 'slots' and 'categories'. The 'slots' list contains wiki and dmoz addresses that are directly related to the event described in the article, while the 'categories' list includes various classifications of the event. To complete the graph, labels are created by extracting URIs from the 'slots' list and filtering the 'categories' to focus on those with the "dmoz" prefix.
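The sketch below illustrates this per-article graph construction with NetworkX. The JSON field names ("uri", "date", "eventTypes", "slots", "categories") are assumptions based on the description above, not the authors' actual schema.

```python
import networkx as nx

def build_article_graph(article: dict) -> nx.Graph:
    """Build the per-article graph described in Section 4.1: the article's
    URI is linked to its event types, 'slots' (wiki/dmoz addresses), and
    dmoz-prefixed categories; the date is kept as a node attribute."""
    g = nx.Graph()
    uri = article["uri"]
    g.add_node(uri, date=article["date"])
    for event_type in article["eventTypes"]:
        g.add_edge(uri, event_type)
    for slot in article["slots"]:              # wiki and dmoz addresses
        g.add_edge(uri, slot)
    for category in article["categories"]:
        if category.startswith("dmoz"):        # keep only dmoz labels
            g.add_edge(uri, category)
    return g

# One graph per article, keyed by URI, as assumed by the later steps;
# `articles` would be the parsed JSON dataset:
# graphs = {a["uri"]: build_article_graph(a) for a in articles}
```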
4.2 Random Walks for Feature Extraction

Once the graphs for each article are constructed, random walks are performed, starting at a given node (event type) and moving to adjacent nodes based on specific probabilities. Several random walks are generated for each node, forming the foundation for feature extraction. A single random walk begins by initializing the path with the starting node and iterating over a specified path length. At each step, a random number is compared with a probability p. If the number is less than p, the walker stays at the current node; otherwise it moves to a random neighbor. If no neighbors are available, the walk ends. Generating multiple random walks for every node follows a similar approach, using p as the probability of staying at the current node (set at 0.05). The process involves creating an empty list to store all random walks and iterating through each node in the graph. For each node, the specified number of random walks is generated, and each walk is appended to the list.

4.3 Embedding Generation Using Graph2Vec

The random walks from the graphs are processed similarly to word sequences in a document. The 'embedding_data' function generates vector embeddings for graph data using the Doc2Vec model. It begins by converting each random walk into a TaggedDocument, storing these in 'documents_gensim'. The Doc2Vec model, with a vector size of 5, is trained on these documents, creating a vector space where similar sequences are positioned close together. The function then processes each graph in the graphs dictionary, extracting the URI, date, and event type, and generating additional random walks. These walks are converted into embeddings using the 'infer_vector' method, and the resulting vectors are averaged into one final embedding. This embedding is stored in a dictionary across 'embedding1' to 'embedding5', alongside the graph's metadata. A sketch of the walk and embedding steps is given below.
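This sketch combines Sections 4.2 and 4.3: a lazy random walk with stay probability p = 0.05, walks pooled into TaggedDocuments to train a Doc2Vec model with vector size 5, and per-graph embeddings obtained by averaging inferred walk vectors. The `graphs` dictionary comes from the construction step above; walk length and walks per node are illustrative defaults not specified in the paper.

```python
import random
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

STAY_P = 0.05  # probability of staying at the current node (Section 4.2)

def random_walk(g, start, length=10):
    """One lazy random walk: with probability STAY_P stay at the current
    node, otherwise move to a random neighbor; end early without neighbors."""
    path, node = [str(start)], start
    for _ in range(length):
        neighbors = list(g.neighbors(node))
        if not neighbors:
            break
        if random.random() >= STAY_P:
            node = random.choice(neighbors)
        path.append(str(node))
    return path

# Train Doc2Vec on walks pooled over all graphs (vector size 5, Section 4.3).
all_walks = [random_walk(g, n) for g in graphs.values() for n in g.nodes]
documents = [TaggedDocument(words=w, tags=[i]) for i, w in enumerate(all_walks)]
model = Doc2Vec(documents, vector_size=5, min_count=1)

def embed_graph(g, walks_per_node=5):
    """Average freshly inferred walk vectors into one graph embedding."""
    vecs = [model.infer_vector(random_walk(g, n))
            for n in g.nodes for _ in range(walks_per_node)]
    return np.mean(vecs, axis=0)

embeddings = {uri: embed_graph(g) for uri, g in graphs.items()}
```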
Figure 3: Event Type Graphs (panels a, b, c)

4.4 One Hot Encoding & Target Shifting

To transform the categorical event types into binary vectors, one-hot encoding is applied. This allows the model to treat each event type as a separate class. After extracting the relevant column names, the encoded target data is concatenated with the feature embeddings, creating a dataset for model training and evaluation. The dataset is then aggregated by averaging the embeddings and taking the maximum value of the encoded target columns for a given day. Finally, the 'target' data is shifted by one day, which allows the embeddings to forecast the event types for the following day.

4.5 Random Forest Classification & Stratified K-Fold Cross Validation

To ensure effective classification and prediction, a Random Forest classifier is created. When employing this method, the embeddings are used as features and the one-hot encoded event types are used as labels. The data is split into training and testing sets, followed by the incorporation of stratified k-fold cross-validation. This technique splits the data into 10 folds while ensuring that the event type proportion in each fold remains equal. The model is then trained on 9 folds, with the remaining fold being used for validation. This ensures balanced representation of each class across the folds, resulting in more reliable performance estimates.

5 RESULTS

As mentioned above, the model was trained on a training set and then evaluated on a test set. The training set included approximately 508 samples for each fold, and the test set included about 10% of the whole set, which amounted to 56 samples per fold. Using this, the model predicted the probabilities of event types for each set. When training the model for each class, we noticed that certain classes did not have enough occurrences to have at least one entry per dataset fold; these classes were skipped. We therefore trained the model and made predictions for a total of 45 classes.

To evaluate the discriminative performance of the model, the ROC AUC score was used. The results show how well the model distinguishes between different classes, as well as the model's ability to predict future event types. The average ROC AUC of the model was around 0.5674, and the median was close to it at 0.5559, with the highest score reaching 0.8194 and the lowest 0.3338. While the best scores demonstrate that we can effectively forecast some event types ahead of time, further work is required to enhance the results, which in most cases remain close to 0.5.

6 CONCLUSIONS

This study developed a graph-based approach to predicting event types in articles. In the process, we utilized random walks for feature extraction and Doc2Vec for embedding generation. Then, we trained a Random Forest classifier on the resulting embeddings and evaluated it with stratified k-fold cross-validation. The model demonstrated modest overall performance, with an average ROC AUC score of around 0.5674, reaching a peak of approximately 0.8194. This indicates some ability to capture relationships within the data and predict future event types. However, occasional fluctuations in accuracy suggest room for further improvement. We are currently striving to find ways to make the graphs more informative. In future work, we could refine the feature extraction process by incorporating larger datasets, with a wider variety of samples and a larger number of companies.

Figure 4: Data Extraction Pipeline

ACKNOWLEDGMENTS

The Slovenian Research Agency supported this work. This research was developed as part of the Graph-Massivizer project, funded under the Horizon Europe research and innovation programme of the European Union under grant agreement 101093202.

REFERENCES
[1] Patrick Brandtner and Marius Mates. 2021. Artificial intelligence in strategic foresight: current practices and future application potentials. In Proceedings of the 2021 12th International Conference on E-business, Management and Economics, 75–81.
[2] Alex Fergnani, Andy Hines, Alessandro Lanteri, and Mark Esposito. 2020. Corporate foresight in an ever-turbulent era. European Business Review 25 (2020), 26–33.
[3] Aric Hagberg, Pieter J. Swart, and Daniel A. Schult. 2008. Exploring network structure, dynamics, and function using NetworkX. Technical Report. Los Alamos National Laboratory (LANL), Los Alamos, NM (United States).
[4] Linmei Hu, Juanzi Li, Liqiang Nie, Xiao-Li Li, and Chao Shao. 2017. What happens next? Future subevent prediction using contextual hierarchical LSTM. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[5] Jon Iden, Leif B. Methlie, and Gunnar E. Christensen. 2017. The nature of strategic foresight research: a systematic literature review. Technological Forecasting and Social Change 116 (2017), 87–97. https://www.sciencedirect.com/science/article/pii/S0040162516306035
[6] Jong-Seok Kim and Dongsu Seo. 2023. Foresight and strategic decision-making framework from artificial intelligence technology development to utilization activities in small-and-medium-sized enterprises. Foresight 25, 6 (2023), 769–787.
[7] Benjamin Letham, Cynthia Rudin, and David Madigan. 2013. Sequential event prediction. Machine Learning 93 (2013), 357–380.
[8] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005 (2017).
[9] Freija van Duijne and Peter Bishop. 2018. Introduction to strategic foresight. Future 1 (2018), 67.
[10] Hanbyul Yeon, Seokyeon Kim, and Yun Jang. 2015. Visual analytics using topic composition for predicting event flow. KIISE Transactions on Computing Practices 21, 12 (2015), 768–773.
[11] Liang Zhao. 2021. Event prediction in the big data era. Comput. Surveys 54, 5 (2021), 1–37.
Generating Non-English Synthetic Medical Data Sets

Lenart Dolinar, University College London, London, United Kingdom
Erik Calcina, Jožef Stefan International Postgraduate School / Jožef Stefan Institute, Ljubljana, Slovenia
Erik Novak, Jožef Stefan International Postgraduate School / Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Using synthetic datasets to train medicine-focused machine learning models has been shown to enhance their performance; however, most research focuses on English texts. In this paper, we explore generating non-English synthetic medical texts. We propose a methodology for generating synthetic medical data, showcasing it by generating Greeklish medical texts relating to hypertension. We evaluate our approach with seven different language models and assess the quality of the datasets by training a classifier to distinguish between original and synthetic examples. We find that Llama-3 performs best for our task.

Keywords

Synthetic data, healthcare data, multilingual data, large language models, classification

1 Introduction

The healthcare domain produces a lot of medical data that can be used to train machine-learning models to help medical personnel. For example, a machine-learning model designed to perform Named Entity Recognition (NER) on electronic health records (EHRs) needs extensive labeled datasets to accurately identify medical terms like diseases, treatments, and patient details. However, the data contains a lot of personal information, and hospitals cannot share it freely due to data protection. In addition, there are not enough examples to train the models for some problems, such as those relating to rare diseases. Because of this, synthetic data is being used as a substitute to train the models.

Recently, synthetic medical data generated using LLMs has been used to enhance the performance of models for solving different natural language processing tasks in medicine. However, there are few examples of using them to generate non-English texts. Furthermore, language models have difficulties generating texts that do not reflect the distributions found in the training sample. This includes medical texts, which are usually not accessible to the general public.

This paper proposes a methodology for generating synthetic medical data using open-source large language models. We apply the methodology to a medical data set written in Greeklish, a combination of Greek and English scripts. We test it with seven large language models and assess performance by training a classifier to distinguish original examples from synthetic ones. Using the same prompt, we find that the open-source Llama-3 model best generates synthetic data that reflects the original data set.

The remainder of the paper is as follows: Section 2 presents the related work on generating synthetic data using large language models. Next, the proposed methodology is described in Section 3. The experiment setting is presented in Section 4, followed by the experiment results in Section 5. We discuss the results in Section 6 and conclude the paper in Section 7.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2024, 10–14 October 2024, Ljubljana, Slovenia
© 2024 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2024.sikdd.4
2 Related Work

This section describes the related work, focusing on large language models and methods for generating synthetic data.

2.1 Large language models

Large Language Models (LLMs) are models trained to generate human-like texts through an extensive training process over vast amounts of data. Models such as Llama 3 [2], GPT-4 [9], Aya 23 [3], and Mistral [7] are often easy to work with: the user provides an input textual prompt, based on which the model responds. LLMs are helpful in specialized fields such as medicine, since they can be fine-tuned on extensive data sets containing medical terms and concepts. This enables them to perform well in tasks such as medical synthetic data generation [12]. Despite that, they are sometimes unable to follow the instructions in the prompt accurately, leading them to hallucinate, i.e., confidently produce wrong responses [5]. In our experiments, we investigate the LLMs' performance in generating synthetic medical data given specific constraints and detailed prompts to simulate the original data set as closely as possible.

2.2 Synthetic medical data generation

Most synthetic data generation approaches focus on generating English texts. These usually utilize large language models trained on predominantly English documents retrieved from the web. One work focuses on generating a synthetic dataset of electronic health records of Alzheimer's Disease (AD) patients based on a provided label [8]. They find that the performance of their system for detecting AD-related signs and symptoms from EHRs improves vastly when trained on synthetic and original data sets, as opposed to training the system only on the original one. Another work investigated using LLMs for extracting structured information from unstructured healthcare text [13]. By generating synthetic data using LLMs and fine-tuning the model, they significantly improved the models' performance on medical named entity extraction and relation extraction tasks. Most related works focus on English synthetic data due to scarce non-English training data and the dominance of English in medical terminology [6]. This paper focuses on generating non-English texts, specifically medical texts written in Greeklish about hypertension.

3 Methodology

This section outlines our research methodology. We first present the pre-processing of the data set, followed by a description of the synthetic data generation process. Finally, we present the evaluation of the synthetic dataset using a classifier. Figure 1 shows a diagram giving an overview of the proposed methodology.

Figure 1: An overview of the methodology (patient records are translated to Latin script, synthetic patient records are generated with an LLM and prompt choice via DataDreamer, and the output is evaluated with a classifier and used to train an entity extraction LLM). The image was designed using resources from flaticon.
3.1 Data pre-processing

The data set used consisted of 1,299 examples of medical history in Greeklish, where the Latin and Greek scripts were used interchangeably. It also contained 1,495 labels, most of which were in English. The labels consisted of drugs, medical events, and measurements. To translate the labels into Greek, we used the NLLB-200 [14] translation model (https://huggingface.co/facebook/nllb-200-distilled-600M). Since LLMs are predominantly trained on texts written in Latin script, we decided to transliterate both the labels and the examples from Greek to Latin script. This allowed the LLMs to generate longer tokens with richer information. We split the original data set into two subsets to ensure no data leakage. The first one, consisting of 930 examples, was used for synthetic data generation. The second one, containing the remaining 369 examples, was used for evaluation.

3.2 Synthetic data generation

We utilized the datadreamer library [10] to generate the synthetic data set. The library enables open-source models to create synthetic data sets and was developed to work in research settings, supporting prompt templates and few-shot learning. We developed a prompt containing the instructions and restrictions on generating the examples. To better showcase the structure of the generated text, we also provided five random examples from the original data set as few-shot examples. Next, using datadreamer, we sent the prompt to the chosen LLM. We experimented with multiple LLMs, and about 800 examples were generated for each LLM. When experimenting with LLMs that required calling an external provider (e.g., OpenAI), we provided five static few-shot examples that did not include any patient personal data, due to data privacy concerns.

To ensure the quality of the generated data, we implemented a post-processing step. This included formatting the generated text into one line and excluding examples that were too long or where the model started repeating words meaninglessly. This ensured that all generated examples followed the same format and could be used for evaluation. Table 2 presents generated examples for the label "OSTEOPOROSH". Similarities in the examples highlight the need for rigorous methods to evaluate how closely they resemble the original data set; these methods are explained in Section 4.1.

3.3 Technical details

In this section, we describe the models and the parameters used in the experiment. All models used are available via HuggingFace's transformers library [15]. We tested five open-source models to generate the synthetic data sets, all of which can be run on a 32 GB GPU: Llama-3 [2] (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) only has support for the English language but has been fine-tuned to understand user prompts, a feature we expected would help a lot with the synthetic data generation. Aya-23 [3] (https://huggingface.co/CohereForAI/aya-23-8B) is a multilingual language model and offers support for 23 languages, including Greek. Mistral [7] (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) supports a variety of languages but omits Greek. The models Gemma-2 [4] (https://huggingface.co/google/gemma-2-9b-it) and Phi-3 [1] (https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) were also tested and compared in the experiments. In addition, we experimented with GPT-4o [9] and GPT-3.5-Turbo, which are accessible via the OpenAI API.

All models were given the same prompt, containing instructions that included: (1) generating Greek texts written in Latin script; (2) containing a label randomly selected from the original data set; (3) examples should be at most 6 words long; (4) responses should be concise; and (5) a structured format (all text must be in a single line, must use // and commas as separators, and must be similar in format to the provided few-shot examples). To stress the more important instructions, some were given in capital letters and were also repeated.
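As a rough illustration of Sections 3.2 and 3.3, the snippet below assembles such a prompt and generates one example with a locally run model via HuggingFace's transformers. This is a simplified stand-in for the authors' datadreamer workflow, and the instruction wording paraphrases the constraints above rather than reproducing the exact prompt; `original_examples` and `label_set` are assumed to come from the pre-processing step.

```python
import random
from transformers import pipeline

# Simplified stand-in for the datadreamer-based generation in Section 3.2.
generator = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3-8B-Instruct")

def build_prompt(original_examples, label_set):
    label = random.choice(label_set)                 # instruction (2)
    shots = "\n".join(random.sample(original_examples, 5))  # five few-shots
    return (
        "Generate ONE synthetic medical history entry in Greek written in "
        f"Latin script. It MUST contain the label '{label}'. Use AT MOST "
        "6 words, keep everything on a single line, and use // and commas "
        "as separators, matching the format of these examples:\n"
        f"{shots}\n"
    )

result = generator(build_prompt(original_examples, label_set),
                   max_new_tokens=40, do_sample=True)
print(result[0]["generated_text"])
```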
4 Experiment Setting

This section describes the experiment setting, which consists of the evaluation process and the metrics used to measure the approach's performance.

4.1 Evaluation approach

The quality of the generated synthetic data was measured in two parts. The first consisted of statistical measurements, such as calculating the average length of the generated examples and finding the proportion of examples that included the required labels. These statistics were then compared to the original data set. The second part consisted of training a classifier to discern whether the input text was from the original or the synthetic data set. The data set used to train and evaluate the classifier involved 369 randomly selected synthetic examples and 369 examples from the original data set, transliterated into Latin script. We chose 5-fold validation as our classification procedure and calculated the mean performance across all trials. The classifier was trained using the BERT [11] language model, specifically the bert-base-multilingual-cased variant (https://huggingface.co/google-bert/bert-base-multilingual-cased), with the following parameters: batch size = 16, epochs = 3, and learning rate = 2e-5. The same parameters were used for all synthetic data sets.

4.2 Metrics

To assess the quality of the generated synthetic data sets, we used the F1 score as our main metric for evaluating the classifier's performance. The target value was 0.5: if the performance is greater than 0.5, the classifier can discern the original from the synthetic examples, and hence the synthetic data does not reflect the original data set. If the performance is less than 0.5, the classifier has difficulties separating the synthetic from the original data, which can be because the synthetic data contains copies of the original examples. In addition to the F1 score, we measured the classifier's accuracy, precision, and recall, which are also reported.
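A condensed sketch of this evaluation protocol is shown below, assuming `texts` and `labels` (0 = original, 1 = synthetic) hold the 738 prepared examples. It fine-tunes the multilingual BERT variant with the stated hyperparameters inside a 5-fold loop; this is an outline of the described setup, not the authors' exact code.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL = "google-bert/bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(MODEL)

def to_dataset(idx):
    ds = Dataset.from_dict({"text": [texts[i] for i in idx],
                            "label": [labels[i] for i in idx]})
    return ds.map(lambda b: tok(b["text"], truncation=True), batched=True)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(texts, labels):
    model = AutoModelForSequenceClassification.from_pretrained(MODEL,
                                                               num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="clf",
                               per_device_train_batch_size=16,  # batch size 16
                               num_train_epochs=3,              # 3 epochs
                               learning_rate=2e-5),             # lr 2e-5
        train_dataset=to_dataset(train_idx),
        data_collator=DataCollatorWithPadding(tok))
    trainer.train()
    preds = trainer.predict(to_dataset(test_idx)).predictions.argmax(axis=-1)
    scores.append(f1_score([labels[i] for i in test_idx], preds))

# Mean F1 near 0.5 means the synthetic data is hard to tell apart from
# the original; well above 0.5 means it is easily detected.
print(f"mean F1: {np.mean(scores):.3f}")
```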
5 Results

In this section, we present the results of our experiment. We first present the statistical results, followed by the classifier's evaluation.

5.1 Statistical analysis

Table 1 compares the synthetic data sets and the original one regarding label occurrence and average example length. The label occurrence is 1.000 in the original data set, as all examples from the original data set are assumed to include the relevant labels and information. The most aligned synthetic data set regarding label occurrence was generated using GPT-4o, followed by Llama-3. However, in terms of average example length, the data set generated using Gemma-2 performed best, followed by Llama-3. The worst-performing models in terms of label occurrence were Mistral and Phi-3, which in about 25% of cases did not include the selected label. The data set generated using Aya-23 had the largest difference in terms of average example length, on average generating examples with three extra words. Looking at both statistics, we can conclude that Llama-3 had the best alignment to the original data set in terms of label occurrence and example length, closely followed by GPT-4o. To better illustrate the differences between the generated examples, we handpicked an example from each synthetic data set related to the label "OSTEOPOROSH", shown in Table 2.

Table 1: Statistical comparison between the original and synthetic data sets. The bold and underlined values represent the best and second-best statistics, respectively.

LLM               Label occurrence   Avg example length
original dataset  1.000              4.682
Llama-3           0.990              5.330 (+0.648)
Aya-23            0.949              8.040 (+3.358)
Mistral           0.740              6.376 (+1.694)
Gemma-2           0.988              4.207 (-0.475)
Phi-3             0.782              6.071 (+1.389)
GPT-4o            0.996              3.691 (-0.991)
GPT-3.5-Turbo     0.867              6.764 (+2.082)
5.2 The classifier evaluation

Table 3 shows the F1, precision, recall, and accuracy of the trained classifier on the different synthetic data sets. The best performance was achieved by Mistral, with approximately 0.85 in all four metrics, followed by Llama-3, with approximately 0.88 in all metrics. The worst performances were on the data sets generated by the Aya-23 and GPT-3.5-Turbo models. Surprisingly, Aya-23 is a language model supporting Greek; thus, it was expected to generate better examples.

Table 2: Generated examples for label "OSTEOPOROSH".

LLM               Examples
original dataset  APO 2O ETON YPERTASH ME AGOGI// OSTEOPOROSH // YPOTHYROIDISMOS
Llama-3           YPOTHYROEIDISMOS, OSTEOPOROSH, APO//
Aya-23            CA ORTHOU, ANEYRISMA KOILAKHS AORTHOU, OSTEOPOROSH.
Mistral           OSTEOPOROSH, APO 60 ETOS, APO 2 MHNES KAI APO 10 GRAMM
Gemma-2           OSTEOPOROSH, ARTHROSITIS, ETOVIR
Phi-3             OSTEOPOROSH, XAROSTHROMA, ALPHA-BISFIOVITINI, 2018, DIATHRHSH, DIA
GPT-4o            OSTEOPOROSH, ANEMIA
GPT-3.5-Turbo     OSTEOPOROSH, GASTREKTOMH, EMFISIMA, YDRONERFOSI, PSIXROS.

Table 3: Mean performance metrics of the classifier for synthetic data sets, with standard deviation. Performances closer to 0.5 are considered better. The bold and underlined values represent the best and second-best performances, respectively.

LLM            F1              Precision       Recall          Accuracy
Llama-3        0.875 ± 0.021   0.881 ± 0.020   0.875 ± 0.020   0.875 ± 0.020
Aya-23         0.945 ± 0.005   0.947 ± 0.004   0.945 ± 0.005   0.945 ± 0.005
Mistral        0.848 ± 0.012   0.856 ± 0.001   0.849 ± 0.011   0.849 ± 0.011
Gemma-2        0.928 ± 0.005   0.930 ± 0.005   0.928 ± 0.005   0.928 ± 0.005
Phi-3          0.927 ± 0.009   0.932 ± 0.008   0.927 ± 0.009   0.927 ± 0.009
GPT-4o         0.906 ± 0.014   0.912 ± 0.012   0.907 ± 0.014   0.907 ± 0.014
GPT-3.5-Turbo  0.940 ± 0.013   0.944 ± 0.011   0.940 ± 0.013   0.940 ± 0.013

6 Discussion

This section discusses the synthetic data generation performance, outlines our methodology's limitations and drawbacks, and proposes potential improvements to the approach.

6.1 LLM performance

The results in Table 1 show significant quality differences among the synthetic datasets from different LLMs, with label occurrence ranging from 0.740 for Mistral to 0.996 for GPT-4o, and average example length from 3.691 for GPT-4o to 8.040 for Aya-23. However, Table 3 indicates no significant performance differences within a single synthetic dataset, with the maximal standard deviation of the metrics being 0.021, for the Llama-3 dataset. We can also notice that the F1 and accuracy scores are very close for all synthetic data sets. This means the classifier was likely performing similarly on both classes (synthetic and original) without significant bias toward either class. We can observe much better performance on the Llama-3 data set, which is primarily trained on English data, than on the Aya-23 data set, which is also trained on Greek data. This shows that a model does not need to be extensively trained on Greek texts to generate this type of synthetic medical data well.

6.2 Limitations

Due to limited computing power, only one GPU with 32 GB of memory was available, restricting the testing of larger LLMs. To address these challenges, cloud-based resources or distributed computing could help run larger models and improve the variety of the generated synthetic data. Due to privacy concerns, when using the GPT-4o and GPT-3.5-Turbo models, which are not locally run, we had to use five fixed examples when generating synthetic data instead of a larger variety. This potentially led to larger similarities of the GPT synthetic datasets to those fixed examples instead of the original dataset and, consequently, worse performance.

6.3 Potential improvements

The prompt was the same for all seven LLMs and was primarily tested on Llama-3. Hence, the performance might be biased towards that model. The method could be improved by tailoring the prompts to each model individually. The evaluation of the synthetic datasets could be further extended by checking for repeated examples in the synthetic dataset or by checking how different each generated example is from the five provided examples. The evaluation could also be improved by checking for overfitting to the original data set.

7 Conclusion and Future Work

This paper presents a method for generating Greek synthetic medical data sets. To synthetically create datasets similar to the original, we carefully craft a prompt and perform pre-processing and post-processing of the data to increase performance and eliminate the effect of hallucinations. Using a classifier, and considering the inclusion of labels and the generated text length, we conclude that Llama-3 is best for generating examples that most closely resemble the original dataset. In the future, we plan to explore the underlying architectures of the models to understand their performance differences in multilingual contexts. This will allow us to further refine our methods and create more accurate data sets. Furthermore, we intend to use the synthetic dataset to train a named entity recognition (NER) system to recognize medical labels in medical history examples. Measuring the performance of a NER system trained on synthetic datasets will give us another way of evaluating their quality. We also intend to create a more general pipeline enabling the code to generate synthetic medical data in a wider variety of languages and formats.
Acknowledgments

This work was supported by the Slovenian Research Agency. Funded by the European Union. UK participants in Horizon Europe Project PREPARE are supported by UKRI grant number 10086219 (Trilateral Research). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency (HADEA) or UKRI. Neither the European Union nor the granting authority nor UKRI can be held responsible for them. Grant Agreement 101080288 PREPARE HORIZON-HLTH-2022-TOOL-12-01.

References
[1] Marah Abdin et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. 2024. arXiv: 2404.14219 [cs.CL]. url: https://arxiv.org/abs/2404.14219.
[2] AI@Meta. "Llama 3 Model Card". In: (2024). url: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[3] Viraat Aryabumi et al. Aya 23: Open Weight Releases to Further Multilingual Progress. 2024. arXiv: 2405.15032 [cs.CL].
[4] Google DeepMind Gemma Team. Gemma 2: Improving Open Language Models at a Practical Size. 2024. url: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf.
[5] Xu Guo and Yiqiang Chen. Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. 2024. arXiv: 2403.04190 [cs.LG]. url: https://arxiv.org/abs/2403.04190.
[6] Rainer Hamel. "The dominance of English in the international scientific periodical literature and the future of language use in science". In: AILA Review 20 (Dec. 2007), pp. 53–71. doi: 10.1075/aila.20.06ham.
[7] Albert Q. Jiang et al. Mistral 7B. 2023. arXiv: 2310.06825 [cs.CL]. url: https://arxiv.org/abs/2310.06825.
[8] Rumeng Li, Xun Wang, and Hong Yu. "Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data". In: Findings of the Association for Computational Linguistics: EMNLP 2023. 2023, pp. 7129–7143. doi: 10.18653/v1/2023.findings-emnlp.474.
[9] OpenAI et al. GPT-4 Technical Report. 2024. arXiv: 2303.08774 [cs.CL]. url: https://arxiv.org/abs/2303.08774.
[10] Ajay Patel, Colin Raffel, and Chris Callison-Burch. DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. 2024. arXiv: 2402.10379 [cs.CL]. url: https://arxiv.org/abs/2402.10379.
[11] Telmo Pires, Eva Schlinger, and Dan Garrette. "How Multilingual is Multilingual BERT?" In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 4996–5001. doi: 10.18653/v1/P19-1493.
[12] Karan Singhal et al. "Large language models encode clinical knowledge". In: Nature 620 (2023), pp. 172–180. doi: 10.1038/s41586-023-06291-2.
[13] Ruixiang Tang et al. Does Synthetic Data Generation of LLMs Help Clinical Text Mining? 2023. arXiv: 2303.04360 [cs.CL]. url: https://arxiv.org/abs/2303.04360.
[14] NLLB Team et al. No Language Left Behind: Scaling Human-Centered Machine Translation. 2022. arXiv: 2207.04672 [cs.CL]. url: https://arxiv.org/abs/2207.04672.
[15] Thomas Wolf et al. "Transformers: State-of-the-art natural language processing". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 2020, pp. 38–45. doi: 10.18653/v1/2020.emnlp-demos.6.

LLNewsBias: A Multilingual News Dataset for Lifelong Learning

Swati Swati, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, swati.swati@unibw.de
Dunja Mladenić, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, dunja.mladenic@ijs.si
Abstract

The rise of digital media enhances information accessibility but also introduces challenges related to the quality and impartiality of news reporting, particularly regarding biases that influence public perception during key global events. In response, this study introduces LLNewsBias, a dataset designed to detect and analyze political bias in multilingual news headlines, covering four major events from 2019 to 2022 — Brexit, COVID-19, the 2020 U.S. election, and the Ukraine-Russia war. With over 350,000 headlines in 17 languages, annotated with bias labels, this dataset is compiled using Media Bias/Fact Check and Event Registry. Our contributions include a structured framework for data collection and organization, enabling event-wise and year-wise analysis while supporting lifelong learning. We also highlight potential use cases that demonstrate the dataset's utility in advancing bias prediction models, multilingual adaptation, and model robustness. Additionally, we discuss the dataset's limitations, addressing potential biases, sample size constraints, and contextual factors. This work provides a valuable resource for improving bias detection in dynamic, multilingual news environments, contributing to the development of more accurate and adaptable models in natural language processing and media studies. For code and additional insights, visit: https://github.com/Swati17293/LLNewsBias

Keywords

Dataset, News, Bias, Multilingual, Headline, Low-resource, Media Bias, News Bias, Continual Learning, Lifelong Learning

1 Introduction

The rapid growth of digital media has greatly enhanced the accessibility of information, but it has also introduced significant challenges concerning the quality and impartiality of news reporting. Political bias in news content is particularly concerning, as it has the potential to influence public perception and shape societal narratives, especially around key global events. Understanding and predicting such biases, particularly in multilingual contexts where biases can manifest differently across cultural and linguistic boundaries, is essential for promoting fair and balanced journalism. Traditional approaches to bias detection often rely on monolingual datasets and static models that may not effectively capture the evolving nature of news content [6]. These limitations underscore the need for more robust datasets and methodologies that can adapt to the dynamic and multilingual landscape of modern news reporting.

In this study, we address these challenges by introducing a novel dataset, LLNewsBias, specifically designed for the detection and analysis of political bias in multilingual news headlines. Our dataset spans four major global events from 2019 to 2022: Brexit, COVID-19, the 2020 U.S. election, and the Ukraine-Russia war, capturing a wide range of political discourse across 17 languages. To collect this dataset, we use Media Bias/Fact Check for the assignment of bias labels, and Event Registry [2] for the extraction of relevant headlines and metadata. The resulting dataset is not only comprehensive in its linguistic diversity but also structured to support both event-wise and year-wise analyses, with an emphasis on lifelong learning.
Under- support for lifelong learning, our study contributes to the ongoing standing and predicting such biases, particularly in multilingual effort to develop more accurate and adaptable models for bias contexts where biases can manifest differently across cultural and detection in diverse linguistic and cultural contexts. linguistic boundaries, is essential for promoting fair and balanced journalism. Traditional approaches to bias detection often rely 2 Related Work on monolingual datasets and static models that may not effec- tively capture the evolving nature of news content [6]. These Several datasets focus on news articles and political bias [5], limitations underscore the need for more robust datasets and but there is a notable scarcity of multilingual, bias-annotated methodologies that can adapt to the dynamic and multilingual datasets designed for lifelong learning [4]. While resources like landscape of modern news reporting. the media bias chart by Ad Fontes Media and PolitiFact provide insights into bias, they are often limited to English-language Permission to make digital or hard copies of all or part of this work for personal sources or specific fact-checked claims, lacking the continuous, or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and event-centric data necessary for broader analysis. GDELT [3], the full citation on the first page. Copyrights for third-party components of this a large-scale event-oriented news dataset, covers multiple lan-work must be honored. For all other uses, contact the owner/author(s). guages but focuses on location, network, and temporal attributes Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia rather than political bias or the event-outlet relationship. Exist- © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.8 ing multilingual datasets are often domain-specific [1], limiting 97 Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia Swati et al. their utility for general bias analysis. In contrast, LLNewsBias exclude outlets labeled as questionable and assign each remain- dataset fills these gaps by offering a generalized, multilingual, ing outlet 𝑜 ∈ 𝑂 a bias label 𝑏 ∈ 𝐵, where 𝐵 = {𝑏 } 𝑖 𝑖 1, 𝑏2, ..., 𝑏𝑞 and bias-annotated data designed for event-wise and year-wise represents the set of bias labels, with 𝑞 representing the number analyses, particularly suited for lifelong learning models. of distinct bias labels. Next, we define a temporal query 𝑄 to extract article headlines 𝑡 3 Dataset Description (𝐻 = {ℎ1, ℎ2, ..., ℎ }), where 𝑟 represents the total number of 𝑟 headlines retrieved from the Event Registry (ER). The query 𝑄 In this section, we introduce our dataset LLNewsBias and describe 𝑡 is formulated as: the framework used for its collection and organization. We begin by detailing the primary data sources that form the foundation of 𝑄 = {𝑄 , 𝑄 , 𝑄 , 𝑄 } (1) 𝑡 𝑒 𝑜 𝑐𝑎𝑡 𝑑 𝑡 this dataset. Following this, we present a comprehensive overview where 𝑄 , 𝑄 , 𝑄 specify the event, media outlet, and news 𝑒 𝑜 𝑐𝑎𝑡 of the data collection process, with a focus on the methodologies categories (limited to those classified as ’news’ by ER 𝑄 = 𝑐𝑎𝑡 employed to ensure robustness and reliability. 
3 Dataset Description

In this section, we introduce our dataset LLNewsBias and describe the framework used for its collection and organization. We begin by detailing the primary data sources that form the foundation of this dataset. Following this, we present a comprehensive overview of the data collection process, with a focus on the methodologies employed to ensure robustness and reliability. Finally, we provide an in-depth overview of the dataset's structure, including its directory organization, file contents, and the various ordering methods applied to facilitate detailed analysis. Our dataset is documented in accordance with the FAIR Data Principles.

3.1 Primary Data Sources

In this section, we outline the two primary data sources used in our study: Media Bias/Fact Check (MBFC) and Event Registry (ER). MBFC serves as the bias rating portal, providing bias labels for selected media outlets, while ER is used to extract the headlines and corresponding metadata from articles published by these outlets.

3.1.1 Media Bias/Fact Check. For bias labeling in this study, we utilized Media Bias/Fact Check (MBFC), a well-established platform known for its comprehensive coverage and frequent updates. Although other platforms like allsides.com and adfontesmedia.com also provide bias ratings, MBFC was selected for its reliability and particular focus on low-resource languages. MBFC assigns bias labels based on political orientation and evaluates outlets for credibility and factual accuracy. These labels are determined by a team of contractors and volunteers who follow a standardized methodology, ensuring that the ratings are both consistent and dependable for our analysis.

3.1.2 Event Registry. In this study, we use the Event Registry [2] platform as the primary source for collecting multilingual news headlines. It aggregates content from over 150,000 news sources across more than 60 languages, making it an ideal resource for analyzing bias in diverse and low-resource languages. Apart from the headlines, it provides access to rich metadata such as the publication date, news category, and political bias. By leveraging its Python API, we efficiently filtered and extracted headlines relevant to our study. This ensured a comprehensive dataset that supports the analysis of bias in a lifelong learning setup, exploring how emerging events and domain shifts influence the performance of bias prediction models over time.

3.2 Data Collection Framework

Our data collection framework, as depicted in Figure 1, is designed to support both event-wise and year-wise analyses, with the additional capability of facilitating lifelong learning.

Figure 1: Data Collection Framework. The framework uses MBFC for bias labeling and ER for headline retrieval.

For data collection, we begin by defining two sets: a set of significant global events, E = {e_1, e_2, ..., e_n}, and a set of years, Y = {y_1, y_2, ..., y_m}, where n and m represent the total number of events and years, respectively. We then use the Media Bias/Fact Check (MBFC) platform to select media outlets, O = {o_1, o_2, ..., o_p}, and determine their respective political bias, with p as the total number of outlets. To maintain data reliability, we exclude outlets labeled as questionable and assign each remaining outlet o_i ∈ O a bias label b_i ∈ B, where B = {b_1, b_2, ..., b_q} represents the set of bias labels, with q the number of distinct bias labels.

Next, we define a temporal query Q_t to extract article headlines H = {h_1, h_2, ..., h_r}, where r represents the total number of headlines retrieved from the Event Registry (ER). The query Q_t is formulated as:

    Q_t = {Q_e, Q_o, Q_cat, Q_d}    (1)

where Q_e, Q_o, and Q_cat specify the event, the media outlet, and the news categories (limited to those classified as 'news' by ER: Q_cat = {'politics', 'business', 'sports', 'arts and entertainment', 'science', 'technology', 'health', 'environment'}), respectively. The time constraint is represented as Q_d = [Q_sd, Q_ed], where Q_sd and Q_ed denote the start and end dates. To scrape all the article headlines H, we utilize Q_t to query ER.

We then associate the extracted headlines H with the corresponding bias labels in B and structure the dataset according to two classification types: event-wise and year-wise. To organize the data, we define an event-based order O_event and a year-based order O_year as follows:

    O_event = {e_1 → e_2 → ... → e_n}    (2)

    O_year = {y_1 → y_2 → ... → y_m}    (3)

For lifelong learning, we designed the dataset with a flexible framework that allows for the seamless integration of new events and years as they emerge, denoted by E′ ⊆ E and Y′ ⊆ Y, where E′ and Y′ represent the sets of newly added events and years. This structured approach ensures scalability for continuous learning without requiring major restructuring and supports the training of adaptive models capable of integrating new information effectively. Unlike standard multi-year datasets, our dataset includes annotations that facilitate contextual understanding, enabling models to learn from historical data while adapting to evolving trends and patterns in news reporting. This ensures that the models remain relevant as new information becomes available.

Finally, we split the dataset into training and test sets using a stratified sampling approach to ensure the preservation of bias label distributions across both events and years. We perform this step as it is critical for maintaining the integrity of the model training process in a lifelong learning context.
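To make the notation concrete, the sketch below mirrors Eqs. (1)–(3) as plain Python structures. The dictionary layout and the `outlets` list are illustrative assumptions, since the actual retrieval goes through Event Registry's own Python API with its own parameter names.

```python
# Plain-Python mirror of the collection framework in Section 3.2.
NEWS_CATEGORIES = ["politics", "business", "sports", "arts and entertainment",
                   "science", "technology", "health", "environment"]

def build_query(event, outlet, start_date, end_date):
    """Temporal query Q_t = {Q_e, Q_o, Q_cat, Q_d} from Eq. (1)."""
    return {"Q_e": event,                   # event of interest
            "Q_o": outlet,                  # MBFC bias-labelled outlet
            "Q_cat": NEWS_CATEGORIES,       # categories classified as 'news' by ER
            "Q_d": (start_date, end_date)}  # time constraint [Q_sd, Q_ed]

# Event- and year-based orders O_event and O_year, Eqs. (2)-(3); new events
# and years (E', Y') can simply be appended for lifelong learning.
O_EVENT = ["brexit", "covid", "election", "ukr-rus-war"]
O_YEAR = [2019, 2020, 2021, 2022]

# `outlets` is the MBFC-filtered outlet list O (assumed prepared elsewhere).
queries = [build_query(event, outlet, f"{year}-01-01", f"{year}-12-31")
           for event in O_EVENT for year in O_YEAR for outlet in outlets]
```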
3.3 Data Synopsis and Structure

In this section, we present an overview of the data and explain how it is systematically organized, making it easier to understand both the content and format of our dataset.

3.3.1 Data Synopsis. The dataset features 356,060 headlines on four major events from 2019 to 2022: Brexit, COVID-19, the election, and the Ukraine-Russia war. These headlines, sourced from 45 unique news outlets in 17 different languages, are annotated with 3 political bias labels (Left Centre, Least Biased, and Right Centre), covering diverse topics such as politics, business, arts and entertainment, sports, science, technology, health, and environment. The dataset is structured into 7 distinct columns within .csv files. Table 1 presents a comprehensive summary of the dataset statistics.
3.3 Data Synopsis and Structure

In this section, we present an overview of the data and explain how it is systematically organized, making it easier to understand both the content and format of our dataset.

3.3.1 Data Synopsis. The dataset features 356,060 headlines on four major events from 2019 to 2022: Brexit, COVID-19, the election, and the Ukraine-Russia war. These headlines, sourced from 45 unique news outlets in 17 different languages, are annotated with 3 political bias labels (Left Centre, Least Biased, and Right Centre) and cover diverse topics such as politics, business, arts and entertainment, sports, science, technology, health, and environment. The dataset is structured into 7 distinct columns within .csv files. Table 1 presents a comprehensive summary of the dataset statistics.

Figure 1: Data Collection Framework. The framework uses MBFC for bias labeling and ER for headline retrieval.

Table 1: Summary of Dataset Statistics.

Language-wise Distribution:
Catalan 882; Croatian 13,929; Czech 1,876; Danish 4,330; Dutch 10,905; Finnish 1,512; French 85,007; Hungarian 105; Italian 48,450; Romanian 17,038; Russian 10,511; Slovak 5,642; Spanish 83,940; Swedish 6,441; Ukrainian 10,616

Event-wise Distribution:
Brexit 32,286; COVID 309,329; Election 3,829; Ukraine 10,616

Year-wise Distribution:
2019 20,664; 2020 258,871; 2021 4,638; 2022 71,887

3.3.2 Directory Structure. The dataset is organized in a main 'data' directory with subdirectories categorized by events ('brexit', 'covid', 'election', 'ukr-rus-war') and years (2019-2022). Additional subdirectories consolidate data across all events (ordered_events) and all years (ordered_years). Each subdirectory contains .csv files for training and testing, structured across the following columns:
• news outlet: The name of the news outlet.
• article_ID: A unique identifier for the raw news article in the Event Registry platform from which the headlines are extracted.
• language: The source language of the published news article.
• date: The date on which the news was published.
• headline_text: The text of the news headline.
• news_category: The category assigned by Event Registry.
• political_bias: The political bias of the news outlet as provided by the bias rating portal Media Bias/Fact Check.

The dataset is annotated with bias labels: Left Centre (LC), Least Biased (LB), and Right Centre (RC). To ensure model robustness across varying data distributions, we concatenate and shuffle the files for each event and year in four distinct random orders. This prevents overfitting to specific sequences and helps evaluate generalization across diverse configurations. While chronological order is ideal for practical use, this randomized approach tests broader performance, with the original event and year splits provided for user flexibility.

Event-wise Ordering:
(1) brexit → covid → election → ukr-rus-war
(2) election → covid → ukr-rus-war → brexit
(3) brexit → ukr-rus-war → election → covid
(4) covid → brexit → ukr-rus-war → election

Year-wise Ordering:
(1) 2019 → 2020 → 2021 → 2022
(2) 2021 → 2020 → 2022 → 2019
(3) 2019 → 2022 → 2021 → 2020
(4) 2020 → 2019 → 2022 → 2021

The dataset captures the distribution of headlines related to various events over the years, reflecting the temporal dynamics of news coverage and the evolving reporting on these events. The differences in coverage levels reveal important patterns in media attention, which are essential for developing datasets that support lifelong learning models.
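Under the directory layout of Section 3.3.2, a continual-learning task sequence can be assembled directly from the per-event subdirectories. The sketch below assumes hypothetical train/test file-name patterns; only the directory names and the ordering come from the description above.

    import glob
    import pandas as pd

    EVENT_ORDER = ["brexit", "covid", "election", "ukr-rus-war"]  # event-wise ordering (1)

    tasks = []
    for event in EVENT_ORDER:
        # File-name patterns are assumptions; adjust to the released layout.
        train = pd.concat(pd.read_csv(f) for f in sorted(glob.glob(f"data/{event}/*train*.csv")))
        test = pd.concat(pd.read_csv(f) for f in sorted(glob.glob(f"data/{event}/*test*.csv")))
        tasks.append((event, train, test))

    # `tasks` can now be replayed sequentially by a lifelong-learning trainer.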
4 Potential Use-Cases

The dataset introduced in this study has a wide range of potential use-cases, particularly in the fields of natural language processing and media studies. It is especially valuable for research and applications that require understanding and predicting news bias in a continual, multilingual environment. Below we list some potential use cases:

• Lifelong learning for news bias prediction: Our dataset is ideal for developing and testing lifelong learning models. It allows models to adapt to new events and evolving entities. With its year-wise structure from 2019 to 2022, the dataset addresses the challenges of emerging events and domain shifts (e.g., Brexit, COVID-19, Ukraine-Russia War), providing the data needed to develop and evaluate robust models.

• Domain Adaptation in Multilingual Contexts: Our dataset enables researchers to investigate domain adaptation techniques in a multilingual context, featuring headlines in 17 languages. This facilitates the development of models that generalize across languages and adapt to various cultural and political contexts, ensuring accurate bias prediction. It addresses the challenges faced by generic models in the news domain, which often struggle with topic and language diversity.

• Sparse Experience Replay for Continual Learning: Our dataset is particularly well-suited for the news domain, supporting efficient experience replay by allowing the selection of specific topics and categories. With its event-wise and year-wise classifications, our dataset enhances memory utilization, improves generalization, reduces catastrophic forgetting, and ensures that models remain accurate and up-to-date in real-time applications.

In a nutshell, our dataset serves as a valuable resource for advancing news bias prediction, particularly in the context of lifelong learning, by providing a flexible framework for integrating new events and years. Unlike many news-based datasets with timestamps, it offers structured annotations and contextual information that enhance the understanding of evolving trends in news coverage, making it particularly suitable for lifelong learning applications. It supports a range of research activities, from model development and evaluation to the exploration of new techniques for handling dynamic and multilingual news environments.

5 Limitations

Several limitations are associated with the dataset presented in this article and should be carefully considered in any further research or analysis:

• Data Collection Issues: The dataset was gathered using Media Bias/Fact Check (MBFC) and the paid version of Event Registry (ER). MBFC is publicly accessible, while ER provided comprehensive but limited coverage, potentially missing relevant articles. The use of ER's paid version also restricted the extent of data collection.

• Sample Size: The dataset is constrained by its focus on four major events over a span of four years. This limited number of events and time frame may not fully capture the broader spectrum of news and media biases, affecting the diversity of the samples.

• Biases: Selection bias is a significant factor, as only news outlets labelled by Media Bias/Fact Check were included. This restriction may limit the number of languages and perspectives represented in the dataset, thereby influencing the overall analysis.

• Contextual Factors: The dataset is limited by its temporal scope, covering only four specific events over four years. While it reflects the dynamic nature of news media, it does not account for all future events and years to come.
6 Conclusions

In this study, we present LLNewsBias, a comprehensive dataset designed to tackle the challenges of detecting and analyzing political bias in multilingual news headlines. By spanning four major global events from 2019 to 2022 across 17 languages, this dataset provides a valuable resource for research in natural language processing and media studies. Our framework supports both event-wise and year-wise analysis, emphasizing lifelong learning and enabling models to adapt continuously to new data. The dataset's potential use cases include enhancing bias prediction models, facilitating domain adaptation in multilingual contexts, and improving model robustness. While LLNewsBias offers significant contributions, we also acknowledge limitations such as potential biases in data collection, sample size constraints, and contextual factors. Addressing these challenges in future work will be crucial for maximizing the dataset's impact, ultimately contributing to fairer and more balanced journalism.

7 Acknowledgments

This work was supported by the Slovenian Research Agency and national grants (CRP V2-2272; V5-2264; CRP V2-2146), and by the European Union through the enrichMyData EU HORIZON-IA project under grant agreement No 101070284 and the ELIAS HORIZON-RIA project under grant agreement No 101120237.

References

[1] Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, and Jens Lehmann. 2020. MLM: A benchmark dataset for multitask learning with multiple languages and modalities. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2967–2974.
[2] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: Learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107–110.
[3] Kalev Leetaru and Philip A. Schrodt. 2013. GDELT: Global data on events, location, and tone, 1979–2012. In ISA Annual Convention, Vol. 2, 1–49.
[4] Swati Swati, Adrian Mladenić Grobelnik, Dunja Mladenić, and Marko Grobelnik. 2023. A commonsense-infused language-agnostic learning framework for enhancing prediction of political bias in multilingual news headlines. Knowledge-Based Systems, 277, 110838.
[5] Swati Swati, Dunja Mladenić, and Tomaž Erjavec. 2021. EveOut: An event-centric news dataset to analyze an outlet's event selection patterns. Informatica, 45, 7.
[6] Swati Swati, Dunja Mladenić, and Marko Grobelnik. 2023. An inferential commonsense-driven framework for predicting political bias in news headlines. IEEE Access.

Creating Local World Models using LLMs

Mark David Longar, Erik Novak, Marko Grobelnik
Jožef Stefan Institute, Ljubljana, Slovenia
https://doi.org/10.70314/is.2024.sikdd.22

Abstract

A key limitation of state-of-the-art large language models is their lack of a consistent world model, which hinders their ability to perform unseen multi-hop reasoning tasks. This paper addresses this by extracting local world models from text into a systematic first-order logic framework, enabling structured reasoning. Focusing on the educational domain, we present a multi-step approach using Prolog to represent and reason with these models. Our method involves segmenting educational texts, generating Prolog definitions, and merging them into a comprehensive knowledge graph. We successfully extracted several small models and manually verified their accuracy, demonstrating the potential of this approach. While promising, our results are currently limited to small-scale models.

Keywords

Large language models, local world models, knowledge representation, educational technology, structured reasoning, knowledge graphs

1 Introduction

In recent years, Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP), offering unprecedented capabilities in understanding, reasoning over, and generating human-like text.
Despite their impressive performance across various language tasks, a significant limitation persists – the absence of a consistent and coherent world model within these systems [8]. This limitation hampers their ability to perform advanced reasoning tasks that require not only textual understanding but also logical consistency and structured knowledge representation.

While current LLMs are powerful, they are inherently constrained by their reliance on statistical correlations within vast datasets, often resulting in shallow and contextually inconsistent reasoning. To address this limitation, we propose an approach for extracting local world models, i.e., small, context-specific representations of knowledge that capture the relationships and rules governing a particular domain or scenario. The approach is multi-step. First, the input text is segmented into manageable parts. Each segment is analyzed to extract key concepts and their interrelationships, which are then represented as Prolog definitions. Then, the definitions are merged into a comprehensive knowledge graph that reflects the structure and content of the input text.

We focus specifically on the educational domain, where the ability to generate and utilize local world models could significantly enhance the effectiveness of AI-driven educational tools, e.g., by providing LLMs a framework for responding with logically consistent and pedagogically sound explanations. Moreover, by modifying some of the components, the approach can also be applied to other domains, such as industry, finance, and law.

The remainder of the paper is as follows: Section 2 presents the related work on LLMs and creating world models. Next, the proposed approach is described in Section 3. The experiment setting is presented in Section 4, followed by the experiment results in Section 5. We discuss the results in Section 6 and conclude the paper in Section 7.

2 Related Work

The recent surge in large language models, such as GPT-3 [3] and GPT-4 [1], has significantly advanced natural language processing, showing emergent reasoning abilities across various tasks. However, despite their impressive performance, LLMs are often criticized for lacking factual consistency, interpretability, and logical coherence, especially in complex, multi-hop reasoning tasks [8]. To address these shortcomings, efforts have been made to integrate LLMs with structured knowledge frameworks, like knowledge graphs (KGs) and ontologies, to enhance reasoning and knowledge flow between structured data and language models [9].

In the field of ontology and KG development, early initiatives like Cyc [6] laid the groundwork for large-scale structured knowledge representation. More recent efforts [8, 5] have explored using LLMs to assist in ontology generation and KG construction. While LLMs can automate parts of the ontology development process, they struggle with ensuring logical consistency and managing complex domain-specific knowledge [5, 2]. Complementary approaches, like using LLMs for ontology learning [2] and structured knowledge extraction [10], highlight the need for human validation and formal methods to ensure accuracy.

Our work builds on these insights by focusing on using LLMs to extract structured local world models in the form of Prolog-based representations. This approach addresses the limitations of LLMs in handling complex reasoning and provides a more robust, logically consistent framework for educational applications.

3 Methodology

This section introduces the approach for creating local world models by generating and utilizing structured data in Prolog. The methodology is designed to systematically identify and map the concepts and their interrelationships within a given educational document, such as a textbook, facilitating the generation of a knowledge graph.

3.1 Document segmentation

To manage the document's complexity and ensure accurate concept extraction, the source material was divided into several shorter parts, each up to 10 pages long. This segmentation was crucial in allowing us to focus on smaller, more manageable sections of the content, enabling a thorough analysis and avoiding problems that come with long-context LLM outputs. The length of each part was determined based on the natural divisions within the text, such as chapters or major sections, to maintain the coherence of concepts within each segment.
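As an illustration of this step, the following sketch splits a Markdown chapter on its top-level headings and greedily packs consecutive divisions into segments under a size cap. The character limit is an assumption standing in for the ~10-page bound used in the paper.

    import re

    MAX_CHARS = 30_000  # assumed proxy for the ~10-page segment limit

    def segment_document(markdown_text: str) -> list[str]:
        """Split on natural divisions (# / ## headings), then pack
        consecutive divisions into segments below the size cap."""
        parts = re.split(r"\n(?=#{1,2} )", markdown_text)
        segments, current = [], ""
        for part in parts:
            if current and len(current) + len(part) > MAX_CHARS:
                segments.append(current)
                current = part
            else:
                current = f"{current}\n{part}" if current else part
        if current:
            segments.append(current)
        return segments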
3.2 Generating Prolog definitions

For each segmented part, we created a prompt to generate Prolog definitions of the concepts and their relationships. The prompt was carefully crafted to guide the extraction of educational content in a structured format. It consisted of three main components: the context, the predicates, and the structured output.

Context. A description of the educational context and a brief narrative to position the content within a learning scenario. This helped to align the LLM-extracted concepts and relationships with our downstream tasks. The following is an example of the prompt used:

"You are a teacher and an expert in natural language processing (NLP). You wrote a chapter in an NLP textbook and would like to convert the content of the chapter into a classroom lesson. You would like to step into the shoes of a student in order to understand their learning process of this material. You need to understand which concepts are being taught and their relationships."

Predicates. A list of predicates and their descriptions, which were essential for identifying concepts (isConcept(A)), prerequisites (isPrerequisiteOf(A, B)), and sections (isSection(S)). These predicates were used to simulate the learning process, where concepts are linked to sections. A concept may have prerequisite concepts or sections that must be understood before a student can advance to learning the concept.

Structured output. Clear instructions to output the extracted predicates in the form of a Prolog program. The LLM responding in a structured format is a crucial part of our approach, as it has been shown that structured responses can improve LLM reasoning and generation quality [13].

In summary, this prompt allowed us to extract detailed summaries of the concepts taught and their relationships, which were then represented in Prolog. Each segment was processed independently to generate a corresponding Prolog program.

3.3 Merging Prolog definitions

After generating the Prolog definitions for each segment, the next step was to merge them into a single cohesive program. To achieve this, we created a prompt which was nearly identical to the first, but with instructions to combine the disjoint parts into one integrated Prolog program added to the end of the prompt:

"Now you need to combine the parts into a single Prolog program. Make sure to include all the concepts and relationships, but also properly connect them. Merge concepts from different sections where necessary and make sure to include all the sections and their relationships."

3.4 Use of the knowledge graph

The generated knowledge graph, represented by the Prolog program, was then used to recommend the next steps in the learning process. Using the structured output, we created a detailed concept map that helped identify key learning paths and prerequisites. Prolog (specifically SWI-Prolog [11]) was chosen for this task because it can handle structured data, is widely used (increasing the likelihood that LLMs have encountered it during training), and can be executed and analyzed immediately.
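While our merging is performed by the LLM itself, the generated programs can also be processed mechanically. The sketch below is a simplified stand-in rather than the paper's procedure: it parses the three predicates from Section 3.2 out of several generated programs, takes their set union, and derives the "next step" recommendation of Section 3.4 as the set of concepts whose prerequisites are already known.

    import re
    from collections import defaultdict

    FACT = re.compile(r"(isConcept|isSection|isPrerequisiteOf)\(([^)]*)\)\s*\.")

    def parse_program(prolog_src: str):
        """Collect concept, section, and prerequisite facts from one program."""
        concepts, sections, prereqs = set(), set(), set()
        for pred, args in FACT.findall(prolog_src):
            parts = tuple(a.strip() for a in args.split(","))
            if pred == "isConcept":
                concepts.add(parts[0])
            elif pred == "isSection":
                sections.add(parts[0])
            else:  # isPrerequisiteOf(A, B): A must be understood before B
                prereqs.add(parts)
        return concepts, sections, prereqs

    def merge_programs(programs):
        """Set-union merge of per-segment extractions into one graph."""
        concepts, sections, prereqs = set(), set(), set()
        for src in programs:
            c, s, p = parse_program(src)
            concepts |= c; sections |= s; prereqs |= p
        return concepts, sections, prereqs

    def learnable_next(concepts, prereqs, known):
        """Concepts whose prerequisites are all already known."""
        needs = defaultdict(set)
        for a, b in prereqs:
            needs[b].add(a)
        return {c for c in concepts - known if needs[c] <= known}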
4 Experiment Setting

This section outlines the experiment setting for evaluating our approach to extracting local world models from educational texts and generating structured Prolog representations. We describe the data sources, the large language model used, and the evaluation framework.

4.1 Data sources

We evaluated our approach on two widely used textbooks in deep learning and natural language processing. These texts were chosen because they are relevant to both structured reasoning tasks and the representation of complex, multi-step concepts. The following chapters were selected for analysis:

Deep Learning Preliminaries from the book Dive into Deep Learning [12]. This chapter provides foundational knowledge of deep learning, covering key concepts such as linear algebra, calculus, and probability, which are essential for understanding the field. The textbook's teaching approach is highly hands-on, with a significant portion devoted to code. It is open-sourced, and we used the Markdown files provided on their GitHub page (https://github.com/d2l-ai/d2l-en).

Chapter 2: Regular Expressions, Tokenization, and Edit Distance from Speech and Language Processing [4]. This chapter introduces basic NLP techniques, focusing on regular expressions and tokenization, which are pivotal in text preprocessing tasks.

4.2 Used large language model

We employed GPT-4o via the ChatGPT interface to extract concepts and their interrelationships. We leveraged the model's multimodal capabilities, allowing it to process text and PDF documents.

4.3 Evaluation Framework

We developed an evaluation framework to assess the performance of our approach based on three primary aspects: accuracy, completeness, and consistency. To validate the results, we manually reviewed the extracted knowledge graphs and compared them with the source texts. We ensured that the extracted concepts were accurate, complete, and logically consistent.

Assessment Criteria. The following criteria were used to evaluate the effectiveness of our approach:
• Accuracy. This aspect examines how accurately the approach extracted the concepts and their relationships from the text. We evaluated the correctness of each Prolog definition against the source material.
• Completeness. This evaluates whether the system captured all the key concepts from the educational material. The assessment ensured that no significant concepts or relationships were omitted during extraction.
• Consistency. This aspect assesses the extent to which the extracted models maintained logical coherence across different segments of the text. This was crucial in determining whether the segmented Prolog definitions could be merged into a cohesive KG (one such automated check is sketched below).
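Although our consistency review was manual, part of it lends itself to automation. One hypothetical check, building on the `prereqs` set from the earlier sketch, verifies that the merged prerequisite relation is acyclic, since a cyclic prerequisite chain would make the simulated learning process impossible to order:

    from collections import defaultdict

    def find_prerequisite_cycle(prereqs):
        """Return one cycle in the prerequisite relation, or None if acyclic."""
        graph = defaultdict(set)
        for a, b in prereqs:
            graph[a].add(b)
        visiting, done = set(), set()

        def dfs(node, path):
            visiting.add(node)
            path.append(node)
            for nxt in graph[node]:
                if nxt in visiting:          # back edge: cycle found
                    return path[path.index(nxt):] + [nxt]
                if nxt not in done:
                    cycle = dfs(nxt, path)
                    if cycle:
                        return cycle
            visiting.discard(node)
            done.add(node)
            path.pop()
            return None

        for start in list(graph):
            if start not in done:
                cycle = dfs(start, [])
                if cycle:
                    return cycle
        return None  # acyclic: the KG admits a valid learning order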
5 Results

In this section, we review the knowledge graphs of the two tested texts generated by our model.

5.1 Dive into Deep Learning

The selected chapter covered six sub-chapters in the following order: Data Manipulation, Data Preprocessing, Linear Algebra, Calculus, Automatic Differentiation, and Probability and Statistics. The results are represented by the graph in Figure 1.

Figure 1: Knowledge graph of the Preliminaries section from Dive into Deep Learning. Extracted nodes include Deep Learning Prerequisites, Linear Algebra Basics, Calculus Basics, Probability Basics, Tensor Operations, Matrix Multiplication, Gradient Descent, Chain Rule, Statistics Basics, Data Preprocessing, Broadcasting Techniques, Optimization Techniques, Backpropagation, Stochastic Models, Automatic Differentiation, and Loss Function Optimization.

The system accurately identified three major independent branches of the chapter – Linear Algebra, Calculus, and Probability and Statistics – which reflects the structure of the source material. The extracted knowledge graph also logically restructured the content in ways that differed from the original organization but made sense pedagogically. This restructuring highlights the logical flow of how data handling techniques naturally feed into more abstract mathematical concepts, despite differing from the original structure.

However, some omissions and reassignments were noted, particularly within the Linear Algebra section. Concepts such as vectors and matrices were omitted, likely due to the high-level nature of the extraction process. Additionally, matrix multiplication, though identified, was separated from Linear Algebra basics and Tensor operations. This disjunction represents a slight deviation from the expected conceptual hierarchy.

Similarly, in the Calculus section, the extracted model restructured the sequence of topics. This restructuring captured the relationship between fundamental calculus concepts and their practical applications in machine learning. Furthermore, the system included concepts like Gradient Descent and Backpropagation, which were only briefly mentioned in the source material.

5.2 Speech and Language Processing

The Regular Expressions section, seen in Figure 2, was extracted accurately, capturing the core concepts effectively.

Figure 2: Knowledge graph of the Regular Expressions section from Speech and Language Processing. Extracted nodes include Regular Expressions, Concatenation, Square Brackets, Kleene Star, Period, Anchors, Disjunction, Precedence, Word Boundary, Substitution, Question Mark, Kleene Plus, Parenthesis, Greedy Matching, Capture Group, Non-Greedy Matching, and Lookahead Assertion.

However, a noticeable limitation was the loss of the original sequencing of the concepts presented in the textbook. While the key ideas were identified, the pedagogical flow, which is essential for gradual learning, was somewhat disrupted in the extraction process.

For the other sections, including Tokenization and Edit Distance, the model extracted only the most prominent concepts, omitting many important details. As a result, these sections are less comprehensive than they need to be for in-depth understanding. Despite this, the overall connections between sections in the knowledge graph were logically structured, showing that the system was still able to create a coherent representation of the material at a high level.

It is important to note that this textbook is significantly more information-dense and longer compared to the Dive into Deep Learning book. This added complexity exposed some limitations in the current approach, mainly when dealing with texts that require detailed extraction of concepts and their interrelationships. The model's ability to handle such dense material is limited by its tendency to focus on top-level ideas while losing much of the depth and sequencing provided in the source text. Additionally, there were rare occasions where the output required manual interventions to fix inconsistent formatting of the Prolog variable names.
6 Discussion

Our approach to extracting local world models from educational texts demonstrated strong performance in generating logically coherent knowledge graphs from high-level concepts, but certain limitations were identified. The synthetic data generation effectively captured core concepts from both textbooks, particularly in structuring major branches such as Linear Algebra, Calculus, and Probability from Dive into Deep Learning. However, some restructured sections, while logical, differed significantly from the source material's flow.

In the Speech and Language Processing textbook, the Regular Expressions subsection was extracted with sufficient accuracy. Other sections, such as Tokenization and Edit Distance, suffered from detail omissions, where only top-level concepts were extracted. This issue was more prominent due to the higher information density of the NLP textbook, exposing limitations in handling detailed, densely packed content.

Regarding the evaluation framework, the model generally performed well on metrics like accuracy and consistency but struggled with completeness in more detailed sections. The model's tendency to restructure content logically, though sometimes deviating from the original, suggests that while it captures core relationships, further refinements are needed to preserve pedagogical flow and details.

6.1 Potential improvements

To address the limitations, improving the prompt engineering could lead to more detailed extractions while maintaining the structure of the source material. Additionally, enhancing the model's ability to handle complex, dense information would mitigate the loss of key concepts. Future iterations may benefit from automated post-processing checks to ensure logical consistency and reduce manual interventions. Overall, while the approach shows promise, refining it to handle finer details and complex sequences more effectively will be essential for broader applications.

7 Conclusion and Future work

In this paper, we proposed a novel approach to extracting local world models from educational texts by generating structured Prolog representations. Our methodology demonstrated the ability to capture core concepts and their interrelationships in a logical and coherent manner, especially in the Dive into Deep Learning textbook. However, the results from the more information-dense Speech and Language Processing text revealed limitations, particularly in handling detailed content and large knowledge graphs, as well as in preserving pedagogical flow.

The use of Prolog proved effective in organizing educational material, allowing for structured reasoning and enabling applications in AI-driven educational tools. Despite these successes, certain challenges remain, such as the omission of detailed concepts and the system's occasional tendency to deviate from the original sequence of topics.

Future work will address these limitations by improving the prompt engineering and enhancing the system's ability to handle complex, information-dense material. Additionally, we plan to explore automating the segmentation process and scaling up the model to generate larger, more intricate knowledge graphs. Other potential directions include integrating retrieval-augmented generation [7] to enrich knowledge extraction and comparing generated world models across different texts to evaluate their pedagogical alignment. Self-evaluation and correction mechanisms could also be introduced to improve accuracy and completeness.
Acknowledgments

This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 project Humane AI Net (Grant No. 952026).

References

[1] Josh Achiam et al. "GPT-4 Technical Report". In: arXiv preprint arXiv:2303.08774 (2023).
[2] Hamed Babaei Giglou, Jennifer D'Souza, and Sören Auer. "LLMs4OL: Large language models for ontology learning". In: International Semantic Web Conference. Springer, 2023, pp. 408–427.
[3] Tom Brown et al. "Language Models are Few-Shot Learners". In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901.
[4] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. Online manuscript released August 20, 2024. URL: https://web.stanford.edu/~jurafsky/slp3/.
[5] Vamsi Krishna Kommineni, Birgitta König-Ries, and Sheeba Samuel. "From human experts to machines: An LLM supported approach to ontology and knowledge graph construction". In: arXiv preprint arXiv:2403.08345 (2024).
[6] Douglas B. Lenat. "CYC: A large-scale investment in knowledge infrastructure". In: Communications of the ACM 38.11 (1995), pp. 33–38.
[7] Patrick Lewis et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks". In: Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474.
[8] Fabian Neuhaus. "Ontologies in the era of large language models – a perspective". In: Applied Ontology 18.4 (2023), pp. 399–407.
[9] Shirui Pan et al. "Unifying large language models and knowledge graphs: A roadmap". In: IEEE Transactions on Knowledge and Data Engineering (2024).
[10] Mohammad Javad Saeedizade and Eva Blomqvist. "Navigating Ontology Development with Large Language Models". In: European Semantic Web Conference. Springer, 2024, pp. 143–161.
[11] Jan Wielemaker et al. "SWI-Prolog". In: Theory and Practice of Logic Programming 12.1-2 (2012), pp. 67–96.
[12] Aston Zhang et al. Dive into Deep Learning. https://D2L.ai. Cambridge University Press, 2023.
[13] Pei Zhou et al. "How FaR Are Large Language Models From Agents with Theory-of-Mind?" In: arXiv preprint arXiv:2310.03051 (2023).

Semantic video content search and recommendation

Mark David Longar (Jožef Stefan Institute, Ljubljana, Slovenia), Jakob Fir (University of Ljubljana, Ljubljana, Slovenia), Bor Pangeršič (University of Ljubljana, Ljubljana, Slovenia)
All authors have contributed equally.
https://doi.org/10.70314/is.2024.sikdd.10

Abstract

The rapid growth of video streaming platforms has intensified the demand for personalized content recommendations. However, current solutions often rely on historical user data, leading to challenges like the cold start problem and overlooking users' immediate preferences. We present a conversational recommendation system that leverages large language models (LLMs) to generate keyword-based content and query descriptions. By integrating Retrieval-Augmented Generation (RAG), our system efficiently retrieves relevant content, independent of prior user interactions, and ensures consistent performance across languages. Preliminary testing shows our system outperforms the RAG baseline by up to 24% on less descriptive queries and demonstrates consistent performance across three languages. While the results are promising, further evaluation focusing on user interaction and satisfaction is necessary. Our approach can potentially be extended to other recommendation systems, offering broader applicability and enhanced content personalization.

Keywords

large language models, recommendation system, search system, retrieval augmented generation
1 Introduction

The surge in video streaming platforms has accelerated the demand for personalized content recommendations. As these platforms expand their libraries and user bases, the challenge of delivering precise, user-specific recommendations intensifies. In this dynamic environment, streaming services must quickly adapt to provide accurate recommendations, which are crucial for maintaining user engagement and ensuring satisfaction.

Existing recommendation systems primarily rely on historical user interaction data, such as viewing history and ratings. This dependence leads to significant challenges, such as the cold start problem, where new users or newly added content lack sufficient data for accurate recommendations. Additionally, these systems often fail to account for users' immediate preferences, which can change dynamically due to various factors such as mood, viewing context (e.g., watching alone or with a group), or recent events in the user's life. This gap highlights the need for more adaptive and responsive recommendation mechanisms.

Recent advancements in Large Language Models (LLMs) present an opportunity to address these limitations. LLMs offer significant potential due to their emergent reasoning abilities, their capacity to extract high-quality representations of textual features, and their ability to leverage the vast external knowledge encoded within them [10], [7]. By harnessing LLMs, it is possible to create a recommendation system that interacts with users to capture their immediate preferences, thereby overcoming the cold start problem and enhancing the relevance of recommendations. Additionally, ensuring consistency in the quality of recommendations across different languages is increasingly important as many streaming services operate globally.

Our approach utilizes LLMs to generate keyword descriptions for both content and user queries. These keywords serve as the basis for recommendations, with a Retrieval-Augmented Generation (RAG) [6] model efficiently retrieving relevant content. By crafting query keywords using LLMs, the system adapts to user preferences in real time, providing relevant and language-consistent recommendations.

This paper makes the following contributions: (1) Development of a Keyword-Based Recommendation System: we introduce a novel approach that utilizes LLMs to generate keyword-based descriptions for content and user queries, enabling more personalized and adaptive recommendations. (2) Exploration of Two User Interaction Models: we propose and evaluate two distinct interfaces for user interaction, a conversational chat-based model and a structured question-answering model, where the system refines recommendations through a series of targeted yes/no questions generated by the LLM. (3) Comprehensive Evaluation Strategy: we outline a detailed plan for evaluating the system's performance in a production environment, focusing on its ability to deliver consistent, high-quality recommendations across different languages and user contexts.
2 Related Work

Recommender systems have progressed from techniques such as collaborative filtering and matrix factorization to more complex models that incorporate deep learning. The advent of large language models (LLMs) has enabled innovative methods for interacting with these systems [11], particularly when combined with retrieval techniques [9]. One of the most promising advancements in this area is the use of Retrieval-Augmented Generation (RAG) models, which integrate the powerful text generation capabilities of LLMs with retrieval-based methods to improve recommendation accuracy and relevance [6].

Recent advancements in conversational recommender systems have focused primarily on integrating LLMs with traditional recommender systems or fine-tuning LLMs using user-item interaction data [9], [10], e.g., [8], [4], and [5]. These approaches, while effective, often rely heavily on historical user data, leading to challenges such as the cold start problem. This reliance underscores the need for novel methods that reduce dependency on past interactions and leverage real-time retrieval mechanisms to enhance content recommendations [2].

To address these challenges, recent work by Di Palma et al. (2023) [2] introduced a Retrieval-Augmented Recommender System, which combines the strengths of LLMs and retrieval-based methods. Their approach employs LLMs both at the conversational layer and in the backend retrieval process, thereby improving recommendation relevance, particularly in scenarios with sparse data or new users. Their experimental results demonstrated that this RAG-based framework performs comparably to state-of-the-art systems, even in zero-shot scenarios, underscoring the potential of such an approach to mitigate the cold start and hallucination problems inherent in LLMs.

Our approach builds on the strengths of RAG-based models by introducing a keyword-based recommendation system that operates within a RAG framework. This system ensures consistent performance across multiple languages and adapts to real-time user preferences without relying on historical user data.

3 Data

The data used in this study was provided by our partner United Cloud, who operate a multinational streaming service in the Balkan region, EON TV (no EON user data was used). The EON platform encompasses a variety of content, such as video-on-demand (VOD) movies and TV shows, as well as live TV channels. We focused exclusively on VOD movie data, although our approach is capable of accommodating multiple content types.

The VOD movies data set comprises nearly 5000 movies in various languages. Each movie is accompanied by a brief description averaging around 460 characters (5-6 sentences) in multiple languages. In cases where multiple translations were available, we opted for the original language of the movie; otherwise, we chose the first available translation.

4 Methodology

4.1 Recommendation Mechanism

The core of our recommendation system is the generation of textual representations of content. Instead of using movie descriptions directly, we employ the LLM to generate a set of English keywords and related movies. This approach prevents the model from overemphasizing less relevant details, such as specific plot points, that may not be central to the user's query. User queries follow a similar approach, where the LLM generates a set of relevant keywords, as well as any possibly relevant movies.
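A minimal sketch of this keyword-generation step is shown below, using the OpenAI Python client with the GPT-4o model named in Table 1. The prompt wording is an assumption for illustration; the paper does not publish its exact prompt.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = (
        "Summarize the following text as 10-15 English keywords capturing "
        "genre, themes, mood, and setting, plus up to 3 related movie titles. "
        "Return a single comma-separated list."
    )  # illustrative wording, not the paper's actual prompt

    def describe(text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": PROMPT},
                      {"role": "user", "content": text}],
        )
        return resp.choices[0].message.content

    movie_keywords = describe("A young footballer rises from ...")      # per movie, offline
    query_keywords = describe("soccer movies that will inspire me")     # per query, online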
One of the key advantages of this method is its ability to abstract core concepts from user queries using the LLM, aligning better with the keywords generated from movie descriptions. The LLM-generated keywords from both the movie descriptions and the user queries are designed to encapsulate the essential topics and themes. By aligning the keywords generated from movie descriptions with those derived from user queries, our system enhances the relevance of the recommendations. This alignment is crucial in ensuring that the retrieved movies resonate with the user's expressed interests, even when these interests are not articulated well. Furthermore, the use of in-context learning allows the system to maintain its performance without extensive fine-tuning [3], making it both efficient and effective.

The rest of the recommendation system follows the Retrieval-Augmented Generation (RAG) [6] pipeline (see Figure 1). The RAG pipeline operates by first generating textual representations of movies, which are then embedded into a vector space. These embeddings are stored in a vector database, allowing for efficient similarity searches. When a user submits a query, the system generates a corresponding representation, embeds it into the same vector space, and retrieves the top k most similar movie embeddings from the database. This process ensures that the recommendations are both contextually relevant and semantically aligned with the user's input.

Figure 1: Overview of the Recommendation Pipeline.
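The retrieval half of the pipeline can be sketched as follows, using the text-embedding-3-large model named in Table 1 and plain cosine similarity in place of a dedicated vector database. The keyword strings are placeholders standing in for the output of the previous sketch.

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    # Keyword strings produced per movie by the previous step (placeholders).
    movie_keyword_strings = [
        "inspirational, sports, documentary, football, biography, Lionel Messi",
        "space, survival, drama, astronaut, isolation, rescue mission",
    ]

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
        return np.array([d.embedding for d in resp.data])

    movie_vecs = embed(movie_keyword_strings)
    movie_vecs /= np.linalg.norm(movie_vecs, axis=1, keepdims=True)

    def recommend(query_keywords, k=10):
        q = embed([query_keywords])[0]
        q /= np.linalg.norm(q)
        scores = movie_vecs @ q           # cosine similarity
        return np.argsort(-scores)[:k]    # indices of the top-k movies

In production, the precomputed movie vectors would live in a vector database rather than in memory, but the ranking logic is the same.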
4.2 User Interface

Our proposed user interface designs (see Figure 2) offer two main ways for users to interact with our recommendation core. Besides a direct search, where the user submits a query and receives recommendations in a single step, we propose: (a) a chatbot, which assists users in narrowing down their options through a conversational interface. The chatbot provides recommendations at each response, allowing for a multi-step interaction that refines the search results progressively. (b) An inquisitive method, where an agent asks the user a series of Yes/No questions to narrow down the search. Keywords are generated based on the user's responses, making it particularly useful for users who are uncertain about what they want to watch. This approach shifts the burden of knowing what to query from the user to the system, streamlining the recommendation process.

Each of these designs aims to enhance user engagement and satisfaction by providing tailored interactions that cater to different user preferences and needs.

5 Evaluation

We have developed a twofold approach for evaluating our model.

First, to gauge the effectiveness of our keyword-based approach for recommendation, we curated a small multilingual evaluation dataset to test our core recommendation mechanism. This dataset includes queries in various languages along with their expected recommendations. We compared the performance of our mechanism with a baseline RAG system that directly embedded user queries and movie descriptions.

Second, to assess the efficiency and user satisfaction of our system in real-world situations, we have devised an evaluation plan to test our system in production. This strategy utilizes a structured A/B testing framework to conduct precise comparisons between our semantic recommendation system and conventional search, addressing distinct aspects of user experience and system performance.

5.1 Evaluation dataset

To create our evaluation dataset, we carefully selected 25 movies across multiple languages, including both well-known and lesser-known titles. For each movie, we formulated two types of queries to assess the system's retrieval accuracy: Descriptive and General queries.

The Descriptive queries were designed to simulate scenarios where the user knows exactly what they are looking for. For instance, a query for the movie Messi (2014) might be, "I am looking for inspirational documentaries about famous athletes, such as Lionel Messi and his rise through football." In contrast, the General queries were intended to test situations where the user has only a rough idea of what they want to watch, which is likely more common in real-world environments. An example of a general query for the same movie might be, "soccer movies that will inspire me."

To evaluate the system's performance across different linguistic contexts, we manually translated these queries into English, Serbian, and Slovenian. We then compared the performance of our keyword-based retrieval mechanism against a baseline RAG model that directly used user queries and movie descriptions without generating keywords.
5.2 Experiment Design

We have divided our user base into four distinct groups to facilitate a detailed comparative analysis, aligned with our proposed user interface designs:

Baseline Group: This control group doesn't use our system; instead, users find movies and receive recommendations through traditional recommendation methods, a common practice in the industry.

Direct Semantic Search Group: This control group interacts with a straightforward search interface. Users submit a query and receive recommendations in a single step. This approach provides immediate suggestions based on the user's input, mimicking traditional full-text search practices.

Chatbot Group: Participants in this treatment group use a conversational interface (interface a), where a chatbot assists in narrowing down options. The chatbot provides recommendations at each response, enabling a multi-step interaction that progressively refines the search results. This design enhances engagement by simulating a natural conversation.

Inquisitive Method Group: Users in this group engage with an agent that asks a series of Yes/No questions to narrow down the search (interface b). Keywords are generated based on the user's responses.

The evaluation will be conducted continuously, starting with a focused initial phase over the first month post-implementation to address immediate usability and performance issues, followed by ongoing monitoring to capture long-term user engagement and satisfaction. By implementing this structured evaluation framework, we aim to comprehensively understand the impact and effectiveness of our semantic recommendation system, guiding further refinements and ensuring that the system meets user needs and expectations.

5.2.1 Metrics

We would like to measure how users interact with our system in two main ways. First, we would like to know how engaged and satisfied they are with our recommendations, i.e., whether users find our system frustrating to navigate and whether they watch movies recommended by our system. The second set of metrics aims to capture how different demographics interact with our system, as a major goal is to remove any biases such as language or age.

Engagement and Satisfaction Metrics: These include Click-Through Rate (CTR), which measures the percentage of clicked recommendation links, and Watch Time, which gauges the duration users engage with recommended content. Additionally, immediate user reactions are captured through Like/Dislike Ratios, while more detailed user feedback is collected via surveys administered after interactions.

Behavioral Metrics: We analyze User Interaction Patterns, such as search frequency and refinement actions, and System Usage Frequency to determine how different demographics utilize the system and to identify any potential biases in system engagement. We also record the search time and the number of queries needed for a decision.

6 Results

The outcomes presented in Table 1 showcase the performance of both models across query types and languages, as measured by accuracy at the top 5 and top 10 recommendations.

The results reveal that the baseline model surpasses (or matches) the performance of the keyword mechanism on Descriptive queries, particularly in terms of Accuracy@5. In terms of Accuracy@10, however, the two models demonstrate relatively similar performance. Conversely, the keyword model shows significant performance enhancements for General queries, particularly in Accuracy@10, indicating its capacity to adapt to non-specific content descriptions. Additionally, the keywords model consistently performs well across different languages, whereas the baseline model shows fluctuations of up to 28% across languages.

In summary, the keywords model allows for more general and multilingual queries, while the baseline model excels at retrieving very specific content.

Table 1: Evaluation results on the descriptive and general queries data sets. LLM embeddings were generated using OpenAI's text-embedding-3-large model. The Keywords model used GPT-4o.

                          Accuracy@5              Accuracy@10
                      Keywords  Baseline      Keywords  Baseline
  Descriptive Queries
    English              60%       64%           68%       68%
    Serbian              56%       80%           72%       84%
    Slovenian            56%       80%           72%       84%
  General Queries
    English              44%       28%           68%       44%
    Serbian              44%       52%           68%       52%
    Slovenian            44%       56%           72%       56%
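Under one natural reading of this metric, where each evaluation query has a single expected title, Accuracy@k is the fraction of queries whose expected movie appears among the top-k results. A minimal sketch, assuming a `recommend` function that returns ranked movie IDs (e.g., as in the retrieval sketch above):

    def accuracy_at_k(queries, recommend, k):
        """queries: list of (query_text, expected_movie_id) pairs.
        recommend: callable returning a ranked list of movie IDs."""
        hits = sum(expected in recommend(query)[:k] for query, expected in queries)
        return hits / len(queries)

    # Example: evaluate the general queries in one language.
    # score_at_5 = accuracy_at_k(general_queries_sl, recommend, k=5)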
6.1 User Interface Implementation

We implemented our proposed interface design using Flutter, which guarantees functionality across a variety of devices, including iOS, Android, Windows, and web browsers. This cross-device compatibility is crucial, as it ensures that all users, regardless of their preferred platform, have access to our application. The support for mobile devices is particularly useful in our interrogation design, where users can easily navigate through options by swiping cards left or right.

Additionally, we integrated Tipko [1], a Slovenian transcription service, to facilitate voice-to-text capabilities. This feature enhances user convenience by enabling voice communication with our chatbot, removing the necessity for typing.

Figure 2: Implementations of our (a) Chatbot (left) and (b) Inquisitive (right) user interface designs.

7 Discussion

This report introduces a new content recommendation mechanism and three ways to interact with it. Table 1 demonstrates the success of our keyword retrieval model in understanding general user preferences while still performing well when searching for specific content. Moreover, its consistency across languages and its ability to retrieve content using specific descriptions as well as general themes make it well-suited for a diverse user base.

Additionally, the keyword model allows seamless integration with both the Chatbot and Inquisitive methods. Moreover, our system could be extended to dynamically adjust keyword generation based on user-specific factors such as viewing history, local time, weather, and current mood indicators. This personalization ensures that the recommendations are not only relevant to the content but also tailored to the user's immediate context and preferences.

Our approach has some limitations, including the cost per query, which is higher than for traditional search, although not exorbitant. Furthermore, our model's performance is commendable given our limited knowledge about the movie content, but it relies on the assumption that the language model may have more information about a movie than our dataset. It is worth noting that, in the short term, models appear to be continually improving, becoming faster, more knowledgeable, and more cost-effective. Lastly, as with any chat application that involves user inputs, security is a crucial consideration. While improvements can be made through better prompting and fine-tuning, ongoing monitoring is essential when the system is in production.

8 Future work

In future work, we plan to further explore methods for improving user experience and personalization. Our initial experiments have involved incorporating the user's time, location, and weather to enhance results. Moving forward, we aim to explore additional integrations, such as the user's calendar. We also intend to expand our user interface by introducing new forms of interaction, such as movie trailers and multiple-choice questions.

To overcome the limitations of our movie information, we are interested in delving deeper into the content by analyzing subtitles using a local language model. Additionally, we aim to broaden our database to include other types of content, such as live channel content and special time-limited events like Eurovision, Eurobasket, and the FIFA World Cup.

Finally, we are interested in integrating traditional recommendation models that utilize historical watch data or ratings to re-rank our recommendations.

Acknowledgments

This project was made in collaboration with United.Cloud and In516ht for the 2024 Data Science Competition, organized by the Faculty of Computer and Information Science at the University of Ljubljana. We thank our advisors Slavko Žitnik, Aljaž Košmerlj, Klementina Pirc, and Rebeka Merhar for their contributions.

References

[1] Primož Bratanič. Transkript app | Samodejna transkripcija slovenskega govora [Automatic transcription of Slovenian speech]. May 2024. URL: https://transkript.si/.
[2] Dario Di Palma. "Retrieval-augmented recommender system: Enhancing recommender systems with large language models". In: Proceedings of the 17th ACM Conference on Recommender Systems. 2023, pp. 1369–1373.
[3] Elnara Galimzhanova et al. "Rewriting Conversational Utterances with Instructed Large Language Models". Oct. 2023. DOI: 10.1109/wi-iat59888.2023.00014.
[4] Yunfan Gao et al. "Chat-REC: Towards interactive and explainable LLMs-augmented recommender system". In: arXiv preprint arXiv:2303.14524 (2023).
[5] Xu Huang et al. "Recommender AI agent: Integrating large language models for interactive recommendations". In: arXiv preprint arXiv:2308.16505 (2023).
[6] Patrick Lewis et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks". In: Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474.
[7] Peng Liu, Lemei Zhang, and Jon Atle Gulla. "Pre-train, Prompt, and Recommendation: A Comprehensive Survey of Language Modeling Paradigm Adaptations in Recommender Systems". In: Transactions of the Association for Computational Linguistics 11 (2023), pp. 1553–1571.
[8] Zihan Liu et al. "ChatQA: Building GPT-4 Level Conversational QA Models". In: arXiv preprint arXiv:2401.10225 (2024).
[9] Arpita Vats et al. "Exploring the Impact of Large Language Models on Recommender Systems: An Extensive Review". In: arXiv preprint arXiv:2402.18590 (2024).
[10] Likang Wu et al. "A survey on large language models for recommendation". In: World Wide Web 27.5 (2024), p. 60.
[11] Bowen Zheng et al. "Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation". In: 2024 IEEE 40th International Conference on Data Engineering (ICDE). 2024, pp. 1435–1448. DOI: 10.1109/ICDE60146.2024.00118.

Continuous Planning of a Fleet of Shuttle Vans as Support for Dynamic Pricing

Filip Stavrov (stavrovf@gmail.com) and Luka Stopar (luka.stopar@ijs.si)
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
Both authors contributed equally to this research.
https://doi.org/10.70314/is.2024.sikdd.27

ABSTRACT

This paper addresses the problem of estimating the number and type of resources required for the pickup and delivery of passengers at some time in the future. By combining optimization and sampling methods, and by making plans based on several statistical samples, we estimate the real values for the required resources and show how the sample values converge towards the real values. Our approach combines machine-learning based demand predictions for the number of passengers with a route optimization engine that assigns the passengers to shared shuttle vehicles. To validate our method, we create baseline data that is representative of the real values. We test our approach using this baseline data and obtain statistically significant results.
KEYWORDS

statistical samples, demand predictions, route optimization engine, sampling techniques, optimization technique

1 INTRODUCTION

The effective allocation of resources is a critical topic in the mobility industry. Anticipating the number and type of resources required can significantly enhance a company's ability to plan accurately for the future. Our work addresses this challenge by focusing on how to estimate the number and type of vehicles needed for passenger pickup and delivery at a future time. The input to our problem consists of machine learning-based demand predictions, which provide estimates of the number of passengers across the various routes offered by the company. These predictions are provided daily and further broken down into hourly estimates for each day.

Once we receive these predictions, our goal is to simulate reservations based on this data. For instance, if the predictions indicate that 12 passengers will travel from Ljubljana to Koper on October 20, 2024, we would simulate reservations using sampling techniques; one particular example is creating four separate bookings: one for five passengers, one for three, and two for two passengers each. We introduce the sampling techniques used in this process in greater detail later on.

After generating these reservations, the next step is to input them into the Route Optimization Engine to generate a plan for that day. This plan specifies the number of vehicles required and the specific reservations each vehicle will serve.

The main hypotheses that our approach explores and experimentally tests are the following:
• H1: We can accurately estimate the number of required resources using optimization methods based on predicted passenger numbers.
• H2: Monte Carlo sampling of historical distributions can effectively model uncertainty in demand predictions, leading to stable resource estimations.
• H3: Creating plans based on several sample values will converge towards the actual number of required resources.

On the other hand, the key assumptions and limitations that underline our research are:
• Prediction Accuracy: We assume that the predictions effectively estimate the number of future passengers.
• Passenger Distribution: We assume that the number of passengers follows a Poisson distribution and that the distributions on different routes are independent.
• Independence: We assume that the passenger distribution and the window type distributions are independent of each other.
• Concept Drift: We assume there is no concept drift in the data, meaning the underlying data patterns do not change over time.
2 RELATED WORK

The problem of resource allocation in the mobility industry, particularly in the context of vehicle routing and passenger demand prediction, has been studied extensively. Traditional methods for vehicle routing often rely on static models that assume known and deterministic demand. However, recent advances in machine learning and optimization have enabled more dynamic approaches that can account for uncertainty and variability in demand [3][4]. For instance, predictive analytics has been employed to forecast passenger demand from historical data, and the forecasts can then be fed into optimization algorithms to determine the optimal allocation of vehicles. Monte Carlo simulation is another technique commonly used to model uncertainty in demand predictions, providing a probabilistic framework for decision-making under uncertainty [2]. Moreover, dynamic vehicle routing approaches have demonstrated the benefits of real-time adjustments to routing plans based on updated demand information [1]. The integration of these methodologies into a continuous planning framework is relatively novel and addresses the limitations of static planning approaches, particularly in highly variable and uncertain environments [1][5].

3 METHODOLOGY

Our methodology begins with demand predictions for the number of passengers, and the ultimate goal is to determine the number and type of vehicles required, as well as the reservations each vehicle will serve. Figure 1 provides a detailed overview of this process.

[Figure 1. Methodology]

Starting with the demand predictions, we apply sampling techniques to simulate reservation data. Specifically, we take the predicted number of passengers for different routes at various times and generate reservations through sampling. This reservation data follows a specific format, including fields such as ID, start location, end location, pickup time, and more. Key attributes include the number of passengers per reservation and the window type, which reflects travel preferences: some passengers may prefer a private vehicle (VIP), while others are open to sharing the ride. Additionally, the window interval is crucial; it can be a specific time or a more flexible period, affecting both the service pricing and the overall experience. These factors will be incorporated into the dynamic pricing model later on.

The process thus begins with demand predictions and culminates in the generation of reservation data. The critical steps are sampling the number of passengers per reservation, the window type, and the window length. Sampling is done from probabilistic distributions derived from historical data; the distributions are illustrated in Figures 2–4.

[Figure 2. Window type distribution]
[Figure 3. Window length distribution]
[Figure 4. Number of passengers distribution]

Please note that from a single demand prediction input file we generate 100 independent samples of reservation data; this is how the approach propagates the uncertainty introduced by probabilistic sampling. Each independent sample is then submitted as a separate job to the Route Optimization Engine, which solves a vehicle routing problem with time constraints. The output of each job is a plan corresponding to the reservation data. Our final objective is to aggregate these results and analyze the insights they provide.
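To make the sampling step concrete, the sketch below splits a predicted passenger total into individual reservations and draws the window attributes from empirical distributions. The group-size and window distributions are invented placeholders for those in Figures 2–4, and the reservation fields are simplified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical empirical distributions estimated from historical bookings
# (placeholders for the actual distributions in Figures 2-4).
GROUP_SIZES, GROUP_P = [1, 2, 3, 4, 5], [0.35, 0.30, 0.15, 0.12, 0.08]
WINDOW_TYPES, WINDOW_P = ["exact", "flexible", "vip"], [0.5, 0.4, 0.1]
WINDOW_LENGTHS_H = [1, 2, 4]  # hours, sampled uniformly here for simplicity

def sample_reservations(total_passengers, origin, dest, date):
    """Split a predicted passenger total into individual reservations."""
    reservations, remaining = [], total_passengers
    while remaining > 0:
        size = min(int(rng.choice(GROUP_SIZES, p=GROUP_P)), remaining)
        reservations.append({
            "from": origin, "to": dest, "date": date,
            "passengers": size,
            "window_type": str(rng.choice(WINDOW_TYPES, p=WINDOW_P)),
            "window_hours": int(rng.choice(WINDOW_LENGTHS_H)),
        })
        remaining -= size
    return reservations

# One of the 100 independent samples for the example route.
plan_input = sample_reservations(12, "Ljubljana", "Koper", "2024-10-20")
print(plan_input)
```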
4 RESULTS

After solving all 100 jobs, we obtained 100 independent plans and began analyzing the results. As shown in Figure 5, the distribution of the number of passengers yielded a mean value of 325.87 with a standard deviation of 16.85. For the number of vehicles, the mean was 38.01 with a standard deviation of 3.06. Notably, the passenger data exhibit significantly more variance than the vehicle data. This is expected: passengers are grouped into visits, and visits are then allocated to vehicles, resulting in less variation in the vehicle count.

[Figure 5. Sampled data: visits, vehicles and passengers distributions]

To further validate our approach, we created a baseline using the same data from which the demand predictions were generated. We generated 100 samples from this baseline and submitted them as independent jobs. Upon completion, we compared the baseline results with those of our sampled data. The mean number of vehicles in the baseline was 37.81 with a standard deviation of 3.01, which closely aligns with the values from our sampled data. The comparison is shown in Figure 6.

[Figure 6. Comparison of required vehicles between sampled and baseline data]

We also analyzed the error distribution of the number of vehicles between the baseline and sampled data, finding a mean absolute error of 3.16. This suggests that the difference between the two sets is minor, considering that the values are sampled, and indicates a good alignment. While the mean absolute error reflects some variability in the sampled values, this is acceptable given the overall similarity of the means. Thus, despite the variance, the sampled values converge towards the actual values. The error distribution is displayed in Figure 7.

[Figure 7. Required vehicles - error distribution]

To test statistically whether the sampled and baseline data have the same mean number of vehicles, we conducted Welch's t-test. The results showed a test statistic of 0.59, a p-value of 0.55, and a 95% confidence interval for the mean difference ranging from -0.64 to 1.23 vehicles. Given the p-value, we fail to reject the null hypothesis: there is no statistically significant difference between the sampled and baseline vehicle counts. Additionally, the confidence interval falls within our practical significance threshold of up to 2 vehicles, further supporting the similarity between the two datasets. This indicates that we can effectively estimate the number of required resources by applying optimization techniques on top of the demand prediction values.
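The test can be reproduced along the following lines with SciPy. The two arrays are synthetic stand-ins generated with the reported means and standard deviations, so the resulting numbers only approximate those above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Stand-ins for the 100 sampled and 100 baseline vehicle counts
# (in the paper these come from the optimization jobs).
sampled = rng.normal(38.01, 3.06, size=100)
baseline = rng.normal(37.81, 3.01, size=100)

# Welch's t-test: equal_var=False allows unequal variances.
t_stat, p_value = stats.ttest_ind(sampled, baseline, equal_var=False)

# 95% confidence interval for the mean difference (Welch-Satterthwaite df).
v1 = sampled.var(ddof=1) / sampled.size
v2 = baseline.var(ddof=1) / baseline.size
df = (v1 + v2) ** 2 / (v1 ** 2 / (sampled.size - 1) + v2 ** 2 / (baseline.size - 1))
half_width = stats.t.ppf(0.975, df) * np.sqrt(v1 + v2)
diff = sampled.mean() - baseline.mean()
print(t_stat, p_value, (diff - half_width, diff + half_width))
```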
We also analyzed the mean number of sampled vehicles and observed that it converges toward the actual value as the number of samples increases, as shown in Figure 8.

[Figure 8. Convergence of means of sampled vehicles]

Finally, after obtaining both the number of passengers and the number of vehicles, we fitted a linear regression to explore whether we could simplify the process and avoid the detailed approach described above. As illustrated in Figure 9, the regression line serves as a reasonable estimator of the number of vehicles given the number of passengers. However, this model struggles to capture the non-linear relationships induced by the various optimization types, window lengths, and travel modes, resulting in considerable variance around the regression line. While it is generally true that a higher number of passengers correlates with an increased number of vehicles, this relationship can be misleading: different travel types can accommodate different numbers of passengers per vehicle, which disrupts the linear relationship, especially where such travel types dominate. Consequently, although the linear regression provides a solid approximation, it overlooks essential non-linear factors that are critical to our analysis. Our approach, which integrates these factors, proves more robust and effective. The linear regression line and the data correlation are presented in Figure 9.

[Figure 9. Regression Analysis]

5 CONCLUSION

In conclusion, our findings demonstrate that we can effectively estimate the number of required resources by employing optimization methods based on predicted passenger numbers. As the number of samples increases, the sampled values consistently converge toward the actual resource requirements, reinforcing the reliability of our approach. Alternative methods, such as linear regression, fail to adequately address the non-linear complexities inherent in resource allocation, such as varying optimization types and window lengths. Our method, which incorporates these factors, proves to be a far more accurate and effective solution for resource estimation in the mobility industry.

ACKNOWLEDGMENTS

Our research is part of a broader, multi-partner initiative called CONDUCTOR. The primary objective of this project is to design, integrate, and demonstrate advanced, high-level traffic and fleet management systems. These systems aim to optimize the transport of passengers and goods efficiently on a global scale, ensuring seamless multimodality and interoperability. The CONDUCTOR project is co-funded by the European Union's Horizon Europe research and innovation programme under Grant Agreement No 101077049.

REFERENCES
[1] Berbeglia, G., Cordeau, J. F., & Laporte, G. (2010). Dynamic pickup and delivery problems. Transportation Research Part B: Methodological, 44(5), 667-684. https://doi.org/10.1016/j.trb.2009.10.004
[2] Ulmer, M. W., Thomas, B. W., & Mattfeld, D. C. (2018). Preemptive depot returns for same-day delivery under uncertain customer availability. European Journal of Operational Research, 269(2), 356-371. https://doi.org/10.1016/j.ejor.2017.08.008
[3] Bertsimas, D., & Sim, M. (2004). The Price of Robustness. Operations Research, 52(1), 35-53. https://doi.org/10.1287/opre.1030.0065
[4] Ghiani, G., Guerriero, F., Laporte, G., & Musmanno, R. (2003). Real-time vehicle routing: Solution concepts, algorithms and parallel computing strategies. European Journal of Operational Research, 151(1), 1-11. https://www.sciencedirect.com/science/article/abs/pii/S0377221702009153
[5] Psaraftis, H. N., Wen, M., & Kontovas, C. A. (2016). Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1), 3-31. https://doi.org/10.1002/net.21628

Knowledge graph Extraction from Textual data using LLM

Khasa Gillani (khasagillani22@gmail.com), Jožef Stefan Postgraduate School, Ljubljana, Slovenia
Erik Novak (erik.novak@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Klemen Kenda (klemen.kenda@ijs.si), Jožef Stefan Institute and Qlector, Ljubljana, Slovenia
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan Postgraduate School, Ljubljana, Slovenia

ABSTRACT

The advent of Large Language Models (LLMs), such as ChatGPT and GPT-4, has revolutionized natural language processing, opening avenues for advanced textual understanding. This study explores the application of LLMs in developing Knowledge graphs from textual data. Knowledge graphs offer a structured representation of information, facilitating enhanced comprehension and utilization of unstructured text. We construct Knowledge graphs that capture relationships and entities within diverse textual datasets by harnessing LLMs' contextual understanding and language generation capabilities. The primary goal is to explore and understand how well LLMs can identify and extract relevant entities and relationships from textual data using prompt engineering, while contributing to structured knowledge representation.

[Figure 1: Overview of the proposed approach, where input text is processed through Termboard to generate a structured prompt for the LLM, creating an entity-relation table used to build a Knowledge graph (KG).]
KEYWORDS: Knowledge graph, Large Language Models, prompt engineering, information extraction, textual data

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.15

1 INTRODUCTION

In an era where data is ubiquitous, the efficient organization, retrieval, and interpretation of textual information are crucial. Knowledge graphs, representing facts and relationships in structured form, play a pivotal role in various AI applications, from enhancing search engines to powering recommendation systems. However, the construction of these graphs is often hindered by the complexity and variability of human language. This paper explores the potential of Large Language Models, like GPT-4, to revolutionize this process. By leveraging their advanced natural language understanding capabilities, we aim to automate and refine the extraction of knowledge from textual datasets. The fundamental purpose of this research is to understand the extent to which LLMs can identify and extract relevant entities and relationships from textual data and then build a Knowledge graph from the extracted information.

The motivation behind this study stems from the growing need to effectively manage and utilize the vast amounts of textual data generated daily. Knowledge graphs offer a structured and intuitive way to represent information, but their construction is often labor-intensive and requires expert knowledge. Constructing Knowledge graphs from unstructured text is intricate and depends on sophisticated natural language processing (NLP) methods, including named entity recognition (NER) and relation extraction. The advancement of LLMs like GPT-4 presents an opportunity to automate and improve this process, as illustrated in Figure 1. Utilizing LLMs can lead to more efficient, scalable, and accurate Knowledge graph construction, thereby unlocking new possibilities in information management and AI applications.

2 BACKGROUND

An overview of recent research in Large Language Models and Knowledge graphs is provided in this section, which also emphasizes the potential for their integration.

2.1 Large Language Model (LLM)

Large Language Models are advanced AI systems pre-trained on extensive data, enabling them to comprehend and produce human language. Their recent surge in popularity is due to their proficiency in various language-processing tasks, including text completion, translation, summarization, and question answering. These models, primarily based on the transformer architecture, utilize self-attention mechanisms through encoder and decoder modules. Encoders transform input text into numerical embeddings that reflect context and meaning, while decoders use these embeddings to generate coherent and pertinent textual output. Many large language models feature a decoder-only architecture and thus predict the target output text using only the decoder module. The training paradigm for these models is to predict the next word in the sentence. Generally, large-scale decoder-only LLMs such as ChatGPT [7] and GPT-4 [2] focus on human-like language output, predicting subsequent words based on the preceding text for tasks like text generation.
Table 1: Simplified comparison between Large Language Models (LLMs) and Knowledge graphs (KGs)

Feature | LLM | KG
Knowledge type | Broad, general knowledge | Structured, domain-specific knowledge
Data handling | Flexible, can process varied inputs | Requires structured data
Accuracy | May lack precision in understanding | Highly accurate with structured data
Understanding | Can interpret and generate language | Designed for specific queries and relationships
Adaptability | Adapts to new information by retraining | Adaptable when updated with new data
Transparency | Often seen as "black boxes" with unclear reasoning | Clear decision-making pathways
Error rate | Can make mistakes due to broad generalizations | Can be prone to errors if data is incorrect or missing
Complexity | Handles complex language tasks | Manages complex relationships and attributes
Usage | Broad applications in text generation, translation, etc. | Used for specific tasks like recommendations, search optimization
Scalability | Scales with computational power | Scales with the amount of structured data available

2.2 Knowledge graph (KG)

Knowledge graphs are structured representations of information that depict the relationships between entities in a specific domain. They are used extensively in various applications, such as search engines, recommendation systems, and question-answering systems. These graphs use detailed connections between data items to support reasoning, to make specific information easy to find, and to run knowledge-driven applications; hence, they allow us to better understand and use information across multiple fields. Knowledge graphs provide a structured way of representing interconnected knowledge. They are precise and consistent, aiding decisive and informed decision-making. KGs are particularly valuable for their interpretability and explainability due to the explicit representation of entities and relationships. They can capture domain-specific information accurately and evolve to incorporate new data. However, KGs may suffer from incompleteness and may not always reflect the most recent or unseen facts. They also typically cannot understand natural language in an unstructured format [3][6]. Moreover, KGs are preferred in scenarios where explainability and interpretability are crucial, as they provide structured knowledge representation.
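To make the structured-triple representation concrete, here is a minimal sketch of a Knowledge graph held as (head, relation, tail) triples; the facts below are illustrative toys loosely echoing Figure 1, not data from the paper.

```python
import networkx as nx

# A Knowledge graph as a set of (head, relation, tail) triples,
# stored in a directed multigraph.
triples = [
    ("JSI", "located_in", "Slovenia"),
    ("APRIORI", "hosted_by", "JSI"),
    ("APRIORI", "research_area", "Explainable AI"),
]

kg = nx.MultiDiGraph()
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)

# Structured querying, e.g. every stored fact about APRIORI:
for _, tail, data in kg.out_edges("APRIORI", data=True):
    print("APRIORI", data["relation"], tail)
```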
2.3 Combining LLM and KG

The comparison between Large Language Models and Knowledge graphs (Table 1) can be supported by various references that highlight their respective strengths and weaknesses [4]. Large Language Models like ChatGPT [7] are celebrated for their generalizability and ability to process diverse text data, allowing them to perform various language-related tasks without extensive task-specific training. They can act as reservoirs of general knowledge, aiding information synthesis and research. Their proficiency in language processing is useful in tasks like natural language understanding and sentiment analysis. However, they can suffer from hallucinations, where they generate plausible but factually incorrect information. Their "black-box" nature makes it difficult to understand their internal decision-making processes, and they can be indecisive, producing uncertain responses to ambiguous inputs. Additionally, while they have vast general knowledge, they may not be up to date with domain-specific or the latest information. Critics of LLMs argue that these models lack transparency and interpretability. Recent research efforts [3][4] are, however, improving LLMs' interpretability through techniques like attention mechanisms and model introspection. KGs also present advantages over LLMs by providing knowledge about long-tail entities, thus improving recall for knowledge computing tasks. However, both LLMs and KGs can perpetuate biases present in their training data or construction methodologies. In conclusion, both LLMs and KGs have their unique strengths and challenges. While LLMs excel in general language processing and knowledge extraction from vast corpora, KGs provide a structured and interpretable way to organize explicit knowledge. These differences underscore the potential benefits of integrating LLMs and KGs to create more robust AI systems that leverage the strengths of both approaches.

3 PROOF OF CONCEPT: ANALYSIS AND KNOWLEDGE GRAPH GENERATION

This section demonstrates how we process and analyze textual data to build a Knowledge graph using an LLM. It is important to mention that prompt engineering [5] is of great importance for the results generated by ChatGPT: since it is a generative model, small variations in the input sequence can create large differences in the produced output, as demonstrated below. We use two textual files containing contextual data: (i) the APRIORI proposal (containing project details, job descriptions, potential candidate skills, hosting organizations, etc.) and (ii) the ADRIA Motorhome instruction manual (containing textual as well as tabular data). Moreover, building a KG out of the ADRIA instruction manual has potential applications in the manufacturing industry.

3.1 Using ChatGPT Prompts

We compare the entities and relations extracted by ChatGPT-3.5 and GPT-4 using the same prompts. We use Termboard (https://termboard.com/), which offers customized ChatGPT prompts to create terms, entities, and relations and to visualize larger graphs from the provided text.

Prompt: "Extract an ontology and create a table of relations with 3 columns in this order: source, target, and relation name. Also create a table with 2 columns: put in the first column the name of the term and in the second column an elaborate definition of the term. Use this text as a basis: 'APRIORI'" (the APRIORI text contains data about the job description, candidate skills, project description, hosting organization, etc.).

Observing the Knowledge graphs generated by ChatGPT-3.5 (Figure 2) and GPT-4 (Figure 3), we notice that not all entities and relations were extracted and that some terms/concepts are missing. For this reason, we ran a second, more detailed prompt asking GPT-4 to explicitly generate a comprehensive ontology including all entities and relations from the provided text, to categorize entities into types such as Persons, Organizations, Concepts, and Geographic Locations, and then to identify the relations between these entities. Providing this additional information to GPT-4 resulted in an improved Knowledge graph (Figure 4). However, ChatGPT-3.5 did not produce a quality graph (Figure 5) compared to Figure 2.
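Extraction like the above can also be scripted rather than run through the ChatGPT interface or Termboard. The following is a minimal sketch assuming the `openai` Python client and an API key in the environment; the model name and the idea of parsing the reply downstream are assumptions, not the exact setup used in this study.

```python
from openai import OpenAI  # assumes the `openai` package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract an ontology and create a table of relations with 3 columns "
    "in this order: source, target, and relation name. Also create a table "
    "with 2 columns: the name of the term and an elaborate definition of "
    "the term. Use this text as a basis:\n\n{document}"
)

def extract_relations(document: str, model: str = "gpt-4") -> str:
    """Send the extraction prompt to the model; the reply contains the
    relation and term tables as plain text, to be parsed downstream."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(document=document)}],
    )
    return response.choices[0].message.content
```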
[Figure 2: The KG generated using ChatGPT-3.5 contains 20 entities. It was able to extract entities and link them to relations, but it failed at abstracting concepts and specifying entities (e.g. partner organizations, location, etc.).]

[Figure 3: The KG generated by GPT-4 contains 16 entities. It was able to identify abstract concepts and geographic entities that ChatGPT-3.5 did not, and extracted more elaborate entities with relations.]

[Figure 4: The KG generated by GPT-4 with the refined prompt contains 22 entities. It identified more key entities and relevant concepts and found suitable relations to connect them (e.g. participant – Katholieke Universiteit Leuven). However, it did not cover all relations and classes (e.g. skills). We also notice a few duplicated entities (e.g. data mining, CO2 emission) and some isolated entities (e.g. sustainable manufacturing).]

[Figure 5: ChatGPT-3.5 with the refined prompt was able to extract a larger number of entities but was not successful at abstracting concepts, and relations are missing. The extracted entities and relations frequently represent complete sentences rather than concepts. This occurs because ChatGPT is a conversational model trained to create responses to a given prompt and is not specifically trained to recognize entities and relations.]

3.2 Python Implementation

We use the free, open-source library spaCy (https://spacy.io/models) for advanced NLP in Python. We employ named entity recognition to identify named entities in a given text using the spaCy model en_core_web_sm. We used a chunk of textual data from the ADRIA Motorhome manual for experimental purposes. Table 2 compares the entities, relations, and triplets extracted from the raw texts. The table shows that the numbers of triplets extracted by the algorithms are similar (Figure 6 and Figure 7). However, the number of entities that spaCy extracts is larger, but not every pair of entities is connected by a meaningful relation, leading to fewer triplets and thus defeating the purpose of creating a Knowledge Base. When using spaCy for entity extraction, entities are typically recognized based on the named entities present in the text: named entities are often specific nouns, such as names of people, organizations, locations, dates, or product names, and spaCy might not identify a domain-specific term as an entity by default. To extract such specific entities, one might need to customize spaCy's NER model or provide additional context for better recognition. Hence, results can be improved by pre-processing the data into a structured format.
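A minimal sketch of this pipeline follows. The text snippet is a stand-in for the ADRIA manual chunk, and the relation heuristic (linking consecutive entities through the sentence's root verb) is our illustrative simplification, not the exact logic used in the experiment.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Stand-in snippet for the chunk of the ADRIA Motorhome manual.
text = ("ADRIA recommends that the vehicle is checked by an authorised "
        "ADRIA dealer in Slovenia before driving on public roads.")

doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Naive relation heuristic: link consecutive entities within a sentence
# through the sentence's root verb (a rough stand-in for real relation
# extraction, which would need dependency patterns or a trained model).
triplets = []
for sent in doc.sents:
    ents = [e for e in doc.ents if sent.start <= e.start and e.end <= sent.end]
    for a, b in zip(ents, ents[1:]):
        triplets.append((a.text, sent.root.lemma_, b.text))

print(entities)
print(triplets)
```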
Table 2: Knowledge extraction comparison (ADRIA motorhome manual dataset)

Algorithm | Entities | Relations | Triplets
GPT-4 | 18 | 20 | 20
ChatGPT-3.5 | 24 | 18 | 18
spaCy | 22 | 14 | 17

[Figure 6: The KG generated by GPT-4 contains 18 entities using the ADRIA motorhome instruction manual. It extracted concepts relevant to ADRIA users and vehicle instructions, their functions, and how they are connected.]

[Figure 7: The KG generated by ChatGPT-3.5 contains 24 entities. It extracted more entities relevant to ADRIA vehicles, but the relations between entities are more generic and entities are duplicated.]

4 EVALUATION

When there is no ground truth data available, creating an automated evaluation metric for a Knowledge graph becomes challenging. In such cases, the evaluation relies on qualitative principles to assess the results. Based on the practical framework defined in the study [1], the following principles were identified:
• Triplets should be concise.
• The contextual information of entities should be captured.
• The Knowledge graph should not contain redundant triples.
• Entities should be densely connected.
• Relations among different types of entities should be included.
• Knowledge graphs should be organized in structured triples for easy processing by machines.
• For tasks specific to a particular domain, it is essential that the Knowledge graph is tailored and relevant to that specific field.

According to these principles, we manually inspected the Knowledge graphs generated above for our use case, and we can conclude that the ChatGPT-3.5 approach provides a more detailed Knowledge graph, without abstract concepts, compared to GPT-4. However, to create these Knowledge graphs, a few rounds of refining the answers from ChatGPT are needed: sometimes the produced output is incorrect and needs to be corrected before proceeding. When we redefined the prompt, GPT-4 identified more specific entities and concepts compared to ChatGPT-3.5. Even though ChatGPT extracted a larger number of entities, it failed to provide abstract concepts and entity relations.

In the second part of the experiment, we employed the NER method to extract relations and entities from the given text (the ADRIA manual). We found that the extracted entities are duplicated and that the relations contain noise and incomplete information. If there are specific patterns or structures according to which entities and relations should be extracted, the relation extraction logic may need to be customized. Alternatively, more advanced natural language processing techniques or pre-trained models designed for relation extraction might provide better results. We also found that about half of the relation-entity pairs extracted by spaCy and ChatGPT overlap.

5 CONCLUSION

The proposed exploration of using LLMs for Knowledge graph extraction holds promise for advancing our understanding of how advanced language models can contribute to structured knowledge representation. This paper explores using LLMs to generate Knowledge graphs from source documents. We utilized the ChatGPT-3.5 and GPT-4 models to generate Knowledge graphs for two different textual datasets and compared the structure of the resulting KGs. GPT-4 performed better, as it successfully identified more abstract concepts and key entities than ChatGPT-3.5. The paper thus provides insights into the practical application of LLMs for developing structured knowledge from unstructured textual data, with potential uses in knowledge-based AI applications, paving the way for more effective information processing and utilization. In future studies, we intend to use a more formal framework to evaluate the quality of the created Knowledge graphs. Such a framework will allow us to efficiently analyze the quality of a KG and provide a standardized method to forecast missing linkages between concepts and relationships within a given domain.
ACKNOWLEDGEMENTS

This research is supported by EU funding, HE MSCA Project APRIORI (GA: 101073551). The authors acknowledge the usage of ChatGPT and Grammarly for content paraphrasing, grammar, and error checking.

REFERENCES
[1] Haihua Chen, Gaohui Cao, Jiangping Chen, and Junhua Ding. 2019. A practical framework for evaluating the quality of knowledge graph. In Knowledge Graph and Semantic Computing: Knowledge Computing and Language Understanding: 4th China Conference, CCKS 2019, Hangzhou, China, August 24–27, 2019, Revised Selected Papers 4. Springer, 111–122.
[2] OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[3] Jeff Z. Pan et al. 2023. Large language models and knowledge graphs: opportunities and challenges. arXiv preprint arXiv:2308.06374.
[4] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: a roadmap. IEEE Transactions on Knowledge and Data Engineering.
[5] Elvis Saravia. 2022. Prompt Engineering Guide.
[6] Milena Trajanoska, Riste Stojanov, and Dimitar Trajanov. 2023. Enhancing knowledge graph construction using large language models. arXiv preprint arXiv:2305.04676.
[7] Ce Zhou et al. 2023. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT. arXiv preprint arXiv:2302.09419.

Solving hard optimization problems of packing, covering, and tiling via clique search

Sándor Szabó (sszabo7@hotmail.com), University of Pécs, Pécs, Hungary
Bogdán Zaválnij (bogdan@renyi.hu), HUN-REN Alfréd Rényi Institute of Mathematics, Budapest, Hungary

Abstract

In this paper we propose to convert NP-hard combinatorial optimization problems of packing, covering, and tiling types into maximum or k-clique problems. The key step is to come up with a tactically constructed auxiliary graph whose maximum or k-cliques correspond to the sought combinatorial structure. As an example, we consider the problem of packing a given cube with copies of a brick. The aim of the paper is twofold: to illustrate (i) the modeling power and (ii) the feasibility of the clique approach. Since theoretical tools are not readily available to study the effectiveness of the solution of the resulting clique problems, we carry out carefully conducted numerical experiments.

Keywords: mathematical programming, k-clique problems, combinatorial optimization

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.9
First, problem of packing a given cube by copies of a brick. The we describe the basic problem, then we present theoretical aim of the paper is two fold to illustrate (i) the modeling discussion of different reformulations, and finally we de- power and (ii) the feasibility of the clique approach. Since scribe the results of numerical experiments. The emphasis theoretical tools are not readily available to study the effec- is on the modeling aspect of the computation and not on tiveness of the solution of the resulting clique problems we reaching new records, as the proposed problem was solved will carry out carefully conducted numerical experiments. in theoretical manner within months of its formulation. Here we use it as a prototype of similar problems, and our Keywords aim to show the versatility of our approach, that is model a problem by a graph. mathematical programming, 𝑘-clique problems, combina- Graphs in this paper will be finite simple graphs. Further torial optimization all graphs we use will not have loops or double edges. A finite simple graph 𝐺 can be described with its set of nodes 1 Introduction 𝑉 and a subset 𝐸 of the Cartesian product 𝑉 × 𝑉 . The subset 𝐸 can be identified by the set of edges of 𝐺. One can see graphs as a mathematical models that can Let 𝐺 = (𝑉, 𝐸) be a finite simple graph. A non-empty describe various fields of interest. Like numbers, functions, subset 𝐶 of 𝑉 is called a 𝑘-clique if each two distinct nodes or Linear Programming graph based approach can model of 𝐶 are adjacent in 𝐺 and in addition 𝐶 has exactly 𝑘 interesting problems and aid us in solving them. Some elements. If 𝐶 has only one element, then we consider it a of these approaches are quite straightforward like cliques 1-clique. The 2-cliques of 𝐺 are the edges of 𝐺. A 𝑘-clique of people in a social interaction graphs or shortest path 𝐶 of 𝐺 is called a maximum clique if 𝐺 does not have problem in a road map. Other approaches are less obvious any (𝑘 + 1)-clique. For each finite simple graph 𝐺 there is but still easily constructed, like conflict graphs in a set of an integer 𝑘 such that 𝐺 contains a 𝑘-clique but 𝐺 does codewords where a maximum independent set represents a not contain any (𝑘 + 1)-clique. This well defined integer maximum set of suitable error correcting codes [9]. 𝑘 is called the clique number of 𝐺. We state two clique But the approach of modeling and solving various prob- problems formally. lems by graphs are more versatile. Namely, we can see graphs as a language for mathematical programming – if Problem 1. Given a finite simple graph 𝐺 and an inte- certain combinatorial problems can be solved by construct- ger 𝑘. Decide if 𝐺 has a 𝑘-clique. ing a suitable auxiliary graph and finding a maximum or 𝑘-clique of this graph gives the solution. The authors have Problem 2. Compute the clique number of a given finite already used this approach in connection with mathemat- simple graph. ical conjectures [1], hyper graph coloring [11], subgraph isomorphism [2], scheduling problems [12], graph coloring Problem 1 is a decision problem, it is referred as the 𝑘- problems [13] and protein docking problems in chemistry clique problem, and it is an NP-complete problem included [8]. in the original list of 21 NP-complete problems by Karp Here we would like to give an example, where a hard [7]. Problem 2 is an optimization problem and referred as combinatorial optimization problem can be solved by this the maximum clique problem, and as the decision problem approach. 
We color the nodes of a finite simple graph G with the colors 1, 2, ..., k such that each node receives exactly one color and adjacent nodes never receive the same color. Such a coloring of the nodes of G is called a well coloring, a proper coloring, or a legal coloring (the terminology is not unified). The set of nodes of G receiving the color i is called the i-th color class. Clearly, a color class is an independent set of G; that is, two nodes from a fixed color class are never adjacent.

If the nodes of a finite simple graph can be legally colored using k colors, then we say that G is a k-partite graph. The reason is that in this situation the nodes of G form a union of k independent sets, and these sets are pair-wise disjoint.

In this paper we will focus on the following clique problem.

Problem 3. Given a finite simple graph G whose nodes are legally colored using k colors, decide if G has a k-clique.

Problem 3 is the k-clique problem particularized to the case of k-partite graphs. This problem is still NP-complete, as the graph coloring problem can be reduced to such a question, as shown in [13]; it should not be confused with the problem on complete graphs.

The problem class we focus on in the present paper consists of packing, covering, and tiling problems. Obviously, many real-world and mathematical problems fall into this class, and here we show some ideas for how such problems can be modeled by a suitably constructed auxiliary graph in which a k-clique search solves the original problem.

2 Packing, covering, and tiling

First, we describe the problem class in question. Second, we draw up some basic concepts of how these problems can be modeled by graphs.

Let U be a finite ground set and let

A1, ..., Am    (1)

be subsets of U. A family of subsets

B1, ..., Bn    (2)

with {B1, ..., Bn} ⊆ {A1, ..., Am} is called a packing of U if the members of the family (2) are pair-wise disjoint. A family of subsets (2) is called a covering of U if the union of (2) is equal to U. Phrasing it differently, a family of subsets (2) is a covering of U if each element of U belongs to at least one member of the family (2). If a family of subsets (2) is a packing and a covering of U at the same time, then it is called a tiling of U. A tiling of U is sometimes referred to as an exact covering of U.

A packing of U is called a k-packing if it consists of k subsets of U. Similarly, a covering of U is called a k-covering if it consists of k subsets of U. Finally, a tiling of U is called a k-tiling if it consists of k subsets of U. For a given ground set U and for its given subsets (1) there is an integer k such that U has a k-packing using subsets of the family (1) but there is no (k+1)-packing of U using members of the family (1). This well-defined integer k is the packing number of U with respect to the family (1). If the packing number of U is equal to k, then each k-packing of U is called a maximum packing of U.

For a given ground set U and for its given subsets (1) there is an integer k such that U has a k-covering using subsets of the family (1) but there is no (k−1)-covering of U using members of the family (1). This well-defined integer k is the covering number of U with respect to the family (1). If the covering number of U is equal to k, then each k-covering of U is called a minimum covering of U.
Two nodes receiving the same color we draw up some basic concepts how these problems can will be non-adjacent in 𝐺. Therefore the first type nodes be modeled by graphs. of 𝐺 are legally colored with 𝑘 colors. Let 𝑈 be a finite ground set and let We are adding second type nodes to 𝐺. Namely, we are 𝐴1, . . . , 𝐴𝑚 (1) adding the ordered pairs (𝐴, 𝑢), where 𝐴 ∈ {𝐴1, . . . , 𝐴𝑚}, be subsets of 𝑈 . A family of subsets 𝑢 ∈ 𝑈 and in addition 𝑢 ∈ 𝐴 holds. The intuitive meaning of the pair (𝐴, 𝑢) is that the element 𝑢 is covered by set 𝐵1, . . . , 𝐵𝑛 (2) 𝐴. To the node (𝐴, 𝑢) we assign 𝑢 as a color. Two nodes with {𝐵 receiving the same color will not be adjacent in 𝐺. Thus 1, . . . , 𝐵𝑛} ⊆ {𝐴1, . . . , 𝐴𝑚} is called a packing of 𝑈 if the members of the family (2) are pair-wise disjoint. A the second type nodes of 𝐺 are legally colored using 𝑡 = |𝑈 | family of subsets (2) is called a covering of 𝑈 if the union of colors. Now if we are locating a (𝑘 + 𝑡)-clique in 𝐺, then (2) is equal to 𝑈 . Phrasing it differently, a family of subsets we select exactly 𝑘 subsets from (1) and each element of (2) is a covering of 𝑈 if each element of 𝑈 belongs to at 𝑈 will belong to at least one of these subsets. The missing least one member of the family (2). If a family of subsets part of the construction, what we left for the reader, is how (2) is a packing and a covering of 𝑈 in the same time, then the first and second types of nodes are connected by edges. it is called a tiling of 𝑈 . A tiling of 𝑈 some times referred Problem 6 can be reduced to Problem 3. As a tiling is as exact covering of 𝑈 . a packing and covering at the same time, we can add the A packing of 𝑈 is called a 𝑘-packing if it consists of packing restrictions, namely not connecting two sets if they 𝑘 subsets of 𝑈 . Similarly, a covering of 𝑈 is called a 𝑘- intersect, to the second type of nodes. On the other hand covering if it consists of 𝑘 subsets of 𝑈 . Finally, a tiling – in case of equal size sets –, we do not need to count the of 𝑈 is called a 𝑘-tiling if it consist of 𝑘 subsets of 𝑈 . For used sets, so we won’t need the first type of nodes, they a given ground set 𝑈 and for its given subsets (1) there can be omitted. is an integer 𝑘 such that 𝑈 has a 𝑘-packing using subsets The computational difficulties of the 𝑘-packing, 𝑘-covering, of the family (1) but there is no any (𝑘 + 1)-packing of 𝑈 and 𝑘-tiling problems are different. It seems that the cov- using members of the family (1). This well defined integer ering problems are the computationally most demanding 𝑘 is the packing number of 𝑈 with respect to the family and the tiling problems are the most manageable. (1). If the packing number of 𝑈 is equal to 𝑘, then each 𝑘-packing of 𝑈 is called maximum packing of 𝑈 . 3 Gardner’s bricks problem For a given ground set 𝑈 and for its given subsets (1) We picked Gardner’s problem because it is intuitive and there is an integer 𝑘 such that 𝑈 has a 𝑘-covering using easy to comprehend among such problems that can be subsets of the family (1) but there is no any (𝑘 −1)-covering reduced to Problem 3 and so it serves as a good illustration of 𝑈 using members of the family (1). This well defined of the kind of clique modeling we are dealing with. We do integer 𝑘 is the covering number of 𝑈 with respect to the not claim any originality in connection with the problem. family (1). If the covering number of 𝑈 is equal to 𝑘, then We do not prove any new results. Each of the facts we each 𝑘-covering of 𝑈 is called minimum covering of 𝑈 . 
3 Gardner's bricks problem

We picked Gardner's problem because it is intuitive and easy to comprehend among the problems that can be reduced to Problem 3, and so it serves as a good illustration of the kind of clique modeling we are dealing with. We do not claim any originality in connection with the problem, and we do not prove any new results. Each of the facts we use is known from the folklore, and we present them only for the reader's convenience. The problem was raised by Foregger in March 1975 [10], popularized by Gardner in February 1976 [5], and solved by Foregger and Mather in November 1976 [3].

Let us consider a brick B of dimensions 1 × 2 × 4. The brick B is a union of 8 unit cubes whose edges are parallel to the coordinate axes. For some reason unknown to us, the brick B is referred to as the canonical brick. Suppose we have a large supply of congruent copies of B and we want to pack as many as possible into a 7 × 7 × 7 cube C. The cube C is a union of 343 unit cubes. Let us divide 343 by 8 with remainder: as 343 = (42)(8) + 7, 43 copies of B cannot be packed into C. M. Gardner advanced the question whether 42 copies of B can be placed into C. One can place a copy of B into C in any rotated position as long as the edges of B are parallel to the coordinate axes. (The answer to this question is actually: no, one cannot place 42 bricks into a cube of size 7 × 7 × 7.)

Gardner's problem can be expressed in terms of computing the clique number of a suitably constructed graph G. In other words, Gardner's problem can be reduced to an instance of the maximum clique problem. Let us denote the set of the 343 unit cubes forming C by U. An 8-element subset v of U is a vertex of G if the union of the elements of v is a congruent copy of B. As it turns out, G has 1008 nodes. Two distinct nodes v and v′ of G are adjacent in G if v and v′ are disjoint. If G contains a (42)-clique, then 42 congruent copies of B can be packed into C. During our numerical experiments a greedy coloring procedure provided a legal coloring of the nodes of G using 42 colors; note that this is just a coincidence, and it could have happened otherwise. Thus we are facing a particular case of the k-clique problem stated in Problem 3: the nodes of G are legally colored with 42 colors, and we are looking for a (42)-clique in G. Phrasing it differently, we are looking for a k-clique in a k-partite graph, where k = 42.

We introduce a coordinate system whose origin coincides with a corner of the cube C.

Observation 1. If 42 congruent copies of the brick B can be packed into C, then there is such a packing which contains a congruent copy of B with one corner at the origin, whose edges of lengths 1, 2, 4 are parallel to the first, second, and third coordinate axes, respectively.

Proof. As 343 = (42)(8) + 7 holds, 7 unit cubes of C are not contained in any brick of the packing. The cube C has 8 corners, so at least one of the corners must be contained in a brick. At this point we introduce a coordinate system whose origin is this corner of C. Then we choose the first, second, and third coordinate axes so as to satisfy our requirement. □

The cube C can be sliced into 7 slabs using planes perpendicular to the first coordinate axis. Each slab is a 1 × 7 × 7 slice of the big cube, that is, a union of 49 unit cubes whose centers lie in a plane perpendicular to the first coordinate axis. The 7 unit cubes of C that are not contained in any brick of the packing are referred to as unpacked unit cubes.

Observation 2. Two distinct unpacked unit cubes of C cannot be in the same slab.

Proof. Note that a fixed slab can contain only 0, 2, 4, or 8 unit cubes of any brick of the packing. The point is that these numbers are all even, while each slab consists of an odd number of unit cubes. Therefore, each slab must contain an odd number of unpacked unit cubes. The number of slabs is 7, and so each slab must contain exactly one unpacked unit cube. □

We can also form slabs by slicing C with planes perpendicular to the second coordinate axis; each of these slabs contains exactly one unpacked unit cube. Finally, slicing C by planes perpendicular to the third axis, we get that each of these slabs also contains exactly one unpacked unit cube. These constraints on the unpacked unit cubes are independent, but can also be checked independently during an extended search, and as such can reduce the search space well. (A programmatic sketch of the graph construction described above follows.)
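The following is a minimal sketch of the construction of the graph G, assuming networkx; enumerating all axis-aligned placements of the 1 × 2 × 4 brick in the 7 × 7 × 7 cube indeed yields the 1008 nodes mentioned above.

```python
from itertools import permutations, product
import networkx as nx

N, BRICK = 7, (1, 2, 4)

def placements():
    """Every axis-aligned position of a 1 x 2 x 4 brick inside the
    7 x 7 x 7 cube, as a frozenset of the 8 unit cubes it occupies."""
    cells_sets = set()
    for dims in set(permutations(BRICK)):          # 6 orientations
        for corner in product(*(range(N - d + 1) for d in dims)):
            cells_sets.add(frozenset(
                product(*(range(c, c + d) for c, d in zip(corner, dims)))))
    return list(cells_sets)

bricks = placements()
assert len(bricks) == 1008  # the node count reported in the paper

# Auxiliary graph G1: placements are nodes, disjoint placements are
# adjacent, so a 42-clique would be a packing of 42 bricks into the cube.
G1 = nx.Graph()
G1.add_nodes_from(bricks)
for i, u in enumerate(bricks):
    for v in bricks[i + 1:]:
        if u.isdisjoint(v):
            G1.add_edge(u, v)
```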
4 Numerical experiments

Gardner's brick packing problem can be turned into various clique search problems, and we carried out numerical experiments with them. We will observe that the same geometric problem leads to very different clique search problems. When we try to pack 42 congruent copies of the canonical brick B into the big cube C, we get a k-clique problem. When we notice that the nodes of the auxiliary graph can be legally colored using 42 colors, we get a k-clique problem in a k-partite graph, which is a more tractable search problem. When we try to pack 42 congruent copies of the brick into the cube C together with 7 unit cubes, we get a tiling problem. When we try to pack 42 congruent copies of the brick into the cube C together with 7 unit cubes and in addition distinguish the unit cubes from each other, we get yet another version of the tiling problem.

In the first approach the auxiliary graph G1 had 1008 vertices. The nodes of G1 were legally colored using 42 colors, and we tried to locate a (42)-clique in G1. Note that although this graph can be colored with 42 colors, this was just a coincidence; there is no theoretical background to this fact. Of course the expectation was that G1 does not have any (42)-clique.

Let us assume that it is possible to pack 42 congruent copies of the 1 × 2 × 4 canonical brick B into the 7 × 7 × 7 cube C. By Observation 1, we may assume that a brick appears in the packing such that one of its corners coincides with the origin of the coordinate system and its edges of lengths 1, 2, 4 lie along the 1st, 2nd, and 3rd coordinate axes. This information can be interpreted as saying that there is a (42)-clique in G1 which contains a specific node, namely the vertex v1 of G1 that corresponds to the special corner brick. This suggests restricting the graph G1 to the neighbors of the vertex v1 to get a new graph G2. Then we are looking for a (41)-clique in G2. Plainly, the nodes of G2 are legally colored using 41 colors; this coloring is inherited from the coloring of the nodes of G1. The graph G2 has fewer vertices than G1 (actually 960), and we are looking for a smaller clique in G2 than in G1. The new clique problem probably requires less computational effort because the graph is smaller and because we introduced symmetry breaking into it.
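Continuing the earlier sketch, the symmetry-breaking restriction that produces G2 amounts to taking the subgraph induced by the neighbors of the corner-brick vertex. The greedy coloring call illustrates how a legal coloring can be obtained; the exact color count depends on the strategy, so it may differ from the authors' run.

```python
from itertools import product
import networkx as nx  # continues the sketch after Section 3; G1 as built there

# The corner brick of Observation 1: edges of lengths 1, 2, 4 along the
# first, second and third coordinate axes, one corner at the origin.
corner_brick = frozenset(product(range(1), range(2), range(4)))
assert corner_brick in G1

# Symmetry breaking: keep only placements compatible with the corner brick.
G2 = G1.subgraph(G1.neighbors(corner_brick)).copy()
print(G2.number_of_nodes())  # 960 in the paper; a (41)-clique is now sought

# A greedy legal coloring; the number of colors upper-bounds the clique
# number (the authors' greedy run happened to use 42 colors on G1).
colors = nx.greedy_color(G1, strategy="largest_first")
print(max(colors.values()) + 1)
```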
The problem of packing 42 bricks into the bigger cube can also be viewed as a tiling problem. Namely, we try to tile the 7 × 7 × 7 cube C by 42 copies of the canonical brick and 7 additional copies of a unit cube. Thus we are facing a tiling problem using two different types of tiles, where the number of tiles is given. To ensure that we use exactly 42 bricks, we number the small cubes as {1, ..., 7} and ensure in the graph that each small unit cube is used once; that is, we do not connect nodes where a unit cube is covered by the same small cube. This tiling problem can also be reduced to a clique search problem; we denote the corresponding graph G3. Tiling problems are more manageable than packing problems, as backtracking can be anticipated earlier during the search. However, the graph associated with the tiling in our case has more vertices than the graph associated with the packing, namely 10,465 nodes. Therefore only computations can reveal which approach is preferable.

Obviously, in this case we can also fix a brick in the corner. This version will be the G4 graph.

In the last clique search equivalent of Gardner's problem we construct a graph G5. In this construction we handle a mixed tiling problem, but we utilize the extra information that no two distinct unit cubes can appear in the same slab; by Observation 2, this may be assumed. This is done by not connecting two nodes associated with unit cubes if those unit cubes lie in the same slab. This graph has the same size as G3, as we only delete some edges from it. Also, we can fix a brick in the corner in this case as well; that shall be the G6 graph.

Once again, only numerical experiments can guide us in judging the merits of the possible clique search equivalents of the problems. Further, the preconditioning methods perform differently on the graphs G1, G2, G3, G4, G5, G6, and this adds an extra layer of difficulty to the numerical work. We used a computer with AMD EPYC 7643 processors, C++, and gcc v12.1 with the settings -O3 -march=znver3.

We made all six graphs and performed k-clique search on them after preconditioning as described in [12, 13]. The preconditioning ran for 1–2 hours for the bigger graphs and reduced them by half, namely to around 6,000 nodes for G3 and G4, and to around 4,000 for G5 and G6, that is, the graphs where we allow only one small cube per slab. For the smaller graphs (G1, G2) the preconditioner runs for a couple of seconds but cannot significantly reduce the graph. Three of the six graphs could be solved after preconditioning: G2, G5, and G6. The solution time of G2 (the original graph with a fixed brick in the corner) was 50 days. The solution of G5 was a bit faster, 29 days. Finally, the graph G6 could be solved more effectively: the running time was 123,484 seconds, that is, 34 hours. This clearly shows the importance of the extra information about the slabs.

5 Conclusions

We detailed several k-clique search reformulations of a certain combinatorial problem in terms of constructing suitable auxiliary graphs. We do not claim that these methods result in more efficient practical computations than other approaches. The point we are trying to make is that the clique reformulations open up the possibility to use well-tuned clique solvers, including preconditioning, to handle different combinatorial problems in a unified manner, as a general solver.

The results presented here have interesting consequences and suggest further research problems. First, and as anticipated, different auxiliary graphs lead to very different search space sizes. And although the usual experience in our research is that bigger graphs tend to be harder, that is not always the case: remarkably, the numerical results indicate that the size of the auxiliary graph alone is not as important as the type of the reformulation. Namely, the tiling-type auxiliary graphs required less computational effort for clique search even when they were not the smallest graphs. Second, there are additional constraints that can be added to some reformulations while they seemingly cannot be incorporated into others. An example of such a constraint is the fact described after the proof of Observation 1, namely that no two distinct unpacked unit cubes can appear in the same slab in Gardner's brick packing problem. That kind of restriction could be incorporated into the tiling version of the reformulation, while possibly not being applicable to the packing reformulation. Taking advantage of the extra constraint made it possible to solve the brick packing problem in reasonable time.

There are other problems that can be solved using similar approaches, as detailed in the paper. The authors could solve smaller instances of the Golomb ruler problem and the Salem–Spencer set problem. The results obtained with those instances, which lie outside the scope of the present paper, open up even more interesting considerations.

Acknowledgements

The present research was funded by the National Research, Development and Innovation Office – NKFIH, Fund No. SNN-135643.
References

[1] K. Corrádi and S. Szabó. A combinatorial approach for Keller's conjecture. Period. Math. Hungar., Vol. 21, 91–100, 1990.
[2] M. Depolli, S. Szabó and B. Zaválnij. An Improved Maximum Common Induced Subgraph Solver. MATCH Commun. Math. Comput. Chem., 84, pp. 7–28, 2020.
[3] T. H. Foregger and M. Mather. E2524. The American Mathematical Monthly, Vol. 83, No. 9 (Nov. 1976), pp. 741–742.
[4] D. Hespe, Ch. Schulz and D. Strash. Scalable Kernelization for Maximum Independent Sets. ACM Journal of Experimental Algorithmics, Vol. 24, Article 1.16, pp. 1–22, 2019.
[5] M. Gardner. Mathematical Games – Some elegant brick-packing problems, and a new order-7 perfect magic cube. Scientific American, Vol. 234, No. 2 (February 1976), pp. 122–127.
[6] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. Freeman, New York, 2003.
[7] R. M. Karp. Reducibility Among Combinatorial Problems. In: Complexity of Computer Computations. New York: Plenum, pp. 85–103, 1972.
[8] K. Rozman, A. Ghysels, B. Zavalnij, T. Kunej, U. Bren, D. Janežič and J. Konc. Enhanced Molecular Docking: Novel Algorithm for Identifying Highest Weight k-Cliques in Weighted General and Protein–Ligand Graphs. Journal of Molecular Structure, 1304, Paper 137639, 2024.
[9] N. J. A. Sloane. Challenge Problems: Independent Sets in Graphs. https://oeis.org/A265032/a265032.html
[10] T. H. Foregger. Elementary Problem E2524. The American Mathematical Monthly, Vol. 82, No. 3 (Mar. 1975), p. 300.
[11] S. Szabó and B. Zaválnij. Reducing hyper graph coloring to clique search. Discrete Applied Mathematics, 264, pp. 196–207, 2019.
[12] S. Szabó and B. Zaválnij. Clique search in graphs of special class and job shop scheduling. Mathematics, 10(5), 697, 2022.
[13] S. Szabó and B. Zaválnij. Graph Coloring via Clique Search with Symmetry Breaking. Symmetry, 14(8), Paper 1574, 16 p., 2022.

Indeks avtorjev / Author index

Abkari M. Wahib: 39
Amiel Tel: 35
Andrenšek Luka .......... 55
Batagelj Vladimir .......... 27
Calcina Erik .......... 93
Candia Vieira Joao Paulo .......... 77
Cherakaoui Manal .......... 39
Čibej Jaka .......... 23
Costa Luiz .......... 77
Dolinar Lenart .......... 93
Dupuis Aymeric .......... 31
Džeroski Sašo .......... 31
Evkoski Bojan .......... 19
Fijavž Zoran .......... 51
Fir Jakob .......... 105
Gilliani Khasa .......... 113
Godoy Oliveira Cristina .......... 77
Golob Luka .......... 47
Gourari Kamal .......... 39
Grigor Patricia-Carla .......... 19
Grobelnik Marko .......... 43, 81, 101
Guček Alenka .......... 81, 85
Hachimi Hanaa .......... 39
Hočevar Domen .......... 7
Hrib Ivo .......... 67
Jermol Mitja .......... 35
Kenda Klemen .......... 7, 11, 73, 113
Kholmska Ganna .......... 73
Klančič Rok .......... 11
Koloski Boshko .......... 31
Kralj Novak Petra .......... 19
Lachheb Hatim .......... 39
Leban Gregor .......... 63, 89
Longar Mark David .......... 101, 105
Martinc Matej .......... 31
Massri M. Besher .......... 81
Meira Silva Rafael .......... 77
Mladenić Dunja .......... 63, 81, 85, 89, 97, 113
Mores Neto Antonio J. .......... 35
Motamedi Elham .......... 59
Novak Erik .......... 93, 101, 113
Novalija Inna .......... 59
Pangeršič Bor .......... 105
Pisanski Jan .......... 27
Pisanski Tomaž .......... 27
Pita Costa Joao .......... 35, 39, 43, 77
Polajnar Anja .......... 35
Pollak Senja .......... 55
Purver Matthew .......... 55
Rei Luis .......... 59
Rožanec Jože M. .......... 63, 73, 89
Šinik Bogdan .......... 15
Sitar Šuštar Katarina .......... 55
Sittar Abdul .......... 47, 85
Šker Tesia .......... 89
Škrjanc Maja .......... 67
Stavrov Filip .......... 109
Stegnar Jernej .......... 63
Stopar Luka .......... 109
Šturm Jan .......... 67
Swati .......... 97
Szabo Sandor .......... 117
Topal Oleksandra .......... 67
Tošić Aleksander .......... 15
Tounsi El Azzoiani Jad .......... 39
Urbanč Luka .......... 43
Vake Domen .......... 15
Vičić Jernej .......... 15
Zaouini Mustafa .......... 39
Zavalnij Bogdan .......... 117