Zbornik 27. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2024, Zvezek C
Proceedings of the 27th International Multiconference INFORMATION SOCIETY – IS 2024, Volume C

Odkrivanje znanja in podatkovna skladišča - SiKDD
Data Mining and Data Warehouses - SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si
7. oktober 2024 / 7 October 2024
Ljubljana, Slovenia

Urednika:
Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2024

Informacijska družba, ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 214428163
ISBN 978-961-264-301-0 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2024

Leto 2024 je hkrati udarno in tradicionalno. Že sedaj, še bolj pa v prihodnosti bosta računalništvo, informatika (RI) in umetna inteligenca (UI) igrali ključno vlogo pri oblikovanju napredne in trajnostne družbe. Smo na pragu nove dobe, v kateri generativna umetna inteligenca, kot je ChatGPT, in drugi inovativni pristopi utirajo pot k superinteligenci in singularnosti, ključnim elementom, ki bodo definirali razcvet človeške civilizacije. Naša konferenca je zato hkrati tradicionalna znanstvena, pa tudi povsem akademsko odprta za nove pogumne ideje, inkubator novih pogledov in idej.

Letošnja konferenca ne le da analizira področja RI, temveč prinaša tudi osrednje razprave o perečih temah današnjega časa – ohranjanje okolja, demografski izzivi, zdravstvo in preobrazba družbenih struktur. Razvoj UI ponuja rešitve za skoraj vse izzive, s katerimi se soočamo, kar poudarja pomen sodelovanja med strokovnjaki, raziskovalci in odločevalci, da bi skupaj oblikovali strategije za prihodnost. Zavedamo se, da živimo v času velikih sprememb, kjer je ključno, da s poglobljenim znanjem in inovativnimi pristopi oblikujemo informacijsko družbo, ki bo varna, vključujoča in trajnostna.

Letos smo ponosni, da smo v okviru multikonference združili dvanajst izjemnih konferenc, ki odražajo širino in globino informacijskih ved: CHATMED v zdravstvu, Demografske in družinske analize, Digitalna preobrazba zdravstvene nege, Digitalna vključenost v informacijski družbi – DIGIN 2024, Kognitivna znanost, Konferenca o zdravi dolgoživosti, Legende računalništva in informatike, Mednarodna konferenca o prenosu tehnologij, Miti in resnice o varovanju okolja, Odkrivanje znanja in podatkovna skladišča – SIKDD 2024, Slovenska konferenca o umetni inteligenci, Vzgoja in izobraževanje v RI. Poleg referatov bodo razprave na okroglih mizah in delavnicah omogočile poglobljeno izmenjavo mnenj, ki bo oblikovala prihodnjo informacijsko družbo. »Legende računalništva in informatike« predstavljajo slovenski »Hall of Fame« za odlične posameznike s tega področja. Razširjeni referati, objavljeni v reviji Informatica z 48-letno tradicijo odličnosti, in sodelovanje s številnimi akademskimi institucijami in združenji, kot so ACM Slovenija, SLAIS in Inženirska akademija Slovenije, bodo še naprej spodbujali razvoj informacijske družbe. Skupaj bomo gradili temelje za prihodnost, ki bo oblikovana s tehnologijami, osredotočena na človeka in njegove potrebe.
S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna RI stroka vsakoletno opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Borut Žalik. Priznanje za dosežek leta pripada prof. dr. Sašu Džeroskemu za izjemne raziskovalne dosežke. »Informacijsko limono« za najmanj primerno informacijsko tematiko je prejela nabava in razdeljevanje osebnih računalnikov ministrstva, »informacijsko jagodo« kot najboljšo potezo pa so prejeli organizatorji tekmovanja ACM Slovenija. Čestitke nagrajencem!

Naša vizija je jasna: prepoznati, izkoristiti in oblikovati priložnosti, ki jih prinaša digitalna preobrazba, ter ustvariti informacijsko družbo, ki bo koristila vsem njenim članom. Vsem sodelujočim se zahvaljujemo za njihov prispevek k tej viziji in se veselimo prihodnjih dosežkov, ki jih bo oblikovala ta konferenca.

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

PREFACE TO THE MULTICONFERENCE INFORMATION SOCIETY 2024

The year 2024 is both ground-breaking and traditional. Now, and even more so in the future, computer science, informatics (CS/I), and artificial intelligence (AI) will play a crucial role in shaping an advanced and sustainable society. We are on the brink of a new era where generative artificial intelligence, such as ChatGPT, and other innovative approaches are paving the way for superintelligence and singularity—key elements that will define the flourishing of human civilization. Our conference is therefore both a traditional scientific gathering and an academically open incubator for bold new ideas and perspectives.

This year's conference analyzes key CS/I areas and brings forward central discussions on pressing contemporary issues—environmental preservation, demographic challenges, healthcare, and the transformation of social structures. AI development offers solutions to nearly all challenges we face, emphasizing the importance of collaboration between experts, researchers, and policymakers to shape future strategies collectively. We recognize that we live in times of significant change, where it is crucial to build an information society that is safe, inclusive, and sustainable, through deep knowledge and innovative approaches.

This year, we are proud to have brought together twelve exceptional conferences within the multiconference framework, reflecting the breadth and depth of information sciences:
• CHATMED in Healthcare
• Demographic and Family Analyses
• Digital Transformation of Healthcare Nursing
• Digital Inclusion in the Information Society – DIGIN 2024
• Cognitive Science
• Conference on Healthy Longevity
• Legends of Computer Science and Informatics
• International Conference on Technology Transfer
• Myths and Facts on Environmental Protection
• Data Mining and Data Warehouses – SIKDD 2024
• Slovenian Conference on Artificial Intelligence
• Education and Training in CS/IS

In addition to papers, roundtable discussions and workshops will facilitate in-depth exchanges that will help shape the future information society. The “Legends of Computer Science and Informatics” represents Slovenia’s “Hall of Fame” for outstanding individuals in this field.
At the same time, extended papers published in the Informatica journal, with over 48 years of excellence, and collaboration with numerous academic institutions and associations, such as ACM Slovenia, SLAIS, and the Slovenian Academy of Engineering, will continue to foster the development of the information society. Together, we will build the foundation for a future shaped by technology, yet focused on human needs.

The autonomous CS/IS community annually recognizes the most outstanding achievements through the awards ceremony. The Michie-Turing Award for an exceptional lifetime contribution to the development and promotion of the information society was awarded to Prof. Dr. Borut Žalik. The Achievement of the Year Award goes to Prof. Dr. Sašo Džeroski. The "Information Lemon" for the least appropriate information topic was given to the ministry's procurement and distribution of personal computers. At the same time, the "Information Strawberry" for the best initiative was awarded to the organizers of the ACM Slovenia competition. Congratulations to all the award winners!

Our vision is clear: to recognize, seize, and shape the opportunities brought by digital transformation and create an information society that benefits all its members. We thank all participants for their contributions and look forward to this conference's future achievements.

Mojca Ciglarič, Chair of the Program Committee
Matjaž Gams, Chair of the Organizing Committee

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee: Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič

Programme Committee: Mojca Ciglarič, chair; Marjan Heričko; Baldomir Zajc; Bojan Orel; Borka Jerman Blažič Džonova; Blaž Zupan; Franc Solina; Gorazd Kandus; Boris Žemva; Viljan Mahnič; Urban Kordeš; Leon Žlajpah; Cene Bavec; Marjan Krisper; Niko Zimic; Tomaž Kalin; Andrej Kuščer; Rok Piltaver; Jozsef Györkös; Jadran Lenarčič; Toma Strle; Tadej Bajd; Borut Likar; Tine Kolenik; Jaroslav Berce; Janez Malačič; Franci Pivec; Mojca Bernik; Olga Markič; Uroš Rajkovič; Marko Bohanec; Dunja Mladenič; Borut Batagelj; Ivan Bratko; Franc Novak; Tomaž Ogrin; Andrej Brodnik; Vladislav Rajkovič; Aleš Ude; Dušan Caf; Grega Repovš; Bojan Blažica; Saša Divjak; Ivan Rozman; Matjaž Kljun; Tomaž Erjavec; Niko Schlamberger; Robert Blatnik; Bogdan Filipič; Stanko Strmčnik; Erik Dovgan; Andrej Gams; Jurij Šilc; Špela Stres; Matjaž Gams; Jurij Tasič; Anton Gradišek; Mitja Luštrek; Denis Trček; Marko Grobelnik; Andrej Ule; Nikola Guid; Boštjan Vilfan

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD
PREDGOVOR / FOREWORD
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES
Integrating Knowledge Graphs and Large Language Models for Querying in an Industrial Environment / Kenda Klemen, Hočevar Domen
Comparative Analysis of Machine Learning Models for Groundwater Level Forecasting: The Impact of Contextual Data / Klančič Rok, Kenda Klemen
Interactive Tool for Tracking Open-source Artificial Intelligence Progress on Hugging Face / Šinik Bogdan, Vake Domen, Vičić Jernej, Tošić Aleksander
Multilingual Hate Speech Modeling by Leveraging Inter-Annotator Disagreement / Grigor Patricia-Carla, Kralj Novak Petra, Evkoski Bojan
Predicting Pronunciation Types in the Sloleks Morphological Lexicon of Slovene / Čibej Jaka
Higher-order bibliographic services based on bibliographic networks / Batagelj Vladimir, Pisanski Jan, Pisanski Tomaž
Are papers all that counts? A bibliometric analysis of the Slovenian scientific community / Dupuis Aymeric, Džeroski Sašo, Koloski Boshko, Martinc Matej
Empowering Open Education Methodologies with AI-based Strategies for the Customization of Education / Amiel Tel, Mores Neto Antonio J., Pita Costa Joao, Polajnar Anja, Jermol Mitja
Addressing Water Sustainability Challenges in North Africa with Artificial Intelligence / Zaouini Mustafa, Pita Costa Joao, Cherakaoui Manal, Hachimi Hanaa, Abkari M. Wahib, Gourari Kamal, Lachheb Hatim, Tounsi El Azzoiani Jad
Predicting poverty using regression / Urbanč Luka, Grobelnik Marko, Pita Costa Joao
Fact Manipulation in News: LLM-Driven Synthesis and Evaluation of Fake News Annotation / Golob Luka, Sittar Abdul
Borrowing Words: Transfer Learning for Reported Speech Detection in Slovenian News Texts / Fijavž Zoran
Connecting company performance to ESG terms in financial reports / Andrenšek Luka, Sitar Šuštar Katarina, Pollak Senja, Purver Matthew
Classification of Patents Into Knowledge Fields: Using a Proposed Knowledge Mapping Taxonomy (KnowMap) / Motamedi Elham, Novalija Inna, Rei Luis
Enhancing causal graphs with domain knowledge: matching ontology concepts between ontologies and raw text data / Stegnar Jernej, Rožanec Jože M., Leban Gregor, Mladenić Dunja
Measuring and Modeling CO2 Emissions in Machine Learning Processes / Hrib Ivo, Šturm Jan, Topal Oleksandra, Škrjanc Maja
Enhancing Ontology Engineering with LLMs: From Search to Active Learning Extensions / Kholmska Ganna, Kenda Klemen, Rožanec Jože M.
On the Brazilian Observatory for Artificial Intelligence / Meira Silva Rafael, Godoy Oliveira Cristina, Costa Luiz, Candia Vieira Joao Paulo, Pita Costa Joao
Pojavljanje incidentov ob uporabi Umetne Inteligence / Grobelnik Marko, Massri M. Besher, Guček Alenka, Mladenić Dunja
Perception of AI in Slovenia / Sittar Abdul, Guček Alenka, Mladenić Dunja
Naslov / Šker Tesia, Rožanec Jože M., Leban Gregor, Mladenić Dunja
Generating Non-English Synthetic Medical Data Sets / Dolinar Lenart, Calcina Erik, Novak Erik
LLNewsBias: A Multilingual News Dataset for Lifelong Learning / Swati, Mladenić Dunja
Creating Local World Models using LLMs / Longar Mark David, Novak Erik, Grobelnik Marko
Semantic video content search and recommendation / Longar Mark David, Fir Jakob, Pangeršič Bor
Continuous Planning of a Fleet of Shuttle Vans as Support for Dynamic Pricing / Stavrov Filip, Stopar Luka
Knowledge graph Extraction from Textual data using LLM / Gilliani Khasa, Novak Erik, Kenda Klemen, Mladenić Dunja
Solving hard optimization problems of packing, covering, and tiling via clique search / Szabo Sandor, Zavalnij Bogdan

Indeks avtorjev / Author index

Odkrivanje znanja in podatkovna skladišča - SiKDD
Data Mining and Data Warehouses - SiKDD

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami in velikimi količinami podatkov; prišlo je do standardizacije procesov in povpraševalnih jezikov. Ko shranjevanje podatkov ni bilo več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke.
Pri avtomatski analizi podatkov sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (knowledge discovery and data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca SiKDD pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

Dunja Mladenić in Marko Grobelnik

FOREWORD

Data-driven technologies have significantly progressed. The first phases were mainly focused on storing and efficiently accessing the data, which resulted in the development of industry tools for managing large databases, related standards, supporting querying languages, etc. After the initial period, when data storage was no longer a primary problem, the development progressed towards analytical functionalities on how to extract added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data.

In automatic data analysis, the system itself tells what might be interesting for the user - this is brought about by knowledge discovery and data mining techniques, which try to obtain new knowledge from existing data and thus provide the user with a new understanding of the events covered in the data. The Slovenian KDD conference SiKDD covers topics dealing with data analysis and discovering knowledge in data: approaches, tools, problems and solutions.

Dunja Mladenić and Marko Grobelnik

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Jožef Stefan Institute, Ljubljana
Alenka Guček, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Dunja Mladenić, Jožef Stefan Institute, Ljubljana
Erik Novak, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Joao Pita Costa, Quintelligence, Ljubljana
Lui Rei, Event Registry, Ljubljana
Jože Rožanec, Jožef Stefan Institute, Ljubljana
Abdul Sitar, Jožef Stefan Institute, Ljubljana
Luka Stopar, SolvesAll, Ljubljana
Swati Swati, Bundeswehr University Munich, Munich
Jan Šturm, Jožef Stefan Institute, Ljubljana
Oleksandra Topal, Jožef Stefan Institute, Ljubljana

Integrating Knowledge Graphs and Large Language Models for Querying in an Industrial Environment

Domen Hočevar (domenhocevar1@gmail.com), Jožef Stefan Institute, Ljubljana, Slovenia
Klemen Kenda (klemen.kenda@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
DOI: https://doi.org/10.70314/is.2024.sikdd.5

Abstract
Knowledge graphs have traditionally required the use of specific query languages, such as SPARQL, to retrieve relevant data. In this paper, we present a system capable of performing natural language queries on knowledge graphs by leveraging retrieval-augmented generation (RAG) and large language models (LLMs). Our system can ingest large knowledge graphs and answer queries using two approaches: first, by utilizing LLMs to extract information directly from subgraphs; and second, by generating SPARQL queries with LLMs and using the results to inform further inference, such as counting the number of items.

Keywords: knowledge graph, semantic inference, Industry 4.0, LLM, RAG
1 Introduction

In the context of Industry 4.0, knowledge graphs play a crucial role in mapping and describing the entire production vertical, from supply and demand dynamics to intricate details within the production process. This includes the configuration of shop floors, production lines, machines, and data setups, extending even to specific datasets generated during operations. Knowledge graphs can also include relevant information about the tools required for particular processes, as well as details about personnel, including their skills and roles.

A key standard for representing such data within the Industry 4.0 initiative is the Asset Administration Shell (AAS) [3], which provides a logical representation for a factory asset (which can also be a piece of software, etc.). By adopting AAS, industries can ensure interoperability and standardization, enabling more efficient data exchange and integration across various systems, ultimately enhancing the agility and responsiveness of manufacturing processes.

Querying knowledge graphs can be a challenging task for end users, as it often requires expertise in specialized query languages such as SPARQL [8] — a skill that is not widely known among non-experts. Working with SPARQL SELECT queries remains a challenge also for LLMs, with performance varying significantly depending on the specific model and task complexity. While the leading LLMs can reliably address basic syntax errors, generating semantically accurate SPARQL SELECT queries remains difficult in many cases [10]. Similar work has been done on interaction with databases; however, even with SQL query generation the results of GPT-4 are still far behind human ability (approx. 55% execution accuracy) [9].

To overcome these challenges, we propose a system that enables users to interact with knowledge graphs through natural language queries. The system leverages LLMs' capabilities to interpret knowledge graphs while compensating for their limited ability to generate fully syntactically and semantically correct SPARQL queries. The proposed system, depicted in Figure 1, leverages large language models (LLMs) [11] to process natural language inputs and provide responses in natural language. Our approach integrates retrieval-augmented generation (RAG) techniques alongside the automatic generation of SPARQL queries based on natural language input [2].

Figure 1: Intended usage of the system: AAS instances are converted into a knowledge graph, enabling natural language queries by the user.

By doing so, our system not only simplifies the querying process but also ensures that the responses are accurate and contextually relevant, making knowledge graphs more accessible and usable for a broader range of users. Additionally, the use of LLMs in combination with SPARQL querying enables the system to handle complex tasks, including those that require logical reasoning, aggregation, or interpretation of data, thus enhancing its utility in real-world applications. For example, our system is able to answer queries such as: “Give me all machines that are capable of drilling a hole with 2cm perimeter”.

Finally, question answering with the help of knowledge graphs and language models has been tackled before [16]; however, the development of retrieval-augmented generation (RAG) systems has seen significant growth recently. In 2024, several preprints have emerged showcasing the application of the RAG approach to knowledge graphs [12, 13, 14]. This paper contributes to this rapidly evolving field by presenting our own advancements and findings.
2 Data

This study uses a generated dataset representing a hypothetical factory with various machine models, designed to test the capabilities of the developed application. The work is part of the Smart Manufacturing pilot in the EU-funded HumAIne project [7], with the aim of eventually using real-world data from participating factories.

The mock factory includes models of "drillers", "circle cutters", and "circular saws", each with unique names, manufacturers, and descriptions. These models are represented using AASs with relevant submodels for energy consumption, manufacturer details, and operation-specific parameters like hole diameter or depth of cut. We created AASs for 7 drilling machine models, 7 circle cutter models, and 10 circular saw models, along with 1,000 machine instances randomly assigned to these models. Numerical values and availability were populated randomly for testing, reflecting potential real-world variations.

The initial step after acquiring AAS data is to convert it into a knowledge graph. This process involves transforming JSON-serialized AASs into RDF triples, which represent the semantic information of the data. Once the RDF triples are generated, they are stored in a GraphDB repository (https://graphdb.ontotext.com/). To enable semantic data retrieval, we employ a connector that interfaces with the ChatGPT Retrieval Plugin (https://github.com/openai/chatgpt-retrieval-plugin), which operates alongside the server application.

When new triples are added to the GraphDB repository, the connector triggers the plugin to generate vector embeddings of the text representations of the new nodes. These embeddings are created using a language model and are stored in a separate vector database. The ChatGPT Retrieval Plugin supports a selection of different vector databases; in our case, we employed the Milvus vector database. The system is also designed to maintain consistency: if any triples are removed from the GraphDB repository, the corresponding vector embeddings are automatically deleted from the vector database.
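To make the conversion step more concrete, the following is a minimal sketch (not the project's actual converter) of turning a simplified, hypothetical JSON-serialized AAS fragment into RDF triples with the rdflib library; the namespace, property names, and JSON layout are illustrative assumptions rather than the real AAS schema.

```python
import json
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical, heavily simplified AAS fragment; real AAS JSON is far richer.
aas_json = """{
  "idShort": "Driller_M1",
  "submodels": [
    {"idShort": "TechnicalData", "properties": {"holeDiameterMm": 20, "voltage": 4}}
  ]
}"""

EX = Namespace("http://example.org/factory#")  # illustrative namespace
g = Graph()
g.bind("ex", EX)

aas = json.loads(aas_json)
asset = EX[aas["idShort"]]
g.add((asset, RDF.type, EX.Asset))
g.add((asset, RDFS.label, Literal(aas["idShort"])))

for sm in aas.get("submodels", []):
    # One node per submodel, linked to the asset; properties become literals.
    sm_node = EX[f'{aas["idShort"]}_{sm["idShort"]}']
    g.add((asset, EX.hasSubmodel, sm_node))
    for prop, value in sm.get("properties", {}).items():
        g.add((sm_node, EX[prop], Literal(value)))

# Turtle output of this kind can then be loaded into a GraphDB repository.
print(g.serialize(format="turtle"))
```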
3 Methodology

The system architecture is illustrated in Figure 2. The user interacts with the system through a client application, developed using ReactJS, which serves as the graphical user interface (GUI). This client application communicates with the system's middleware, which is built on the Flask framework. Users have the capability to upload AAS data to construct and enhance the knowledge graph, as well as to issue natural language queries.

The middleware acts as the core of the system, facilitating communication between the client application, the knowledge graph stored in a GraphDB database, and OpenAI's GPT models. The AAS data uploaded by the user is first converted into RDF triples and then stored in the GraphDB repository. The Flask-based middleware also integrates with the ChatGPT Retrieval Plugin, which is responsible for generating vector embeddings of the knowledge graph nodes using OpenAI's text-embedding-ada-002 model. These vector embeddings are stored in the Milvus vector database [15]. The ChatGPT Retrieval Plugin allows the system to efficiently retrieve the most relevant embeddings in response to user queries, ensuring that the system can provide accurate and contextually appropriate answers. Additionally, the middleware leverages LlamaIndex (https://www.llamaindex.ai/) to manage sub-graph retrieval and query generation, which are essential for responding to complex queries by the user.

Figure 2: System architecture for retrieval augmented generation with knowledge graphs in Industry 4.0.

In summary, the architecture is designed to streamline the process of building a knowledge graph from AAS data and enables users to query this graph with retrieval-augmented generation (RAG) using natural language, with the system handling the complexities of data storage, retrieval, and natural language processing in the background.

The sequence diagram in Figure 3 illustrates the interaction between system components during query processing. Our system enables two distinct approaches to handle natural language queries, often combining both to generate a comprehensive answer for the user.

Figure 3: Sequence diagram of the different approaches for data extraction. The blue box represents the RAG approach and the red box represents the SPARQL query generation approach. Note that the RAG approach utilizes results from SPARQL queries on the knowledge graph.

The first approach utilizes a Retrieval-Augmented Generation (RAG) method. Upon receiving a query, the system analyzes the query to identify relevant concepts and generates vector embeddings for these concepts [5]. These embeddings are then matched against the knowledge graph stored in GraphDB to find the most relevant nodes. Once the relevant nodes are identified, a naive neighborhood expansion is performed, capturing additional related nodes to ensure a more complete context. The search is parameterized using the following parameters: scope, how many nodes from the graph to retrieve; breadth, from how many relevant nodes to start the neighborhood expansion; and score weight, how many more nodes are visited from the identified relevant nodes that are deemed more relevant using embedding similarity. This subgraph, along with a few examples for context, is then fed into the Large Language Model (LLM) using a few-shot [1] learning technique to generate a response [4]. The LlamaIndex framework provides a general context query for turning triples into natural language. This method is particularly effective for queries requiring contextual understanding and extraction of complex information from the knowledge graph.
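The naive neighborhood expansion can be pictured as a bounded breadth-first traversal. The sketch below is our illustrative reading of the scope and breadth parameters described above, run over an in-memory adjacency list rather than a live GraphDB instance; the function and the toy graph are hypothetical, and the score-weight prioritization is only indicated in a comment.

```python
from collections import deque

def expand_neighborhood(adjacency, seed_nodes, scope=100, breadth=1000):
    """Bounded BFS: start from up to `breadth` seed nodes (assumed to be
    ranked by embedding similarity) and collect at most `scope` nodes in
    total. A score weight, as in the paper, could additionally bias the
    traversal toward neighborhoods of higher-similarity seeds."""
    subgraph = set()
    queue = deque(seed_nodes[:breadth])
    while queue and len(subgraph) < scope:
        node = queue.popleft()
        if node in subgraph:
            continue
        subgraph.add(node)
        queue.extend(adjacency.get(node, []))
    return subgraph

# Toy knowledge graph: node -> neighbors.
adjacency = {
    "Driller_M1": ["TechnicalData_M1", "Manufacturer_A"],
    "TechnicalData_M1": ["holeDiameter_20mm"],
    "Manufacturer_A": [],
    "holeDiameter_20mm": [],
}
print(expand_neighborhood(adjacency, ["Driller_M1"], scope=3))
```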
The second approach involves generating a SPARQL query based on the natural language query and the ontology used within the knowledge graph. The system attempts to execute this SPARQL query in the GraphDB database. If the query runs successfully, the resulting data is passed to the LLM to formulate the final answer. This approach is especially beneficial for tasks that involve counting instances or performing specific data aggregation operations, where LLMs alone might struggle. This approach benefits from the first approach, as it can use it as a backup or to enrich the SPARQL query results with additional context.
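As a compressed sketch of how such a generate-execute-summarize loop could be wired up (not the authors' exact implementation), the snippet below uses the OpenAI Python client together with SPARQLWrapper against a GraphDB endpoint; the prompts, endpoint URL, and error handling are simplified assumptions.

```python
from openai import OpenAI
from SPARQLWrapper import JSON, SPARQLWrapper

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
ENDPOINT = "http://localhost:7200/repositories/factory"  # hypothetical GraphDB repo

def answer_with_query_generation(question: str, ontology_summary: str) -> str:
    # 1. Ask the LLM to translate the question into a SPARQL query.
    #    (A real system would also strip markdown fences and retry on errors.)
    generated = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system",
             "content": f"Write a single SPARQL SELECT query. Ontology:\n{ontology_summary}"},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # 2. Execute the generated query against the triple store.
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(generated)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]

    # 3. Let the LLM phrase the raw bindings as a natural-language answer.
    return client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user",
                   "content": f"Question: {question}\nQuery results: {rows}\nAnswer concisely."}],
    ).choices[0].message.content
```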
4 Results

To thoroughly evaluate the system, we employed three different evaluations: (a) assessing the accuracy of data retrieval based on query parameters (not using query generation), (b) evaluating the system's ability to correctly fetch the number of instances (testing query generation), and (c) conducting a manual assessment of the most relevant user queries.

4.1 Accuracy of Data Retrieval

The first approach involved testing the system's ability to accurately retrieve data that met specific query conditions without employing SPARQL query generation. We focused on queries where the user requested a list of machines of a particular type with a voltage requirement less than or equal to a specified value. An example query would be: “Return all drilling machines that consume at most 4 volts and specify their consumption.” We conducted these tests on three types of machines: "drilling machines", "circle cutters", and "circular saws". The voltage values specified in the queries ranged from 0 to 10 volts, inclusive. The evaluation was designed to measure how accurately the system could identify and return the correct set of machines based on these voltage constraints.

For these tests, the following parameters were used (scope: 100, breadth: 1000, score weight: 100, model: gpt-4-1106-preview, query generation strategy: disabled).

The system's performance was assessed by comparing the retrieved data against the expected results, specifically checking the number of machines that met the voltage criteria and identifying any errors, such as incorrect voltage values or unnecessary machine retrievals. Results are depicted in Figures 4 and 5.

Figure 4: Performance of the system by the type of the machine and query.

In Figure 4, each table contains four columns: "V" (voltage specified in the query), "R" (percentage of correctly retrieved machines), "W" (number of machines with incorrect voltage), and "A" (number of unnecessary machine retrievals). Figure 5 summarizes the results: "Fully Correct Answers" shows the percentage of queries that returned all requested information without errors; "Share of Expected Information Found" indicates the proportion of requested information retrieved; and "Share of Incorrectly Displayed Voltages" represents the percentage of retrieved voltages that were incorrect.

Figure 5: Combined performance.

The results show that sometimes the LLM would incorrectly generate a different voltage requirement for a machine, making it appear to satisfy the query conditions. However, the retrieved machines were always of the correct type. For example, a query like “Name all drilling machines and specify their voltage requirements” correctly retrieves all machines with the right specifications, suggesting the issue may lie with the LLM rather than the knowledge retrieval process.

To address this, users can try adjusting query parameters or rewording the query to verify the information's accuracy. If this type of query is crucial, incorporating voltage-specific queries into the query generation strategy could improve reliability, although the LLM may struggle with large lists due to its context window limitations. As shown in Figure 5, these types of queries often do not reliably provide all requested information in one answer, so users should run multiple queries to increase the likelihood of retrieving all necessary data.

4.2 Instance Fetching Accuracy

In these tests, we tested the query generation strategy. The following parameters were used (scope: 100, breadth: 1000, score weight: 100, model: gpt-4-1106-preview, query generation strategy: enabled).

The queries asked for the number of available instances for selected machine models, such as “Get the number of available [name of the machine 1], [name of the machine 2] machine instances. Specify the number for each machine type separately.”. The query format was picked such that the LLM will benefit from query generation (the availability property is specified in the schema supplied for query generation).

A total of 100 queries were run, with 10 queries for each number of specified machine models (ranging from 1 to 10 models). The share of fully correct answers for each query type was between 80 and 100%. The overall accuracy was 96%. This supports our hypothesis that the query generation strategy provides more accurate answers for slightly more complex queries.

4.3 Manual Evaluation of Example Queries

This evaluation was initially performed to identify several shortcomings in our methodologies, as mentioned in the previous subsections. By manually evaluating specific queries relevant to end users, we were able to partially address these issues and fine-tune parameters to achieve more accurate results. For instance, while the system's initial results were often incomplete (e.g., a query did not return all the machines satisfying certain criteria), increasing the breadth parameter to include a larger subgraph and allowing LLMs to traverse a broader neighborhood improved the results. Additionally, we demonstrated that subgraph retrieval and query generation can complement each other, further enhancing overall performance. All the results are commented on in detail in [6].
5 Conclusions

In this paper, we presented a system that bridges the gap between natural language processing and querying knowledge graphs, specifically within the context of Industry 4.0. By leveraging large language models (LLMs) and retrieval-augmented generation (RAG), our system allows users to interact with complex knowledge graphs using natural language queries, thereby simplifying access to detailed manufacturing data.

Our evaluation demonstrated the usability of our system; however, with the integration of LLMs for natural language understanding, some challenges remain. These include occasional inaccuracies in data retrieval and the LLM's limited ability to handle large datasets or specific queries. By adjusting subgraph retrieval parameters such as breadth and scope, and by combining it with SPARQL query generation, we were able to significantly enhance the system's accuracy and reliability.

This work highlights the potential of combining knowledge graphs with LLMs to create more intuitive and effective query systems in industrial environments. Future improvements could focus on refining query strategies and further optimizing the balance between subgraph retrieval and SPARQL generation to ensure even more robust and comprehensive query handling.

Acknowledgements

This work was supported by the European Commission under the Horizon Europe project HumAIne, Grant Agreement No. 101120218. We would like to express our gratitude to all project partners for their contributions and collaboration.

References

[1] Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[2] Diego Bustamante and Hideaki Takeda. 2024. SPARQL generation with entity pre-trained GPT for KG question answering. arXiv preprint arXiv:2402.00969.
[3] 2022. Details of the Asset Administration Shell. https://www.plattform-i40.de/IP/Redaktion/EN/Downloads/Publikation/Details_of_the_Asset_Administration_Shell_Part1_V3.pdf (visited on 02/22/2024).
[4] Chao Feng, Xinyu Zhang, and Zichu Fei. 2023. Knowledge solver: teaching LLMs to search for domain knowledge from knowledge graphs. ArXiv, abs/2309.03118. https://api.semanticscholar.org/CorpusID:261557137.
[5] Luis Gutiérrez and Brian Keith. 2019. A systematic literature review on word embeddings. In Trends and Applications in Software Engineering: Proceedings of the 7th International Conference on Software Process Improvement (CIMPS 2018). Springer, 132–141.
[6] Domen Hočevar. 2024. Integrating Knowledge Graphs and Large Language Models for Querying in an Industrial Environment. Bachelor's Thesis. University of Ljubljana, Faculty of Computer and Information Science, Faculty of Mathematics and Physics, Ljubljana, Slovenia (Aug. 2024). Interdisciplinary University Study Program, First Cycle, Computer Science and Mathematics.
[7] HumAIne Horizon. 2024. HumAIne Horizon. https://humaine-horizon.eu/. Accessed: 2024-08-26.
[8] Jorge Pérez. 2006. Semantics and complexity of SPARQL. In Proc. 5th Int. Semantic Web Conference (ISWC 2006).
[9] Jinyang Li et al. 2024. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Advances in Neural Information Processing Systems, 36.
[10] Lars-Peter Meyer, Johannes Frey, Felix Brei, and Natanael Arndt. 2024. Assessing SPARQL capabilities of large language models. arXiv: 2409.05925 [cs.DB]. https://arxiv.org/abs/2409.05925.
[11] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.
[12] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: a roadmap. IEEE Transactions on Knowledge and Data Engineering.
[13] Diego Sanmartin. 2024. KG-RAG: bridging the gap between knowledge and creativity. arXiv preprint arXiv:2405.12035.
[14] Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, and Dhagash Mehta. 2024. HybridRAG: integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction. arXiv preprint arXiv:2408.04948.
[15] Jianguo Wang et al. 2021. Milvus: a purpose-built vector data management system. In Proceedings of the 2021 International Conference on Management of Data, 2614–2627.
[16] Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. 2021. QA-GNN: reasoning with language models and knowledge graphs for question answering. arXiv preprint arXiv:2104.06378.
Comparative Analysis of Machine Learning Models for Groundwater Level Forecasting: The Impact of Contextual Data

Rok Klančič (rok.klancic@gmail.com), Jožef Stefan Institute, Ljubljana, Slovenia
Klemen Kenda (klemen.kenda@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
DOI: https://doi.org/10.70314/is.2024.sikdd.6

Abstract
This paper presents a comparative evaluation of three distinct categories of models applied to groundwater level data: traditional batch learning methods, time series deep learning methods, and time series foundation models. By enriching the water level data with weather-related features, we significantly improved the effectiveness of simpler models. The results demonstrate that, despite their state-of-the-art performance on univariate datasets and the corresponding publicity, advanced models without contextual feature support are still surpassed by traditional methods trained on enriched datasets.

Keywords: groundwater level prediction, time series forecasting, deep learning, foundation models, contextual data

1 Introduction

Accurate water level prediction is crucial for mitigating the impacts of climate change on water resources. By forecasting water levels, we can better prepare for potential floods and droughts, and more effectively manage our water supplies. However, predicting water levels presents a significant challenge due to the dynamic nature of the data. As climate change leads to prolonged droughts and increasingly erratic precipitation patterns, the need for reliable forecasting methods becomes even more important [2].

In this paper, we aim to compare the performance of various models in forecasting groundwater levels. Specifically, we focus on the differences between traditional batch learning methods that utilize relevant contextual data and newer univariate time series deep learning and foundation models.

The main contributions of this paper are:
• A comparative analysis of the performance of traditional batch learning methods against state-of-the-art time series deep learning techniques and time series foundation models, particularly in the context of feature vectors enriched with relevant contextual data.
• The application of time series foundation models and deep learning methods to the domain of groundwater level forecasting.

The groundwater dataset used in this study has previously been employed for predictive modeling with traditional batch learning methods [9], where extensive feature engineering was also performed. Our work builds upon and extends this earlier research by incorporating a different set of models.
2 Methods

In our experiments, we employed three categories of methods: traditional batch learning techniques, time series deep learning models, and time series foundation models.

2.1 Traditional Batch Learning Methods

In the context of data-driven modelling of environmental issues, traditional batch learning methods have historically demonstrated significant success [5]. In this study, we employed linear regression alongside two tree-based approaches, random forest and gradient boosting [7], as baselines to evaluate whether the newer, more prominent techniques, which have recently gathered a considerable amount of attention, can perform competitively in this specific setting.

All of the chosen batch learning techniques are regression-based and are valued for their simplicity, speed, and ease of use. However, they often lack the complexity necessary to fully capture intricate patterns in the data. To mitigate this limitation, we incorporated contextual features, such as weather data and forecasts (e.g., precipitation, cloud cover, temperature). While the data fusion problem is solved [8], this approach raises concerns about the availability and relevance of the contextual data.

2.2 Time Series Deep Learning Methods

Time series deep learning models are explicitly designed for forecasting time-dependent data. In our study, we employed N-BEATS [12] and PatchTST [10], both of which have architectures tailored to capture trends and seasonalities inherent in time series data. Despite their advanced capabilities, these models have drawbacks, including longer training and inference times, the necessity for extensive hyperparameter tuning to achieve optimal performance, and limited support for incorporating additional features. Although certain models support multivariate time series, they were not utilized in our experiments.

2.3 Time Series Foundation Models

While deep learning methods require separate training and prediction phases, time series foundation models aim to eliminate the training step. Inspired by large language models, these models are pretrained on extensive time series datasets, enabling zero-shot predictions on new time series without additional training. We used CHRONOS [1], an open-source foundation model. The advantages of this approach include ease of use, with minimal parameter adjustments and no need for training. However, similar to deep learning models, they lack support for multivariate time series.

Several studies have already evaluated the performance of various deep learning and foundation models for time series forecasting [1][13]. However, this research extends the application of these forecasting models to groundwater level data, therefore contributing to a better understanding of their effectiveness in this domain.
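As an illustration of the zero-shot workflow, here is a minimal sketch of producing forecasts with the chronos-t5-large checkpoint via the chronos-forecasting package; the synthetic series stands in for the groundwater data, and the exact predict signature may differ slightly between package versions.

```python
import torch
from chronos import ChronosPipeline

# Load the pretrained foundation model (no task-specific training needed).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Synthetic stand-in for a univariate series of daily water-level changes.
context = torch.randn(2500)

# Sample forecast trajectories for the next 5 days; shape is
# (num_series, num_samples, horizon). The per-step median serves
# as a point forecast.
forecast = pipeline.predict(context, prediction_length=5)
point_forecast = forecast[0].median(dim=0).values
print(point_forecast)
```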
3 Experiment Setting

The experiments were conducted on a dataset of groundwater levels in Slovenia. Due to the cumulative nature of water levels, and to facilitate comparison with the original study [9], predictions were made on daily changes in water levels rather than on absolute values.

3.1 Dataset

The groundwater dataset is a subset of the larger dataset used in the study [9]. It consists of groundwater level measurements taken daily from multiple stations across Slovenia. To apply traditional batch learning methods, we enriched the dataset with weather data, associating each water measurement station with the nearest weather station. Due to the availability of weather data, only data from the years 2010 to 2017 was included in our study. For consistency and ease of comparison with the previous study [9], we focused on data from two water measurement stations located in Ljubljana.

In traditional batch learning within the environmental domain, it is essential to not only use the raw data but also to engineer relevant features. Initially, we removed the pressure and dew point features, as they were either unrelated to the target variable or highly correlated with other features [9]. We then created additional features by shifting the data from 1 to 10 days, making historical values available, and by computing the averages of features over a 2- to 10-day window. This process resulted in approximately 2,000 features. Given the excessive number of features, which could degrade model performance, we employed a feature selection algorithm to identify the most informative subset.

We used a genetic feature selection algorithm from scikit-learn, evaluated on a 365-day part of the training dataset, with the maximum number of features set to 40. The algorithm was executed separately for each model, focusing on one station and a prediction horizon of three days, resulting in distinct feature vectors. Subsequently, weather forecast features with longer offsets were manually added to the selected feature set.
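A condensed sketch of the shift-and-average feature construction described above, using pandas; the column names and the toy frame are illustrative, and the subsequent genetic selection step is omitted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "level_change": rng.normal(0, 0.02, 400),   # daily groundwater level change
    "precipitation": rng.gamma(1.0, 2.0, 400),
    "temperature": rng.normal(10, 8, 400),
}, index=pd.date_range("2010-01-01", periods=400, freq="D"))

features = {}
for col in df.columns:
    for shift in range(1, 11):                  # historical values, 1-10 days back
        features[f"{col}_shift_{shift}"] = df[col].shift(shift)
    for window in range(2, 11):                 # rolling means over 2-10 day windows
        features[f"{col}_avg_{window}"] = df[col].rolling(window).mean()

X = pd.DataFrame(features).dropna()
# 3-day-ahead target, matching the horizon used in the selection run;
# trailing NaN rows would be dropped before training.
y = df["level_change"].shift(-3).loc[X.index]
print(X.shape)  # a few dozen columns here; ~2,000 in the full feature space
```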
3.2 Evaluation Metrics

The dataset was split into a training set (approx. 2,500 days), a validation set (100 days), and a test set (365 days) for model evaluation. Model performance was evaluated using the R² score, averaged across all tested stations. Although alternative metrics such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE) were considered, they, for this dataset, produce results that are closely related to the R². This metric was selected due to its robustness against variations in data offset and amplitude, and for direct comparability with the results in the original study [9]. The R² score is defined as:

$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$,

where $y_i$ is the i-th true value, $\hat{y}_i$ is the i-th predicted value, and $\bar{y}$ is the average of the true values.

3.3 Baseline Methods

The primary objective of our research was to compare the performance of traditional batch learning methods, enriched with relevant contextual features, against that of modern deep learning techniques and foundation models for time series forecasting. Therefore, we selected linear regression, random forest regressor, and gradient boosting regressor as our baseline methods. These models were previously applied to the groundwater dataset [9], necessitating a reproduction of the results as a benchmark.

3.4 Implementation Details

The prediction pipelines varied slightly between the different types of models:
• For CHRONOS, we utilized the dataset without weather features, as it only supports univariate time series. Since no hyperparameter tuning was required, the data was divided into training and test sets, omitting the validation set. The model generated the predictions directly from the water level data. We used the chronos-t5-large model from the chronos library.
• For N-BEATS and PatchTST, the same dataset was used, given the same limitation as mentioned previously. However, a validation set was required for hyperparameter tuning. After selecting appropriate hyperparameters, the models were trained on the training set and evaluated on the test set. Implementations from the NeuralForecast library were used for both models.
• For the linear regression, random forest regressor, and gradient boosting regressor models, we included both water level and weather data. Feature selection was conducted to reduce the number of features, resulting in 42 features for linear regression, 30 for random forest, and 36 for gradient boosting. After feature selection, hyperparameters for the random forest and gradient boosting models were tuned, and the data for linear regression was normalized. The models were then trained on the training set and evaluated on the test set using scikit-learn's implementations.

The hyperparameters used for training are listed in Appendix A, while a description of the selected features is provided in Appendix B.

4 Results

The results for all tested models across various prediction horizons are presented in Table 1. The reported R² scores were calculated based on the differences in water levels; if absolute water levels had been used, the R² scores would have been significantly higher. For example, in the case of CHRONOS with 1-day ahead predictions, the R² score is 0.725 for relative level differences and 0.998 for absolute water levels.

Table 1: R² Scores for Different Prediction Horizons and Models.
Methods | 1 day ahead | 2 days ahead | 3 days ahead | 4 days ahead | 5 days ahead
Chronos-large | 0.725 | 0.365 | 0.175 | 0.04 | -0.09
GradientBoostingRegressor | 0.640 | 0.603 | 0.527 | 0.556 | 0.545
RandomForestRegressor | 0.726 | 0.697 | 0.701 | 0.706 | 0.691
N-BEATS | 0.742 | 0.397 | 0.17 | -0.03 | -0.143
PatchTST | 0.721 | 0.394 | 0.215 | 0.109 | -0.02
LinearRegression | 0.792 | 0.781 | 0.785 | 0.784 | 0.780
The best and second-best results are bolded and underlined respectively.

Among the models, linear regression achieved the highest performance, followed by the random forest. In contrast, the more complex methods, including deep learning models and the foundation model, showed generally lower performance, with the exception of the 1-day prediction horizon, where N-BEATS outperformed the tree-based models. Notably, the R² scores decrease as the prediction horizon lengthens, with a more pronounced decline observed in the deep learning and the foundation models compared to the traditional batch learning methods.

Figures 2 and 3 display the predictions from CHRONOS, PatchTST, and linear regression compared to the true data for the 1-day and 5-day prediction horizons. It is evident that the predictions from CHRONOS and PatchTST begin to exhibit a rightward shift as the horizon extends. Figure 1 visualizes the R² scores for all models across the different prediction horizons.

Figure 1: R² Scores for All of the Methods and Prediction Horizons.
Figure 2: Example Predictions for Three Models for 1-Day Prediction Horizon.
Figure 3: Example Predictions for Three Models for 5-Day Prediction Horizon.

The results indicate that traditional methods, when supplemented with relevant contextual features, outperform more complex models that do not incorporate such data. While the 1-day ahead predictions show comparable performance across all methods, as the prediction horizon extends, the accuracy of CHRONOS, PatchTST, and N-BEATS declines sharply. In contrast, the traditional models, supported by contextual features, maintain their predictive accuracy much more effectively, as shown in Figure 1.

A closer examination of the predictions in Figures 2 and 3 reveals that for 1-day ahead predictions, all models track the true data closely. However, in the 5-day ahead predictions, models lacking contextual data begin to exhibit a rightward shift in their predictions. This likely occurs due to the absence of contextual information, causing these models to lag in capturing the true trajectory of water levels. In contrast, models with access to weather data can predict further ahead by accounting for factors such as the impact of rainfall patterns on water levels.

An unexpected finding is that among the baseline models, linear regression outperforms the more sophisticated methods. For instance, in the article [9], while linear regression produced strong results, it did not surpass the performance of the other two methods.
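The gap between R² on level differences and on absolute levels reported above is easy to reproduce: because absolute levels behave like a random walk, even forecasts with modest skill on the daily changes track the level series almost perfectly. A toy illustration with scikit-learn's r2_score (synthetic numbers, not the study's data):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
diffs_true = rng.normal(0, 0.02, 365)               # daily level changes (m)
diffs_pred = diffs_true + rng.normal(0, 0.01, 365)  # imperfect forecasts

levels_true = 300 + np.cumsum(diffs_true)           # absolute levels around 300 m
prev_levels = np.concatenate(([300.0], levels_true[:-1]))
levels_pred = prev_levels + diffs_pred              # 1-day-ahead level forecasts

print(r2_score(diffs_true, diffs_pred))    # moderate score on differences
print(r2_score(levels_true, levels_pred))  # near 1 on absolute levels
```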
5 Conclusion and Future Work

After evaluating all models on the groundwater level dataset, we observed that traditional methods, when equipped with relevant features, consistently outperformed newer and more sophisticated techniques, particularly as the prediction horizon lengthened. This suggests that the emphasis on developing the most powerful deep learning or foundation models for time series predictions may be overstated. With thoughtful selection of contextual features, even the simplest models can outperform modern approaches, which is a significant finding for fields with sufficient contextual data, such as data-driven environmental modelling.

To enhance the robustness of our evaluation, future work could involve testing additional methods, expanding the analysis to include more measurement stations and surface water level data, and incorporating deep learning models that support multivariate time series, such as N-BEATSx [11] and N-HiTS [3]. Further insights could be gained by exploring foundation models with multivariate support, such as TimesFM [4], as well as some more univariate models, like TimeGPT-1 [6]. Future research could also compare the inference times of various models and assess performance across different time series lengths.

Acknowledgements

This work was supported by the European Commission under the Horizon Europe project Plooto, Grant Agreement No. 101092008. We would like to express our gratitude to all project partners for their contributions and collaboration. Furthermore, we would like to thank Erik Novak for his assistance in completing this research.

A Hyperparameters

Table 2: Hyperparameters Used for Gradient Boosting Regressor and Random Forest Regressor.
Hyperparameter | GradientBoosting | RandomForest
n_estimators | 28 | 164
max_features | 'log2' | 0.5
max_depth | 10 | 20

Table 3: Hyperparameters Used for N-BEATS and PatchTST.
Hyperparameter | N-BEATS | PatchTST
loss | HuberLoss | /
n_harmonics | 5 | /
n_polynomials | 5 | /
scaler_type | 'robust' | /
n_blocks | [3, 3, 1] | /
mlp_units | [[128, 128]] | /
horizon | 5 | 5
input_size | 15 | 71
learning_rate | 0.001 | 0.001
max_steps | 25 | 1323
encoder_layers | / | 12
n_heads | / | 16
hidden_size | / | 64
linear_hidden_size | / | 512
dropout | / | 0.2
fc_dropout | / | 0.1
head_dropout | / | 0.1
attn_dropout | / | 0.2
patch_len | / | 16
stride | / | 8
revin | / | True
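A sketch of how the Appendix A configuration maps onto the NeuralForecast library, with synthetic long-format data; only a subset of the listed hyperparameters is shown, and argument names may differ across library versions.

```python
import numpy as np
import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS, PatchTST

# NeuralForecast expects long-format data: unique_id, ds (timestamp), y (target).
df = pd.DataFrame({
    "unique_id": "station_lj_1",
    "ds": pd.date_range("2010-01-01", periods=500, freq="D"),
    "y": np.random.default_rng(0).normal(0, 0.02, 500),
})

models = [
    NBEATS(h=5, input_size=15, max_steps=25, learning_rate=0.001),
    PatchTST(h=5, input_size=71, max_steps=1323, learning_rate=0.001,
             patch_len=16, stride=8),
]
nf = NeuralForecast(models=models, freq="D")
nf.fit(df=df)
print(nf.predict().head())  # 5-day-ahead forecasts per model
```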
References

[1] Abdul Fatir Ansari et al. 2024. Chronos: learning the language of time series. arXiv preprint arXiv:2403.07815.
[2] ARSO. 2009. Freshwater. Retrieved August 27, 2024 from https://www.arso.gov.si/en/soer/freshwater.html.
[3] Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza Ramirez, Max Mergenthaler Canseco, and Artur Dubrawski. 2023. N-HiTS: neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37, 6, 6989–6997.
[4] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. 2023. A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688.
[5] Fan Feng, Hamzeh Ghorbani, and Ahmed E. Radwan. 2024. Predicting groundwater level using traditional and deep machine learning algorithms. Frontiers in Environmental Science, 12. doi: 10.3389/fenvs.2024.1291327.
[6] Azul Garza and Max Mergenthaler-Canseco. 2023. TimeGPT-1. arXiv preprint arXiv:2310.03589.
[7] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. 2009. The elements of statistical learning: data mining, inference, and prediction. Vol. 2. Springer.
[8] Klemen Kenda, Blaž Kažič, Erik Novak, and Dunja Mladenić. 2019. Streaming data fusion for the internet of things. Sensors, 19, 8. doi: 10.3390/s19081955.
[9] Klemen Kenda, Jože Peternelj, Nikos Mellios, Dimitris Kofinas, Matej Čerin, and Jože Rožanec. 2020. Usage of statistical modeling techniques in surface and groundwater level prediction. Journal of Water Supply: Research and Technology-Aqua, 69, 3, (Apr. 2020), 248–265. doi: 10.2166/aqua.2020.143.
[10] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2022. A time series is worth 64 words: long-term forecasting with transformers. arXiv preprint arXiv:2211.14730.
[11] Kin G. Olivares, Cristian Challu, Grzegorz Marcjasz, Rafał Weron, and Artur Dubrawski. 2023. Neural basis expansion analysis with exogenous variables: forecasting electricity prices with NBEATSx. International Journal of Forecasting, 39, 2, 884–900. doi: 10.1016/j.ijforecast.2022.03.001.
[12] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2019. N-BEATS: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437.
[13] Hongwei Ye et al. 2024. A transformer-based forecasting model for F10.7 index and its application study on the Chinese Langfang dataset. Advances in Space Research. doi: 10.1016/j.asr.2024.08.024.

A Hyperparameters

Table 2: Hyperparameters Used for Gradient Boosting Regressor and Random Forest Regressor.

    Hyperparameter   GradientBoosting   RandomForest
    n_estimators     28                 164
    max_features     'log2'             0.5
    max_depth        10                 20

Table 3: Hyperparameters Used for N-BEATS and PatchTST.

    Hyperparameter       N-BEATS        PatchTST
    loss                 HuberLoss      /
    n_harmonics          5              /
    n_polynomials        5              /
    scaler_type          'robust'       /
    n_blocks             [3, 3, 1]      /
    mlp_units            [[128, 128]]   /
    horizon              5              5
    input_size           15             71
    learning_rate        0.001          0.001
    max_steps            25             1323
    encoder_layers       /              12
    n_heads              /              16
    hidden_size          /              64
    linear_hidden_size   /              512
    dropout              /              0.2
    fc_dropout           /              0.1
    head_dropout         /              0.1
    attn_dropout         /              0.2
    patch_len            /              16
    stride               /              8
    revin                /              True

B Selected Features

Due to the large number of features selected by the feature selection algorithm, we provide a summarized description of the most frequently chosen features. The features that appeared most often include shifts and averages of precipitation, precipitation forecasts, temperature, altitude difference, cloud cover, humidity, and snow accumulation. Notably, the majority of selected features were derived features we generated, with only approximately one original feature being selected per model.

In Table 4, the most common shifts and averages for each individual model are presented. The table indicates that shifts and averages of varying lengths were selected, with a slight preference for shorter ones.

Table 4: Most Frequently Selected Shifts and Averages for Various Methods.

    Method                      Shifts (days)   Averages (days)
    GradientBoostingRegressor   4, 10           2, 6
    RandomForestRegressor       2, 6            3, 9
    LinearRegression            2, 10           2, 7
    Combined                    2, 10           2, 3
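To make the appendices concrete, the following is a minimal sketch of how the derived features of Table 4 (shifts and rolling averages) and the Gradient Boosting hyperparameters of Table 2 could be combined, assuming pandas and scikit-learn; the synthetic DataFrame and column names are hypothetical, and the sketch illustrates the general approach rather than the authors' exact pipeline.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical daily measurements with a target and one contextual feature.
    dates = pd.date_range("2020-01-01", periods=400, freq="D")
    df = pd.DataFrame({"water_level": np.random.rand(400).cumsum(),
                       "precipitation": np.random.rand(400)}, index=dates)

    def add_shift_and_average_features(frame, column, shifts=(4, 10), averages=(2, 6)):
        """Derive lagged ('shift') and rolling-mean ('average') features for one column.
        Defaults match the GradientBoostingRegressor row of Table 4."""
        out = frame.copy()
        for s in shifts:
            out[f"{column}_shift_{s}d"] = out[column].shift(s)
        for a in averages:
            out[f"{column}_avg_{a}d"] = out[column].rolling(window=a).mean()
        return out

    df = add_shift_and_average_features(df, "precipitation")
    df["target_5d"] = df["water_level"].shift(-5)  # 5-day-ahead prediction target
    df = df.dropna()

    features = [c for c in df.columns if c not in ("water_level", "target_5d")]
    model = GradientBoostingRegressor(n_estimators=28, max_features="log2", max_depth=10)
    model.fit(df[features], df["target_5d"])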
Interactive Tool for Tracking Open-source Artificial Intelligence Progress on Hugging Face

Bogdan Šinik (bogdan.sinik@famnit.upr.si, UP FAMNIT, Koper, Slovenia)
Domen Vake (domen.vake@famnit.upr.si, UP FAMNIT, Koper, Slovenia)
Jernej Vičič (jernej.vicic@upr.si, UP FAMNIT, UP IAM, Koper, Slovenia)
Aleksandar Tošić (aleksandar.tosic@upr.si, UP FAMNIT, InnoRenew CoE, Koper, Slovenia)

https://doi.org/10.70314/is.2024.sikdd.1

Abstract

Given its increasing importance in our daily lives, Artificial Intelligence (AI) has become a prominent subject that requires extensive investigation and understanding. This study presents an analysis of the open-source community in the field of AI. Various questions arise whenever AI is introduced, and open-source AI raises additional concerns: should AI be universally accessible, or should it be restricted to private use? Is it worthwhile to offer base models to the broad user population? We collected the most important data from the primary website in the field, Hugging Face, and developed a tool that allows for straightforward monitoring of the progress of various open-source AI models using data obtained from its leaderboard. The tool offers accessible and valuable information about various AI models, including their architectures and the activities of their authors. Even a quick review with our tool makes it evident that the open-source community has grown large and has an undeniable impact on the AI community.

Keywords: LLM, open-source, AI, Hugging Face

1 Introduction

Artificial intelligence, particularly in the form of large language models (LLMs), is an important topic in the computer industry today. Despite the numerous fears and dogmas around it, it is certain that AI has become an integral aspect of our lives. This research concentrates on the development of a tool for monitoring the impact of the open-source community in the area of artificial intelligence. As implied, open-source models are accessible to all individuals, and there is considerable debate on whether this type of technology should be universally accessible. We wanted to investigate whether the open-source community is actively contributing to the development of the field, regardless of one's philosophical convictions.

Due to the substantial computational requirements, it was previously impossible to run large language models on personal computers. As increasingly compact versions with impressive capabilities are being produced, this is changing significantly: it is currently feasible to run your own model, as long as it is of modest enough size, on a home computer's graphics processing unit (GPU), even if the GPU is a few years old [9]. This rise in accessibility also enables a larger community to test and develop new solutions and build on top of existing models. We believe there is a clear lack of tools for monitoring the impact of this movement.

Hugging Face (https://huggingface.co/) has grown into one of the primary platforms for the open-source community. Users are able to download and interact with all significant open-source models. Users also have the option to publish their models on the platform and compare their performance by adding them to the leaderboard, where all the models are benchmarked and ranked. The open-source community relies heavily on the distribution of models by large corporations, as creating a model from scratch is a hard undertaking [9]. The platform facilitates collaboration among open-source contributors, enabling them to generate social media content together, exchange ideas, and even publish concise articles. In addition to models, users can create and upload useful datasets. Hugging Face thus hosts the most advanced and innovative developments in the field of open-source AI and machine learning.

An issue we have observed is the absence of effective visualization tools on Hugging Face that would enable users to easily see patterns and gain a comprehensive understanding of the open-source AI area. To address this issue, we have developed a tool that offers users various viewpoints on the data.
2 Literature review

Large Language Models (LLMs) have proven essential in enhancing software engineering (SE) tasks, demonstrating their effectiveness in code comprehension. As with conventional software engineering tools, open-source cooperation is essential for achieving superior products in this area [8].

The article by Patel and Ahmad [9] emphasizes the significance of the open-source AI community and elucidates its rapid growth in the wake of major industry leaders like Google, Microsoft, and OpenAI. The day when the LLama model was first made available to the open-source community is often emphasized as an important milestone in this area; the community promptly recognized the possibilities and potential involved in this release.

Due to its continuous growth, Hugging Face has emerged as the primary platform for exchanging machine learning (ML) models, resulting in an increasing level of complexity. A relational database called HFCommunity was established to facilitate the analysis and resolution of this issue [1].

As noted above, open-source AI models offer an extensive range of possibilities. The authors of HuggingGPT [12] demonstrated an effective use of Hugging Face: because developing a single model with broad intelligence is very difficult, they merged ChatGPT's capabilities with models from Hugging Face using an agentic architecture to obtain impressive results in multiple domains. ChatGPT was tasked with creating a plan of action and assigning specific duties to each open-source model based on its area of expertise. This is an excellent demonstration of the influence and capabilities of the open-source community, given the required familiarity with open models and their capabilities.

The article [6] examines the vulnerabilities associated with open-source AI. A much higher number of repositories with high vulnerabilities was discovered compared to those with low vulnerabilities, particularly in root repositories. This emphasizes the importance of securing the technology in order to facilitate its use.

In a recent paper [10], the authors analyzed the transparency of Hugging Face pre-trained models regarding dataset usage and licenses. The analysis revealed that there is often a lack of transparency regarding the training datasets, inherent biases, and licensing details of pre-trained models, and identified numerous potential licensing conflicts involving client projects. Of the 159,132 models examined, merely 14% explicitly identify their datasets with specific tags. Furthermore, a detailed examination of a statistically significant sample comprising 389 of the most frequently downloaded models showed that 61% documented their training data in some form.
3 Methodology

We obtained the data by extracting the Open LLM Leaderboard from Hugging Face [2], saving the data the server sent to the client. This data contains information about the repositories of models that are currently on the leaderboard and of models waiting to be evaluated for the leaderboard. A Python pipeline, available on GitHub (https://github.com/VakeDomen/HF_analysis), was developed to clean and enrich this data. The leaderboard data includes model architecture and precision, as well as the model type and performance on the following benchmarks: ARC [3], HellaSwag [14], MMLU [5], TruthfulQA [7], WinoGrande [11], and GSM8K [4]. In addition to the data provided on the leaderboard, further information on the given models was obtained using the HF API client, including data about repository contributors, tags, base models, used datasets, and repository activity. It is important to note that this data is self-reported by the developers and is not enforced by Hugging Face. Additionally, the leaderboard includes duplicates because developers can replace models in a repository with different models under the same name, so the duplicates share the same repository data but have distinct performances. Since the current model in a repository cannot be determined programmatically, when removing duplicates we chose the best-performing model under the repository name as the model representing that repository. All datasets were then generated for further use.

The subsequent analysis was conducted using the R programming language. The data was mostly studied from the perspective of time, as our focus was on identifying any obvious trends. The data was categorized using several criteria, such as model type, model architecture, and number of parameters, and was first selected and aggregated to ensure that all crucial components were easily accessible. All models categorized as flagged were excluded from the dataset. In addition, we collected data on the authors' activities and analyzed that particular aspect. Once the data had been cleaned and prepared for visualization, we used the R ggplot library to create visual representations of the data. A comprehensive R Shiny app was developed by aggregating all the visuals. We chose Shiny because it is a great option for constructing interactive data analysis solutions: it enables the development of web applications that respond and adapt to real-time changes and user interactions, which simplifies the process of exploring and analyzing data, and it integrates easily with R, utilizing its robust statistical and graphical functionalities to generate complex, interactive visualizations without requiring experience with web technologies such as HTML, CSS, or JavaScript [13]. Finally, our application was deployed to a server, making it accessible online.
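As a rough illustration of the deduplication rule described above, the following is a minimal Python sketch, assuming scraped leaderboard rows loaded into a pandas DataFrame with hypothetical repository, score, and flagged columns; the actual pipeline in the linked repository may differ in detail.

    import pandas as pd

    # Hypothetical excerpt of scraped leaderboard rows: one repository name can
    # appear several times with different benchmark scores (replaced models).
    rows = pd.DataFrame({
        "repository": ["acme/llm-a", "acme/llm-a", "acme/llm-b"],
        "score":      [61.2,          64.8,         58.9],
        "flagged":    [False,         False,        True],
    })

    # Exclude flagged models, then keep the best-performing entry per repository
    # as its representative, mirroring the deduplication rule described above.
    deduplicated = (
        rows[~rows["flagged"]]
        .sort_values("score", ascending=False)
        .drop_duplicates(subset="repository", keep="first")
    )
    print(deduplicated)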
4 Results

The outcome of this study is the tool we have developed, accessible at https://oai.dltlt.famnit.upr.si/. It offers six distinct viewpoints, each conveniently accessible in its own tab.

The initial view, shown in Figure 1, displays both the count of new models and the distribution of various model types. Hugging Face identifies five distinct categories of models: basic merges and moerges, models fine-tuned on domain-specific datasets, chat models, continuously pretrained models, and pretrained models. If a model did not belong to any of these classes, its type was classified as unknown. The user can effortlessly choose their preferred categories, along with the desired time frame and unit of aggregation (daily, weekly, or monthly). This allows the viewer to clearly observe the evolution of model types and their popularity over time. It is evident that fine-tuned models predominate, which is logical, as users are adapting base models by training them on unique datasets to achieve specialization. We can also see that merged models are a relatively recent phenomenon.

Figure 1: Popularity by model type over time

The second view, shown in Figure 2, has two interconnected visualizations. The upper section displays the activity of the top 10 authors within a specific range of dates, showcasing every model they have developed along with its corresponding type. The lower section presents the average benchmark score for each model, organized by author. This visualization enables users to effortlessly monitor the most prominent authors and observe their patterns and accomplishments in model development over time. Users can choose a certain range of dates and also narrow the list down to the top 10 authors according to their preferences. It is evident that leading authors typically do not chase trends and consistently provide models of a similar type.

Figure 2: Top authors activity over time
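Several of these views aggregate new-model counts by a selectable time unit. The tool itself performs this aggregation in R with ggplot; a rough pandas equivalent, with invented created_at and model_type columns, might look like this:

    import pandas as pd

    models = pd.DataFrame({
        "created_at": pd.to_datetime(["2024-01-03", "2024-01-08", "2024-02-11"]),
        "model_type": ["fine-tuned", "merge", "fine-tuned"],
    })

    # freq="D", "W", or "MS" corresponds to the daily/weekly/monthly unit of aggregation.
    counts = (
        models.groupby([pd.Grouper(key="created_at", freq="W"), "model_type"])
        .size()
        .unstack(fill_value=0)   # one row per period, one column per model type
    )
    print(counts)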
The third view, shown in Figure 3, illustrates two aspects. The first is the change in the average benchmark score for each model type as time progresses. The display showcases the top-performing model for each category and time interval (daily, weekly, or monthly). In addition to the dots representing each model, we have incorporated a smoothed line to help the user see the temporal changes for a particular model type. Alongside this first visualization, we have included a second one that displays the total number of models for each model type within the chosen period range. Through these visualizations, users can easily identify the model type that improved the most and the model types that were produced the most. The trend indicates that open-source AI models are improving, as evidenced by rising average benchmark scores across most model types. The overall number of models is also rapidly increasing, indicating a rise in the popularity of open-source AI models.

Figure 3: Change of benchmark score and total models per type over time

The fourth view, shown in Figure 4, examines the changing popularity of various model architectures over time. The following architectures were chosen for this purpose: LLama, Mixtral, Mistral, Qwen2, Gemma, Phi, Opt, GPT2, and GPT-NeoX; all architectures that did not fit into any one category were classed as "Other". This view has two graphics that depict popularity. The first assesses the popularity of a model relative to itself, based on the number of new models introduced before; the second compares it to the average number of new models created, taking their architecture into account. Both are depicted as colored areas, as this is the most convenient way to track them. Users may analyze the fluctuation in popularity of well-known model architectures over time and examine how the rising popularity of one architecture might impact the popularity of another architecture of interest. The lower plot indicates that LLama and Mistral are the predominant models; nonetheless, they have experienced fluctuations over time, as visible in the upper plot.

Figure 4: Change of popularity of main architectures over time

The fifth view, shown in Figure 5, illustrates the progressive improvement of the key base models developed by well-known companies. This was accomplished by isolating each incremental improvement in score over time, using the base model as a reference. For this purpose, we chose five distinct variations of LLama, Mistral, and Mixtral, as well as three iterations of Phi. The user can easily observe the overall improvement in benchmark scores for each base model, as well as the overall time required for a model to achieve its maximum performance. We have included a feature that lets users toggle the visibility of model labels, enhancing legibility and facilitating more in-depth examination according to their preferences. This allows the user to observe how quickly specific models reached their peak performance and the extent of their improvement relative to the base models.

Figure 5: Evolution of famous base models
The final view, shown in Figure 6, illustrates the impact of significant releases on the popularity of various model designs. As it employs the same model architectures as the fourth view, we extracted and categorized all significant release dates of these models. The user can choose the time unit for aggregation (day, week, or month). Users may quickly analyze the impact of significant releases and observe how they influence the popularity and mass creation of specific models. We can observe the evident impact of the recent LLama and Mistral releases on their popularity.

Figure 6: Effect of big releases on architecture of produced models

5 Conclusion and future work

Given the growing importance of Artificial Intelligence in modern society, it is worth exploring the freely accessible solutions rather than depending solely on commercial alternatives. This paper presents a tool designed to simplify the examination of trends in open-source AI in a user-friendly manner. It offers various viewpoints and enables users to acquire knowledge and reach their own conclusions about the subject. Hugging Face can also function as an excellent tool for finding a particular model. As time progresses, open-source AI is expected to make a growing contribution to the AI community and to provide more specific applications for models that might be ignored by large organizations.

We aim to enhance the functionality of our Shiny application by incorporating more perspectives and expanding the range of data interaction options, and to keep the system as up to date as possible. Beyond that, we want to conduct a comprehensive analysis of the data to identify patterns and correlations within this community, assess the potential of these models, and examine their capabilities and potential uses in addressing real-world issues. We would also like to analyze the sustained popularity and efficacy of these models over a longer time frame.

References

[1] Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2023. HFCommunity: a tool to analyze the Hugging Face Hub community. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 728–732. doi: 10.1109/SANER56733.2023.00080.
[2] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open LLM Leaderboard. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
[3] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv: 1803.05457 [cs.AI].
[4] Karl Cobbe et al. 2021. Training verifiers to solve math word problems. arXiv: 2110.14168 [cs.CL].
[5] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. arXiv: 2009.03300 [cs.CY].
[6] Adhishree Kathikar, Aishwarya Nair, Ben Lazarine, Agrim Sachdeva, and Sagar Samtani. 2023. Assessing the vulnerabilities of the open-source artificial intelligence (AI) landscape: a large-scale analysis of the Hugging Face platform. In 2023 IEEE International Conference on Intelligence and Security Informatics (ISI), 1–6. doi: 10.1109/ISI58743.2023.10297271.
[7] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: measuring how models mimic human falsehoods. arXiv: 2109.07958 [cs.CL].
[8] Zhihao Lin et al. 2024. Open-source AI-based SE tools: opportunities and challenges of collaborative software learning. arXiv preprint arXiv:2404.06201.
[9] Dylan Patel and Afzal Ahmad. 2023. Google: "We have no moat, and neither does OpenAI." SemiAnalysis, May 4, 2023.
[10] Federica Pepe, Vittoria Nardone, Antonio Mastropaolo, Gerardo Canfora, Gabriele Bavota, and Massimiliano Di Penta. 2024. How do Hugging Face models document datasets, bias, and licenses? An empirical study.
[11] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. WinoGrande: an adversarial Winograd Schema Challenge at scale. arXiv: 1907.10641 [cs.CL].
[12] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors. Vol. 36. Curran Associates, Inc., 38154–38180. https://proceedings.neurips.cc/paper_files/paper/2023/file/77c33e6a367922d003ff102ffb92b658-Paper-Conference.pdf.
[13] Carson Sievert. 2020. Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC.
[14] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: can a machine really finish your sentence? arXiv: 1905.07830 [cs.CL].
Multilingual Hate Speech Modeling by Leveraging Inter-Annotator Disagreement

Patricia-Carla Grigor* (University of Vienna, Vienna, Austria)
Bojan Evkoski (evkoski_bojan@phd.ceu.edu, Central European University, Vienna, Austria)
Petra Kralj Novak (novakpe@ceu.edu, Central European University, Vienna, Austria; Jožef Stefan Institute, Ljubljana, Slovenia)

* The first author conducted the research with significant input from the second author, under the supervision and guidance of the third author. All authors contributed to writing the manuscript.

https://doi.org/10.70314/is.2024.sikdd.7

Abstract

As social media usage increases, so does the volume of toxic content on these platforms, motivating the Machine Learning (ML) community to focus on automating hate speech detection. While modern ML algorithms are known to provide nearly human-like results for a variety of downstream Natural Language Processing (NLP) tasks, the classification of hate speech is still an open challenge, partially due to its subjective annotation, which often leads to disagreement between annotators. This paper adopts a perspectivist approach that embraces subjectivity, leveraging conflicting annotations to enhance model performance in real-world scenarios. A state-of-the-art multilingual language model for hate speech detection is introduced, trained, and evaluated using diamond standard data with metrics that consider disagreement. Various strategies for incorporating disagreement are compared in the process. Results demonstrate that the model performs equally well or better on all evaluated languages compared to the respective monolingual models, and drastically outperforms them on multilingual data. This highlights the effectiveness of multilingual and perspectivist methods in addressing the complexities of hate speech detection. The presented multilingual hate speech detection model is available at: https://huggingface.co/IMSyPP/hate_speech_multilingual.
Keywords: hate speech detection, inter-annotator disagreement, multilingual language modeling

1 Introduction

The phenomenon of hate speech, typically defined as offensive or derogatory language targeting individuals or groups based on characteristics such as race, religion, ethnic origin, sexual orientation, disability, or gender [2], has become a significant problem on social networks in recent years, with communities being increasingly exposed to toxic content as the networks grow and become more interconnected [13, 3]. Consequently, the Machine Learning (ML) and computational linguistics communities have begun developing content moderation strategies using advanced algorithms and Natural Language Processing (NLP) techniques to detect hate speech [10, 11]. However, a key challenge is the subjectivity of hate speech, as annotators often disagree due to diverse backgrounds and perspectives.

To address this challenge, researchers have proposed alternative methodologies to ground-truthing, including the incorporation of diverse perspectives into the training and evaluation pipelines of ML models [1, 14]. One such approach is introduced by [7], who train monolingual hate speech classifiers in several languages directly on datasets that include disagreement. As an alternative to gold-standard data, such data is referred to as diamond standard data, based on the assumption that more than one single truth exists. In terms of evaluation, the researchers evaluate models from the perspective of disagreement, with the ultimate goal of estimating the agreement between the annotators themselves, as well as between models and annotators, using the appropriate metrics. Their main findings indicate that disagreement between annotators represents an intrinsic limitation on the performance that can be achieved by automated systems.

This paper aims to explore the potential of training a multilingual hate speech model, as well as to further explore ways of incorporating inter-annotator disagreement in model training. The paper is therefore based on the following research questions:

- How does the performance of multilingual hate speech classifiers trained on diamond standard data compare to the performance of monolingual models?
- How can inter-annotator disagreement be effectively incorporated into the classifier fine-tuning process?

In light of these research questions, the expected outcomes are twofold: (1) multilingual classifiers trained on diamond standard data are anticipated to outperform monolingual models, and (2) incorporating inter-annotator disagreement is expected to enhance sensitivity to nuanced hate speech. These findings could benefit computational linguistics research and social media providers by informing the development of more effective content moderation algorithms.

2 Related Work

Several methods exist for incorporating disagreement into ML training pipelines [12, 5], but few focus on hate speech detection. One approach is presented in [7], where monolingual hate speech classifiers were trained for English, Italian, and Slovenian. These classifiers utilized diamond standard datasets sourced from YouTube and Twitter, employing a consistent annotation process for each language. Their main findings indicate that, according to the accuracy scores, the annotators demonstrated a high degree of agreement in approximately 80% of the cases across all three datasets. In terms of Krippendorff's ordinal alpha score, which considers both agreement by chance and the ordering of classes (from least to most severe), the agreement score is approximately 0.6 for all three languages. Furthermore, the evaluation results indicate that the performance of each model aligned with the inter-annotator agreement, both in terms of accuracy and the alpha score. This implies that the performance of models is inherently constrained by the level of agreement among annotators; consequently, when trained on diamond standard data, it is unlikely that the performance of these models can significantly surpass human performance.

This work builds upon these findings by investigating the potential of multilingual models to enhance hate speech detection, with the aim of broadening their applicability across diverse linguistic contexts. Additionally, strategies for incorporating annotator disagreement were explored, with the goal of improving model performance to approach human-level accuracy and agreement.
3 Method

This section details the methodology for training and evaluating the multilingual hate speech classifier presented in this paper. It begins with a brief overview of the datasets used, followed by an explanation of the chosen pre-trained language model that serves as the foundation for fine-tuning. The section concludes with a description of the methods employed for evaluating the models.

3.1 Datasets

Three monolingual datasets, i.e. the English (YouTube), Italian (YouTube), and Slovenian (Twitter) datasets introduced in [7], served as the basis for our multilingual model. Each item was annotated independently by two annotators and assigned to one of four available classes: [Appropriate], [Inappropriate], [Offensive], and [Violent]. In the case of conflicting labels, both annotating instances were kept.

To explore strategies for incorporating disagreement, three multilingual datasets were created (a construction sketch is given below). First, the Duplicate All (DA) dataset, which contains all instances by their respective two annotators from the three monolingual datasets. Second, the Duplicate Disagreement (DD) dataset, in which instances where annotators disagreed appear twice with their respective conflicting labels, while instances they agreed upon appear only once, creating a more balanced training set that reflects both agreement and disagreement and potentially prevents the models from being biased towards instances where annotators agree. And third, the Remove Disagreement (RD) dataset, which consists only of instances where annotators agree. Thus, the first two datasets contain diamond standard data, while the third can be considered a gold standard dataset from which disagreement has been explicitly removed.

All instances in these datasets underwent the same preprocessing steps, such as replacing links and usernames with placeholders. This step was undertaken to mitigate any potential biases associated with certain names, as discussed in [6]. Table 1 presents an overview of the label distribution across the three multilingual training sets. The datasets used for monolingual evaluation are the unmodified evaluation sets presented in [7].

Table 1: Label distribution of the multilingual train sets

    Dataset   Acceptable   Inappropriate   Offensive   Violent
    DA        191,677      11,005          112,833     7,145
    DD        111,324      8,346           72,706      4,992
    RD        80,573       2,661           40,255      2,161
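The following is a minimal pandas sketch of how the three training-set variants could be derived from doubly-annotated data, assuming hypothetical text, label_1, and label_2 columns with one row per item; it illustrates the construction logic, not the authors' exact code.

    import pandas as pd

    items = pd.DataFrame({
        "text":    ["comment A", "comment B"],
        "label_1": ["Acceptable", "Offensive"],
        "label_2": ["Acceptable", "Violent"],   # annotators disagree on comment B
    })

    # Long format: one row per (item, annotator) pair.
    long = pd.concat([
        items[["text", "label_1"]].rename(columns={"label_1": "label"}),
        items[["text", "label_2"]].rename(columns={"label_2": "label"}),
    ])

    agree = items["label_1"] == items["label_2"]

    # DA: every annotation kept, so agreed-upon items appear twice with the same label.
    da = long

    # DD: disagreements kept twice (both labels), agreements kept only once.
    dd = pd.concat([
        items.loc[agree, ["text", "label_1"]].rename(columns={"label_1": "label"}),
        long[long["text"].isin(items.loc[~agree, "text"])],
    ])

    # RD: only items the annotators agree on, with their single shared label.
    rd = items.loc[agree, ["text", "label_1"]].rename(columns={"label_1": "label"})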
3.2 Model Selection and Fine-Tuning

Our proposed multilingual hate speech model builds on the pre-trained XLM-R transformer model [4], chosen for its proven effectiveness in cross-lingual understanding and its ability to handle a wide range of languages. This provides a robust foundation for fine-tuning and optimization, particularly since English, Italian, and Slovenian (the languages used for fine-tuning) were included in XLM-R's pre-training. To explore the various strategies for incorporating annotator disagreement during training, three model variants were fine-tuned on the previously presented datasets, referred to in the tables as MDA, MDD, and MRD, respectively.

To address class imbalance and enhance model performance on minority classes, a custom training loop with a weighted cross-entropy loss function was implemented, as proposed in [9]. The class weights were calculated to be inversely proportional to the frequency of each hate speech class within the training data. The hyperparameters for the fine-tuning process included a learning rate of 6 × 10⁻⁶, a batch size of 8, and 3 training epochs. During the training phase, the AdamW optimizer was employed to optimize the model parameters. The fine-tuning process was implemented using PyTorch.

3.3 Model Evaluation

For evaluation, the approach introduced in [7] was replicated in order to compare the performance of the multilingual classifiers to human judgment from the perspective of disagreement. This was achieved by employing identical measures to estimate the agreement between human annotators, as well as the agreement between annotators and models. Accuracy, F1 score and, most notably, Krippendorff's ordinal alpha were used to evaluate all models in this research.

Although rarely used in ML applications, Krippendorff's alpha is a robust measure for assessing inter-rater reliability that accounts for agreement beyond what might occur by chance. It is applicable across various data types (nominal, ordinal, interval, and ratio scales) and is particularly effective in dealing with missing data. The value of Krippendorff's alpha ranges from -1 to 1, where 1 indicates perfect agreement and 0 suggests agreement equivalent to chance. Generally, an alpha above 0.80 is considered strong agreement, while in hate speech datasets, alpha values range from 0.25 to 0.65. For a detailed discussion, see Krippendorff [8].
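As a rough sketch of the fine-tuning setup under the stated hyperparameters, the following shows a single weighted cross-entropy training step with XLM-R, assuming the Hugging Face transformers library; the inverse-frequency weight formula is one common instantiation of "inversely proportional to class frequency", and dataset loading, batching, and the exact loop structure in the actual experiments may differ.

    import torch
    from torch.optim import AdamW
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=4)  # Acceptable/Inappropriate/Offensive/Violent

    # Class weights inversely proportional to class frequency (DA counts, Table 1).
    counts = torch.tensor([191677., 11005., 112833., 7145.])
    weights = counts.sum() / (len(counts) * counts)
    loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

    optimizer = AdamW(model.parameters(), lr=6e-6)  # batch size 8, 3 epochs in the paper

    texts, labels = ["example comment"], torch.tensor([0])
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    logits = model(**batch).logits
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()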
4 Results

This section presents the evaluation results for the multilingual model and its variants. It starts with an evaluation from the perspective of inter-annotator and model-annotator agreement. Then, the class-specific evaluation results are presented, along with a model comparison based on the models' average scores. The models are also compared to monolingual baselines fine-tuned on data for their respective languages, including the BERT model for English, AlBERTo for Italian, and CroSloEngual for Slovenian, as presented in [7].

4.1 Inter-Annotator and Model-Annotator Agreement

The inter-annotator agreement was computed on the evaluation sets for each language using Krippendorff's alpha and accuracy. The same measures were also used to compute the agreement between the annotators and the models. The results are presented in Table 2.

Table 2: Inter-Annotator Agreement compared to model-annotator agreement in terms of Krippendorff's ordinal alpha (α) and Accuracy (Acc.) for the models Multilingual Duplicate All (MDA), Multilingual Duplicate Disagreement (MDD), and Multilingual Remove Disagreement (MRD), based on the language-specific evaluation sets

    Dataset     Inter-Ann. α   Inter-Ann. Acc.   MDA α   MDA Acc.   MDD α   MDD Acc.   MRD α   MRD Acc.
    English     58.19          82.91             55.89   79.97      50.18   76.47      57.90   81.41
    Italian     57.00          81.79             58.29   82.00      56.15   80.43      57.84   82.69
    Slovenian   56.62          79.43             55.74   78.60      52.95   76.52      55.15   78.84

First, in the case of inter-annotator agreement, annotators agree around 80% of the time in terms of accuracy, with accuracy scores between 79% and 82% across all three datasets. However, accuracy accounts neither for class imbalance nor for the ordering of the classes. A more appropriate estimate of the agreement is computed through Krippendorff's ordinal alpha: here, the annotators achieve agreement scores between 0.56 and 0.58 across the three languages.

Second, the same metrics were applied to the agreement between annotators and models. The results demonstrate a consistent level of agreement between the models and annotators across all cases. Based on accuracy scores, all models align with at least one annotator approximately 80% of the time, with alpha values comparable to inter-annotator scores. In most instances, the models reach the upper limit of inter-annotator agreement, and in some cases even exceed it (e.g., MDA on Italian). This suggests that the models are effectively learning consistent patterns or biases that align well with one or more annotators. Such outcomes are expected in scenarios where annotator disagreement is largely due to subjective interpretation. This should not be construed as the model being inherently superior, but rather as an indication of its efficiency in modeling the predominant patterns present in the training data.

Third, a comparison between the multilingual variants shows that the Duplicate Disagreement (DD) strategy consistently yields worse alpha scores, meaning that emphasizing disagreement alone might be detrimental in training. No consistent difference between Duplicate All (DA) and Remove Disagreement (RD) is evident from the experiments.
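An agreement computation like the one behind Table 2 can be reproduced, for example, with the krippendorff package on PyPI; the label vectors below are invented, with the classes ordered from least to most severe and mapped to integers.

    import numpy as np
    import krippendorff

    # Rows are raters (annotator vs. annotator, or annotator vs. model), columns
    # are items; 0=Acceptable, 1=Inappropriate, 2=Offensive, 3=Violent.
    annotator_1 = [0, 0, 2, 3, 1, 0]
    annotator_2 = [0, 1, 2, 2, 1, 0]

    reliability_data = np.array([annotator_1, annotator_2], dtype=float)
    alpha = krippendorff.alpha(reliability_data=reliability_data,
                               level_of_measurement="ordinal")
    accuracy = np.mean(np.array(annotator_1) == np.array(annotator_2))
    print(f"ordinal alpha = {alpha:.3f}, accuracy = {accuracy:.3f}")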
4.2 Model Comparison

To evaluate the performance of the models across the four hate speech classes, the F1 score was used. Additionally, the combined (weighted) F1 score was computed for each model to assess its overall performance, and to determine the best-performing model, the weighted F1 scores were averaged across all three languages.

Table 3 shows the results achieved by each of the models on the English evaluation set. The results show that the multilingual model outperforms the baseline monolingual English model on all classes except the [Appropriate] class, on which it still performs competitively. The variant that achieved the highest scores on the minority classes is the MDA model, with an F1 score of 39.16 for the [Inappropriate] class and 27.82 for the [Violent] class. This is most likely due to the weighted cross-entropy loss function, which was effective in improving performance on underrepresented classes, a procedure not performed in [7].

Table 3: Model evaluation results in terms of class-specific F1 scores on the English dataset. The Total score was calculated using the weighted F1 score. The first three models represent the monolingual baselines; the subsequent models represent the multilingual models

    Model   Appropriate   Inappropriate   Offensive   Violent   Total
    EN      89.38         28.95           68.36       24.17     83.44
    IT      85.25         13.81           0.41        0.00      63.39
    SL      88.01         25.17           49.69       2.88      77.71
    MDA     86.10         39.16           68.24       27.82     81.09
    MDD     83.33         34.16           65.07       24.52     78.20
    MRD     87.43         29.90           69.02       27.27     82.18

Similar patterns emerge on the Italian dataset (Table 4). The multilingual model is competitive with the monolingual model while outperforming the Italian baseline on the minority classes. The highest scores on the most important classes, [Violent] and [Offensive], were achieved by the MDA variant, once again showing the superiority of the Duplicate All (DA) strategy.

Table 4: Model evaluation results in terms of class-specific F1 scores on the Italian dataset

    Model   Appropriate   Inappropriate   Offensive   Violent   Total
    EN      86.27         1.28            1.05        0.00      67.42
    IT      91.32         58.46           59.02       40.34     83.22
    SL      86.23         0.76            3.25        0.00      65.95
    MDA     89.77         58.45           60.42       44.97     82.38
    MDD     88.95         56.04           58.31       39.85     81.19
    MRD     90.41         55.46           59.49       38.78     82.50

In the case of the Slovenian dataset, the observed phenomena differ slightly from the previous ones. The evaluation results are presented in Table 5. Here, two of the multilingual variants (MDA and MRD) outperform the Slovenian monolingual model overall, despite predicting worse on the [Appropriate] class. Notably, the monolingual model outperforms all models on the [Violent] class, which was not the case for the other languages. This could be due to language specifics that the multilingual models fail to capture, or to the specifics of the CroSloEngual BERT, which is also heavily pre-trained on Croatian and Slovenian data. Once again, the DA disagreement strategy shows slight superiority over RD.

Table 5: Model evaluation in terms of class-specific F1 scores on the Slovenian dataset

    Model   Appropriate   Inappropriate   Offensive   Violent   Total
    EN      79.93         3.98            2.34        0.00      53.84
    IT      79.84         3.80            1.24        0.00      53.43
    SL      85.70         43.69           65.26       29.12     78.39
    MDA     84.30         45.22           69.69       24.79     78.88
    MDD     82.33         43.39           68.59       23.84     77.19
    MRD     84.98         38.47           68.40       15.50     78.80

Finally, Table 6 shows the average scores of all models, obtained by averaging their combined (weighted) F1 scores across all three languages. Summarizing the multilingual superiority, these final results show how monolingual models falter drastically on unseen languages, while the multilingual models have the capacity to reach the inter-annotator agreement ceiling for all languages.

Table 6: Average performance of models based on class-weighted F1 scores across three languages

    Model   Avg. Weighted F1 Score (all languages)
    EN      68.23
    IT      66.68
    SL      74.02
    MDA     80.78
    MDD     78.86
    MRD     81.16

While the overall results suggest that the Remove Disagreement (RD) gold-standard strategy for incorporating disagreement is best, one should be cautious when drawing such conclusions. The class-specific results show that the Duplicate All (DA) strategy outperforms it on all the classes most relevant to hate speech detection, except for [Appropriate], which is the least relevant class. Another difference is that the MDA model involved training longer on the same data, which might have resulted in improvement on the minority classes and saturation on the majority class. For a fairer future comparison, the fine-tuning process on gold standard data should be adjusted accordingly. The MDA variant of the model is available at: https://huggingface.co/IMSyPP/hate_speech_multilingual.
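The class-specific and combined scores reported in Tables 3-6 correspond to standard per-class and weighted F1 computations, which could be obtained as follows with scikit-learn (the label vectors here are invented):

    from sklearn.metrics import f1_score

    y_true = [0, 0, 2, 3, 1, 2, 0]   # gold labels (0=Acceptable ... 3=Violent)
    y_pred = [0, 1, 2, 3, 1, 0, 0]   # model predictions

    per_class = f1_score(y_true, y_pred, average=None, labels=[0, 1, 2, 3])
    weighted = f1_score(y_true, y_pred, average="weighted")
    print(per_class, weighted)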
5 Discussion

In recent years, automated hate speech detection has become crucial for moderating online content and mitigating the negative impact on social dynamics within online communities. This research proposes a novel multilingual hate speech model to address these challenges on a broader scale. The following discusses the main findings.

First, the inter-annotator agreement and the agreement between annotators and models suggest that inter-annotator agreement sets an intrinsic limit on model performance. Models are limited by the quality and consistency of the annotated data, which directly affects their ability to accurately predict unseen data. However, incorporating areas of disagreement into model development can lead to more robust models capable of handling ambiguous cases by employing one of the several available strategies for incorporating disagreement.

Second, the multilingual model consistently surpassed the monolingual baselines, achieving the inter-annotator agreement ceiling across all languages. This success can be attributed partly to the ability to leverage patterns learned from multiple languages, partly to the vast amounts of data incorporated into state-of-the-art pre-trained multilingual models, and partly to the class weighting scheme employed in the fine-tuning. These findings address the first research question, demonstrating that a multilingual hate speech classifier trained on diamond standard data outperforms its monolingual counterparts.

Finally, this research contributes substantially to hate speech classification in a multilingual context by introducing a novel multilingual hate speech detection model and making it available on the Hugging Face platform. Our model underscores the importance of incorporating inter-annotator disagreement into model development, challenging the reliance on gold standard data in subjective tasks such as hate speech detection.

6 Conclusions

This paper advances automatic hate speech detection by introducing a novel multilingual model fine-tuned on the state-of-the-art XLM-R transformer. By leveraging multilinguality, the model significantly outperforms monolingual baselines, demonstrating its effectiveness across diverse linguistic contexts. This highlights the potential of multilingual approaches in improving hate speech detection, especially in scenarios where content spans multiple languages.

Additionally, this research incorporates inter-annotator disagreement into the fine-tuning process using diamond standard data, offering a valuable alternative to traditional gold-standard models. By embracing rather than ignoring annotator disagreement, the model better reflects the nuances of subjective annotations, enhancing its real-world applicability. However, while this approach shows promise, annotator disagreement continues to present challenges, indicating that further work is needed to fully address its impact on model performance.

Future research could extend this work by evaluating the models on additional languages, exploring alternative baseline models, and refining strategies for incorporating annotator disagreement and handling minority classes. As online hate speech extends its impact, developing robust, multilingual content moderation systems is crucial to maintaining safe and inclusive digital environments.

7 Acknowledgments

The authors acknowledge partial financial support from the Slovenian Research Agency (research core funding no. P2-103).

References

[1] Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate speech detection is not as easy as you may think: a closer look at model validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 45–54.
[2] Alexander Brown. 2017. What is hate speech? Part 2: Family resemblances. Law and Philosophy, 36, 561–613.
[3] Naganna Chetty and Sreejith Alathur. 2018. Hate speech review in the context of online social networks. Aggression and Violent Behavior, 40, 108–118.
[4] Alexis Conneau et al. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
[5] Tommaso Fornaciari, Alexandra Uma, Silviu Paun, Barbara Plank, Dirk Hovy, Massimo Poesio, et al. 2021. Beyond black & white: leveraging annotator disagreement via soft-label multi-task learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
[6] Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115, 16, E3635–E3644.
[7] Petra Kralj Novak, Teresa Scantamburlo, Andraž Pelicon, Matteo Cinelli, Igor Mozetič, and Fabiana Zollo. 2022. Handling disagreement in hate speech modelling. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. Springer, 681–695.
[8] Klaus Krippendorff. 2018. Content analysis: an introduction to its methodology. Sage Publications.
[9] Andraž Pelicon, Syrielle Montariol, and Petra Kralj Novak. 2023. Don't start your data labeling from scratch: opsala-optimized data sampling before labeling. In International Symposium on Intelligent Data Analysis. Springer, 353–365.
[10] Juan Manuel Pérez et al. 2023. Assessing the impact of contextual information in hate speech detection. IEEE Access, 11, 30575–30590.
[11] Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, and Viviana Patti. 2021. Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, 55, 477–523.
[12] Alexandra N Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2021. Learning from disagreement: a survey. Journal of Artificial Intelligence Research, 72, 1385–1470.
[13] William Warner and Julia Hirschberg. 2012. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, 19–26.
[14] Wenjie Yin and Arkaitz Zubiaga. 2021. Towards generalisable hate speech detection: a review on obstacles and solutions. PeerJ Computer Science, 7, e598.
Predicting Pronunciation Types in the Sloleks Morphological Lexicon of Slovene

Jaka Čibej (jaka.cibej@ff.uni-lj.si, jaka.cibej@ijs.si; Faculty of Arts, University of Ljubljana; Jožef Stefan Institute, Ljubljana, Slovenia)

https://doi.org/10.70314/is.2024.sikdd.2

Abstract

We present an experiment dealing with the automatic prediction of pronunciation types for lemmas in the Sloleks Morphological Lexicon of Slovene. We perform a statistical analysis on a number of mostly n-gram-based features and use a set of statistically significant features to train and test several machine learning models that discriminate between lemmas for which a phonetic transcription can be generated automatically using Slovene grapheme-to-phoneme (G2P) conversion rules (e.g. Novak) and lemmas whose pronunciation follows other G2P rules (e.g. Shakespeare).

Keywords: grapheme-to-phoneme conversion, pronunciation types, morphological lexicon, proper nouns, Slovene
1 Introduction

The Sloleks Morphological Lexicon of Slovene [2] is the largest open-access database containing machine-readable information on the morphological properties of Slovene lemmas (e.g. miza 'table', noun, common, feminine) and their inflected forms (e.g. mize, singular, genitive; mizo, singular, accusative). Since version 2.0 [3], each lemma and inflected form also contains accentuated forms (e.g. míza) and phonetic transcriptions in the International Phonetic Alphabet (IPA) and its equivalent X-SAMPA (e.g. IPA: /"mi:za/, X-SAMPA: /"mi:za/). Both transcriptions were generated automatically from accentuated forms, first in version 2.0 using a rudimentary rule-based system, then again in 3.0 with a greatly improved and linguistically informed rule-based grapheme-to-phoneme (G2P) conversion tool for Slovene. (The Slovene G2P tool is part of Pregibalnik, a piece of software used for the automatic expansion of the Sloleks Morphological Lexicon of Slovene: https://github.com/clarinsi/SloInflector. It was developed within the Development of Slovene in the Digital Environment project. The Slovene G2P converter is also available as an API service: https://orodja.cjvt.si/pregibalnik/g2p/docs.)

Rule-based G2P conversion for Slovene (particularly from accentuated forms) yields very good results and leaves only a minority of issues to be resolved manually, because in terms of its orthographic depth, Slovene features a shallow orthography ([9]) in which each grapheme in the alphabet generally corresponds to one phoneme (see e.g. [4]) and the spelling-sound correspondence is relatively direct ([1]; [11]): the pronunciation rules allow words to be pronounced correctly based on their graphemic representation, with some exceptions and several predictable phoneme assimilations (such as the assimilation of voiceless consonant phonemes to their voiced equivalents, glasba 'music', IPA: /"gla:zba/, or vice-versa, voiced-to-voiceless, podpreti 'to support', IPA: /pOt"pre:ti/).

However, not all entries in Sloleks follow Slovene G2P principles. For a number of words, particularly proper nouns denoting people (Shakespeare, Sharon), locations (Sydney, Birmingham), inhabitants (Newyorčan 'New Yorker'), etc., as well as adjectives derived from proper nouns (aachenski 'pertaining to Aachen', Acronijev 'belonging to Acroni'), the phonetic transcription cannot be generated using Slovene G2P rules. In such cases with foreign orthographic elements that indicate relations between graphemes and phonemes that are unusual for Slovene, Slovene linguistic and lexicographic practice (see e.g. [5]) first requires a transliteration into the closest equivalent using Slovene graphemes, which can then be used to generate the phonetic transcription using Slovene G2P rules (e.g. Newyorčan → njújórčan → IPA: /"nju:"jo:rtSan/).

Because of this, it is necessary to discriminate between different pronunciation types: categories of words that follow Slovene G2P rules (Slovene G2P) and those that do not (e.g. Other G2P; more on this in Section 2). Pronunciation types denote the manner in which the phonetic transcription of the word can be generated. In some cases, assigning the pronunciation type to a lemma is trivial: if the lemma contains a grapheme that is not part of the Slovene alphabet (e.g. x, y, w, q), it belongs in the Other G2P category (e.g. Byron, Oxford). (Although ć and đ are not part of the Slovene alphabet, they are phonemically transparent and frequently occur in names of Slovene citizens, so they are not counted as foreign characters for the purposes of this task.) There are, however, many exceptions that belong in the Other G2P category despite being comprised entirely of Slovene graphemes (e.g. Matt, Sharon).

In Sloleks 3.0, the first cca. 100,000 lemmas that had been part of version 2.0 were manually annotated with pronunciation types, whereas the 264,000 new entries (added automatically from the Gigafida 2.0 Corpus of Modern Standard Slovene [6]) still lack this information. Because manual annotation from scratch is time-consuming, we performed an experiment to determine to what degree the pronunciation type can be predicted automatically by relying on the scarce linguistic and morphosyntactic information that can be extracted from an individual lemma.

The paper is structured as follows: we describe the dataset used for the statistical analysis and the machine learning experiment (Section 2), as well as the process of feature selection (Section 3). We train several machine learning models and evaluate their performance using 10-fold cross-validation (Section 4). Finally, we manually evaluate a sample of automatically annotated entries (Section 5) and conclude the paper with our plans for future work (Section 6).
2 Dataset

Sloleks 3.0 contains a total of 365,340 entries, but only approximately 28% have been manually assigned one of 8 pronunciation types (as shown in Table 1); note that all the inflected forms within an entry effectively inherit the pronunciation type. For the classification task, we focus only on the two most frequent pronunciation types (Other G2P and Slovene G2P). Symbols in Sloleks are rare, as are entries in the Ambiguous G2P category (where an entry can either follow Slovene G2P rules or not, depending on the context, e.g. Amanda as a Slovene name: /am"a:nda/, or as an English name with a pronunciation adjusted to the Slovene set of phonemes: /9m"E:nda/). Abbreviations and numerals are easily identifiable, and while acronyms have a separate manner of generating phonetic transcriptions, which also depends on their morphological patterns, they are also mostly identifiable with rules. Because of its rarity and similarity to Slovene G2P, the Slovene G2P with minor deviation category was merged into Slovene G2P for the classification task.

Table 1: Lemmas in Sloleks 3.0 by Pronunciation Type

    Pronunciation Type                 Frequency   %
    -                                  264,538     72.41%
    Slovene G2P                        94,750      25.93%
    Other G2P                          3,066       0.84%
    Numeral                            1,840       0.50%
    Acronym                            845         0.23%
    Slovene G2P with minor deviation   113         0.03%
    Abbreviation                       70          0.02%
    Ambiguous G2P                      69          0.02%
    Symbol                             49          0.01%
    Total                              365,340     100.00%

In terms of their morphosyntactic features, the Other G2P lemmas mostly consist of possessive adjectives and proper nouns, collectively accounting for cca. 90% of the category (as shown in Table 2), but only 15% of the portion of Sloleks annotated with pronunciation types.

Table 2: Lemmas in Sloleks 3.0 with Other G2P Pronunciation Type by Morphosyntactic Properties

    Morphosyntactic Properties   Frequency   %
    Adjective, possessive        1,092       35.62%
    Noun, proper, masculine      958         31.25%
    Noun, proper, feminine       713         23.26%
    Adjective, general           142         4.63%
    Noun, common, masculine      127         4.14%
    Noun, common, feminine       20          0.65%
    Adverb, general              10          0.33%
    Noun, common, neuter         2           0.07%
    Verb, main, imperfective     2           0.07%
    Total                        3,066       100.00%

The final dataset for statistical analysis and machine learning consisted of 94,863 Slovene G2P lemmas (e.g. dekadentnost, Košak, prefiltriran) and 3,066 Other G2P lemmas (e.g. Elizabeth, Presley, Sinclaire).
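The trivial rule mentioned in the introduction, i.e. flagging lemmas that contain a grapheme outside the Slovene alphabet, could be sketched in Python as follows; the grapheme set is a simplification, and as noted above the rule misses Other G2P lemmas made up entirely of Slovene graphemes:

    # Slovene alphabet, with c+/d- (c, d with diacritics) counted as transparent.
    SLOVENE_GRAPHEMES = set("abcčdefghijklmnoprsštuvzž") | {"ć", "đ"}

    def trivially_other_g2p(lemma):
        """True if the lemma contains a grapheme outside the Slovene alphabet
        (e.g. x, y, w, q), which places it in the Other G2P category."""
        return any(ch not in SLOVENE_GRAPHEMES for ch in lemma.lower())

    print(trivially_other_g2p("Byron"))   # True ('y' is not a Slovene grapheme)
    print(trivially_other_g2p("Matt"))    # False, despite Matt being Other G2P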
Noun, proper, masculine | 958 | 31.25%
Noun, proper, feminine | 713 | 23.26%
Adjective, general | 142 | 4.63%
Noun, common, masculine | 127 | 4.14%
Noun, common, feminine | 20 | 0.65%
Adverb, general | 10 | 0.33%
Noun, common, neuter | 2 | 0.07%
Verb, main, imperfective | 2 | 0.07%
Total | 3,066 | 100.00%

The final dataset for statistical analysis and machine learning consisted of 94,863 Slovene G2P lemmas (e.g. dekadentnost, Košak, prefiltriran) and 3,066 Other G2P lemmas (e.g. Elizabeth, Presley, Sinclaire).

3 Statistical Analysis and Feature Selection

From each lemma, we extracted a series of features that could help discriminate between the two classes: (a) the percentage of Slovene G2P graphemes within the lemma (i.e. graphemes of the Slovene alphabet as well as ć and đ); (b) morphosyntactic features (e.g. noun, proper, masculine); (c) relative frequencies(5) of character-level uni-, bi-, and trigrams within the lower-cased lemma (e.g. Matt → f_r(m), f_r(a), ..., f_r(ma), f_r(at), ..., f_r(mat), ...); (d) relative frequencies of character-level uni-, bi-, and trigrams from a robust CVC-conversion of the lemma, substituting consonant graphemes with C and vowel graphemes with V (e.g. Matt → CVCC → f_r(C), f_r(V), ..., f_r(CV), f_r(VC), ..., f_r(CVC), ...); and (e) relative frequencies of character-level uni-, bi-, and trigrams from a finegrained CVC-conversion of the lemma(6) (e.g. Matt → ZVKK → f_r(Z), f_r(V), ..., f_r(ZV), f_r(VK), ..., f_r(ZVK), ...).

For (c), (d), and (e), the initial and final uni-, bi-, and trigrams of the lemma were extracted separately as well, as in some cases the position of the n-gram in the word can be indicative of one class over another.

For general character-level n-grams, the first 1,498 with a frequency of at least 500 across all Sloleks 3.0 lemmas were analyzed; these cover cca. 88.34% of all n-gram occurrences. For robust CVC and finegrained CVC n-grams, all were analyzed. We performed the Kruskal–Wallis H test [7] (k=2, n=97,056) on a total of 6,148 features, out of which 2,490 (40%) were statistically significant.(7) Statistically significant features by category are shown in Table 3. 1,146 features are more indicative of Slovene G2P and 1,344 are more indicative of Other G2P.

Table 3: Statistically Significant Features by Category

Feature Category | Number
Percentage of Slovene G2P characters | 1
Morphosyntactic features | 3
General character-level n-grams | 1,119
Initial character-level n-grams | 398
Final character-level n-grams | 468
General robust CVC n-grams | 66
Initial robust CVC n-grams | 44
Final robust CVC n-grams | 39
General finegrained CVC n-grams | 157
Initial finegrained CVC n-grams | 102
Final finegrained CVC n-grams | 93
Total | 2,490

As shown in Table 4, only three of the top 10 general n-grams indicative of Other G2P actually contain non-Slovene G2P characters, confirming that detecting lemmas from the Other G2P category is more complex and requires more than simply taking into account non-Slovene G2P graphemes.

Table 4: Top 10 Statistically Significant General Character-Level n-Grams by Effect Size (η²)

n-Gram | H | p | η² | Means
y | 11509.36 | ≤ 0.0001 | 0.1186 | μ_S < μ_O
w | 9595.25 | ≤ 0.0001 | 0.0989 | μ_S < μ_O
ch | 7558.60 | ≤ 0.0001 | 0.0778 | μ_S < μ_O
ll | 6295.96 | ≤ 0.0001 | 0.0649 | μ_S < μ_O
ss | 3804.26 | ≤ 0.0001 | 0.0392 | μ_S < μ_O
nn | 3220.65 | ≤ 0.0001 | 0.0332 | μ_S < μ_O
th | 2973.89 | ≤ 0.0001 | 0.0306 | μ_S < μ_O
wa | 2761.53 | ≤ 0.0001 | 0.0284 | μ_S < μ_O
tt | 2745.10 | ≤ 0.0001 | 0.0283 | μ_S < μ_O
co | 2571.20 | ≤ 0.0001 | 0.0265 | μ_S < μ_O

Footnotes:
(1) The Slovene G2P tool is part of Pregibalnik, a piece of software used for the automatic expansion of the Sloleks Morphological Lexicon of Slovene: https://github.com/clarinsi/SloInflector. It was developed within the Development of Slovene in the Digital Environment project. The Slovene G2P converter is also available as an API service: https://orodja.cjvt.si/pregibalnik/g2p/docs
(2) Although ć and đ are not part of the Slovene alphabet, they are phonemically transparent and frequently occur in names of Slovene citizens, so they are not counted as foreign characters for the purposes of this task.
(3) It should be noted that all the inflected forms within the entry effectively inherit the pronunciation type.
(4) Symbols in Sloleks are rare, along with entries within the Ambiguous G2P category (where an entry can either follow Slovene G2P rules or not, depending on the context – e.g. Amanda as a Slovene name: /am"a:nda/, or as an English name with a pronunciation adjusted to the Slovene set of phonemes: /9m"E:nda/). Abbreviations and numerals are easily identifiable, and while acronyms have a separate manner of generating phonetic transcriptions, which also depends on their morphological patterns, they are also mostly identifiable with rules. Because of its rarity and similarity to Slovene G2P, the Slovene G2P with minor deviation category was merged into Slovene G2P for the classification task.
(5) Relative frequencies were calculated as f_r(x_n) = f_a(x_n) / Σ_y f_a(y_n), i.e. the absolute frequency of n-gram x of length n within the lemma divided by the sum of absolute frequencies of all n-grams y of length n within the lemma.
(6) In the finegrained CVC-conversion, consonant graphemes were generalized into more finegrained categories, e.g. graphemes denoting Slovene sonorants (M), voiced (G) and voiceless obstruents (K), foreign consonants (X), etc.
(7) Effect size was calculated as η² = (H − k + 1)/(n − k), as reported in [10].
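To make the feature extraction and significance testing above concrete, here is a minimal, illustrative Python sketch. It is not the authors' code: the vowel set, the helper names (robust_cvc, rel_ngram_freqs), and the toy values are assumptions; the relative-frequency and effect-size formulas follow footnotes 5 and 7.

from collections import Counter
from scipy.stats import kruskal

VOWELS = set("aeiou")  # assumption: a simplified set of Slovene vowel graphemes

def robust_cvc(lemma: str) -> str:
    """Robust CVC-conversion: each grapheme becomes C (consonant) or V (vowel)."""
    return "".join("V" if ch in VOWELS else "C" for ch in lemma.lower())

def rel_ngram_freqs(s: str, n: int) -> dict:
    """Relative frequencies f_r(x_n) = f_a(x_n) / sum_y f_a(y_n) (footnote 5)."""
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

# Features (c) and (d) for the lemma "Matt"
lemma = "matt"
features = {}
for n in (1, 2, 3):
    features.update({f"char_{g}": f for g, f in rel_ngram_freqs(lemma, n).items()})
    features.update({f"cvc_{g}": f for g, f in rel_ngram_freqs(robust_cvc(lemma), n).items()})

# Kruskal-Wallis H test for one feature across the two classes, with the
# effect size eta^2 = (H - k + 1) / (n - k) from footnote 7; values are toy data.
slovene_vals = [0.00, 0.00, 0.10]  # feature values in Slovene G2P lemmas
other_vals = [0.20, 0.30, 0.25]    # feature values in Other G2P lemmas
H, p = kruskal(slovene_vals, other_vals)
k, n_total = 2, len(slovene_vals) + len(other_vals)
eta_sq = (H - k + 1) / (n_total - k)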
4 Pronunciation Type Prediction

The identified features (along with several placeholder n-grams to take into account any graphemes not covered in the initial dataset) were used to develop a custom vectorizer that converts a given lemma and its lexical features, based on the MULTEXT-East v6 (MTE-6) Morphosyntactic Specifications for Slovene,(8) into a 2,500-dimensional numerical vector. The entire dataset was converted into vectors and split into a training set (80%) and a test set (20%), both stratified by class.(9) Three models (Linear Support Vector Classifier (LinearSVC), Multinomial Naive Bayes Classifier (Multin. NB), and k Nearest Neighbors Classifier (kNN)) were trained and evaluated with 10-fold cross-validation.(10) The results are listed in Table 5 and show that LinearSVC outperforms the other two models.

Table 5: Model Performance Based on 10-Fold Cross-Validation

Model | A | BA | P | R | F1 | ROC AUC
LinearSVC | 99.08 | 87.87 | 96.36 | 87.87 | 91.64 | 98.89
Multin. NB | 97.38 | 79.17 | 78.12 | 79.17 | 78.62 | 96.55
kNN (k=5) | 98.25 | 75.17 | 93.67 | 75.17 | 81.74 | 91.63
Majority | 96.87 | – | – | – | – | –

Footnotes:
(8) MTE-6: https://nl.ijs.si/ME/V6/msd/html/msd-sl.html. The vectorizer uses Slovene morphosyntactic tags, e.g. Slz (S – noun, l – proper, z – feminine).
(9) All models were trained using the Python library scikit-learn [8].
(10) A, BA, P, R, and F1 refer to accuracy, balanced accuracy, macro-precision, macro-recall and macro-F1, respectively.
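Footnote 9 confirms that scikit-learn was used; the following sketch reproduces the setup just described (80/20 stratified split, the three classifiers, 10-fold cross-validation) on random stand-in data. Hyperparameters other than k=5 are not given in the paper, so library defaults are assumed.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_validate

# X: one ~2,500-dimensional feature vector per lemma; y: 0 = Slovene G2P, 1 = Other G2P.
# Toy random data stands in for the real vectorized lexicon.
rng = np.random.default_rng(0)
X = rng.random((1000, 2500))
y = rng.integers(0, 2, 1000)

# 80/20 split, stratified by class, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "LinearSVC": LinearSVC(),
    "Multin. NB": MultinomialNB(),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_validate(model, X_train, y_train, cv=10,
                            scoring=["accuracy", "balanced_accuracy", "f1_macro"])
    print(name, {m: s.mean() for m, s in scores.items() if m.startswith("test_")})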
All three models exhibit above-baseline accuracy compared to the majority classifier, but Multinomial NB and kNN perform much worse in terms of balanced accuracy as well as precision and, in the case of kNN, recall. Recall is also somewhat lower with LinearSVC, which is to be expected – some Other G2P lemmas might contain no indicative n-grams and are thus hard to detect; on the other hand, once identified, the model is very precise in its prediction.

Table 6 shows the confusion matrix for the LinearSVC model tested on the 20% stratified test dataset.

Table 6: Confusion Matrix for Linear Support Vector Classifier

↓ Predicted \ True → | Slovene G2P | Other G2P | Σ
Slovene G2P | 18,939 | 140 | 19,079
Other G2P | 34 | 473 | 507
Σ | 18,973 | 613 | –

The model very rarely misclassifies Slovene G2P lemmas, and more frequently errs with Other G2P lemmas. A closer inspection of the misclassified examples reveals several errors in the original dataset: Beethoven, Ratzinger, Rotterdam, Franco, Oberstdorf, and Keller were in fact correctly classified as Other G2P, but they are miscategorized as Slovene G2P in the original dataset. Other misclassifications include examples of foreign proper nouns and possessive adjectives that contain grapheme combinations unusual for Slovene (e.g. Andreas, Aurelio, Hilton, Simpsonov), but whose pronunciation can still be derived from their graphemic representation (e.g. Andreas → IPA: /and"re:as/). Similarly, the misclassifications of Slovene G2P lemmas as Other G2P lemmas include examples such as Doneck, Barson, Bronson, Piersanti, and Faustini. While these are proper nouns of foreign origin, their Slovene pronunciation can either be fully discerned from their graphemic representation (e.g. Doneck → IPA: /dO"ne:tsk/), or it only differs slightly from what Slovene grapheme-to-phoneme conversion would produce (e.g. Faustini → automatically converted IPA: /faus"ti:ni/; correct IPA: /faus"ti:ni/).

5 Manual Evaluation

We trained a new instance of the LinearSVC model on the entire dataset and used it to annotate the remaining cca. 264,000 lemmas from Sloleks 3.0 with no pronunciation type, resulting in 86,730 lemmas with Other G2P and 177,808 lemmas with Slovene G2P. We performed a preliminary manual evaluation consisting of a random sample of 100 examples from each class. The results are shown in the confusion matrix in Table 7. Although the sample is too small to be representative of the whole, it indicates that the model performs well even on unseen data, achieving an accuracy of 88.50% (P=0.91, R=0.87, F1=0.89) over a majority baseline accuracy of 50.00%.

Table 7: Confusion Matrix for Manual Evaluation

↓ Predicted \ True → | Slovene G2P | Other G2P | Σ
Slovene G2P | 86 | 9 | 95
Other G2P | 14 | 91 | 105
Σ | 100 | 100 | –

The misclassifications of Other G2P as Slovene G2P include examples such as Mukhamedov, Beatli, Livenza, and Preidler, with limited indicators that the words belong to the Other G2P category. Most graphemes in these examples are pronounced according to Slovene G2P criteria, with the exception of individual n-grams ('nz', 'ei', 'kh'), some of which have not been included in the set of features. In other examples, only one or two vowel graphemes are indicative of Other G2P pronunciation (e.g. Trendlina, which is also a lemmatization error – the correct lemma is Trendline – and Sanberg), and the pronunciation of single vowel graphemes appears harder to predict than that of consonant graphemes or combinations thereof. On the other hand, Other G2P lemmas misclassified as Slovene G2P include Andersonov, Atkinsov, and Batmanov, in which the grapheme 'a' is pronounced as /E/, but this cannot be discerned from the graphemic representation itself. Other misclassified examples pertain more obviously to Other G2P, e.g. Dorfmeister, Faulknerjev, Flaubertov, Heisenbergov, Balfourjev. This might indicate that not all indicative n-grams have been included as features (e.g. 'ei', 'ou'), possibly for lack of evidence in the original dataset or because they are less frequent and were not included in the initial batch of statistical tests. As the lexicon expands with new entries, the model will be updated with new examples and new features to potentially improve performance.
6 Conclusion

In the paper, we presented the results of an attempt to automatize the assignment of pronunciation types to lemmas in the Sloleks Morphological Lexicon of Slovene. The results show that a model based on a series of mostly n-gram features can provide good results when discriminating between the Slovene G2P and Other G2P categories, with the best performance achieved by the Linear Support Vector Classifier. However, there is still room for improvement, particularly in terms of recall – a number of Other G2P lemmas from the test set were misclassified as Slovene G2P, while those classified correctly were classified with a relatively high precision score. n-grams that are statistically significant as indicative of one class have proven to be useful features for model development, but because they are not evenly distributed and occur sporadically in different lemmas, it would make sense to further improve the model by performing the same statistical analysis (as described in Section 3) on the long tail of less frequent n-grams, to prepare a more comprehensive list of indicative n-grams. The current version of the model is very lightweight, and additional features should not cause it to become overencumbered.

There are several possibilities for further development of the model. Firstly, instead of using relative frequencies of n-grams as features, it would be useful to test how different measures such as TF–IDF, absolute frequencies, or even Boolean values influence the performance of the model, and potentially also to test several other machine learning algorithms (e.g. Random Forest Classifier).
Secondly, while the other pronunciation types from Sloleks 3.0 (acronyms, abbreviations, etc.) are relatively easily identifiable (but much less frequent), in the next step it would be informative to include them in the training set and test the model's performance on the full set of categories. Thirdly, a statistical analysis should be performed on the probabilities with which the model makes decisions and on the degree to which they correlate with the percentage of graphemes that differ from the shallow orthographical Slovene G2P rules (e.g. Anderson, with arguably only 'a' not following Slovene G2P rules, vs. Châteaux, where the majority of graphemes are pronounced completely differently compared to Slovene G2P rules). This would require the preparation of a separate dataset in which graphemes are manually aligned either to the graphemes of their transliterated Slovene graphemic forms (Newyorčan → njújórčan) or to their Slovene IPA transcriptions. By assigning scores that reflect the degree of orthographic depth for the individual lemma, it would be possible to use the dataset to train a regression model.

Similarly, Other G2P lemmas from Sloleks 3.0 can be manually annotated with their language of origin and transliterated according to the recently published transliteration rules of Pravopis 8.0,(11) the new orthographic manual of Slovene, which at the time of writing this paper is still in development. Such a dataset would enable the development of a model for language identification for individual lemmas and, ultimately, a model for automatizing the transliteration of lemmas of foreign origin into their Slovene equivalents. As of now, no such tool yet exists for Slovene, and even the new orthographic manual anticipates that all transliteration will be done manually, which begs the question whether at least part of the work can be automatized. This would be an important step in the development of a modern, digital infrastructure for Slovene orthography, and would facilitate the automatic expansion of modern digital dictionary databases and of datasets for automatic speech recognition.

In addition, although our preliminary experiments with LLMs (ChatGPT 3.5 and 4.0) classifying Slovene G2P and Other G2P lemmas have yielded much worse results than the best performing LinearSVC model, more systematic experiments are warranted.

As part of our future work, we intend to implement the model into Pregibalnik,(12) which is used for automatically extending the lexicon and currently assigns no pronunciation type. The model itself is available under the Apache 2.0 license on GitHub,(13) while the pronunciation type annotations will be included in future versions of Sloleks and, eventually, manually validated.

Acknowledgements

The research presented in this paper was conducted within the research project Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (J7-4642), the research programme Language Resources and Technologies for Slovene (P6-0411), and the CLARIN.SI research infrastructure, all funded by the Slovenian Research and Innovation Agency (ARIS). The author also thanks the anonymous reviewers for their constructive comments.

Footnotes:
(11) Pravopis 8.0: Pravila novega slovenskega pravopisa za javno razpravo. https://pravopis8.fran.si/, 9 August 2024.
(12) Pregibalnik: https://github.com/clarinsi/SloInflector; the entire tool is also available as an API service: https://orodja.cjvt.si/pregibalnik/docs
(13) GitHub: https://github.com/jakacibej/sikdd2024_predicting_pronunciation_types
References

[1] Derek Besner and Marilyn Chapnik Smith. 1992. Basic processes in reading: Is the orthographic depth hypothesis sinking? In Ram Frost and Leonard Katz, editors, Orthography, Phonology, Morphology, and Meaning. Advances in Psychology, Vol. 94. North-Holland, 45–66. doi: 10.1016/S0166-4115(08)62788-0.
[2] Jaka Čibej et al. 2022. Morphological lexicon Sloleks 3.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1745.
[3] Kaja Dobrovoljc, Simon Krek, Peter Holozan, Tomaž Erjavec, Miro Romih, Špela Arhar Holdt, Jaka Čibej, Luka Krsnik, and Marko Robnik-Šikonja. 2019. Morphological lexicon Sloleks 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1230.
[4] Florina Erbeli and Karmen Pižorn. 2012. Reading ability, reading fluency and orthographic skills: The case of L1 Slovene English as a foreign language students. Center for Educational Policy Studies Journal, 2(3), 119–139. https://files.eric.ed.gov/fulltext/EJ1130208.pdf.
[5] Nataša Gliha Komac et al. 2015. Koncept novega razlagalnega slovarja slovenskega knjižnega jezika. Inštitut za slovenski jezik Frana Ramovša ZRC SAZU. https://fran.si/179/novi-slovar-slovenskega-knjiznega-jezika/datoteke/Potrjeni_koncept_NoviSSKJ.pdf.
[6] Simon Krek et al. 2019. Corpus of written standard Slovene Gigafida 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1320.
[7] William H. Kruskal and W. Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621. doi: 10.1080/01621459.1952.10483441.
[8] F. Pedregosa et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[9] Anja Schüppert, Wilbert Heeringa, Jelena Golubovic, and Charlotte Gooskens. 2017. Write as you speak? A cross-linguistic investigation of orthographic transparency in 16 Germanic, Romance and Slavic languages. From Semantics to Dialectometry, 32, 303–313. ISBN: 9781848902305.
[10] Maciej Tomczak and Ewa Tomczak. 2014. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences, 1(21), 19–25.
[11] Antal van den Bosch, Alain Content, Walter Daelemans, and Beatrice de Gelder. 1994. Analysing orthographic depth of different languages using data-oriented algorithms. In Proceedings of the 2nd International Conference on Quantitative Linguistics.


Higher-Order Bibliographic Services based on bibliographic networks

Vladimir Batagelj (IMFM, Ljubljana, Slovenia; IAM and FAMNIT, UP, Koper, Slovenia; vladimir.batagelj@fmf.uni-lj.si), Jan Pisanski (Faculty of Arts, UL, Ljubljana, Slovenia; jan.pisanski@ff.uni-lj.si), Tomaž Pisanski (FAMNIT, UP, Koper, Slovenia; IMFM, Ljubljana, Slovenia; tomaz.pisanski@upr.si)

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.12

Figure 1: The largest co-author groups at level 10 at the University of Primorska until 2024. [network drawing omitted]

Abstract

Bibliographic databases only provide basic services to users, but they could provide much richer information for specific user needs. The main reason for the delay in developing such higher-order bibliographic services is the limited access to data in proprietary databases. We expect that new open bibliographic databases like OpenAlex will encourage faster development of these services. We describe an approach based on a collection of bibliographic networks as a foundation to support the development of higher-order bibliographic services.

Keywords: bibliographic database, open access, network analysis, higher-order bibliographic service, prototype, OpenAlex

1 Introduction

From special bibliographies (BibTeX, EndNote) and bibliographic databases, it is possible to obtain data about works (papers, books, reports, etc.) on selected topics. A typical work description contains the following data: authors, title, publisher/journal, publication year, and pages.
In some sources, additional data are available, including languages, classification of documents, keywords, authors' institution/country affiliation, lists of references, and the abstract. These data can be transformed into a collection of compatible two-mode networks on selected topics [5]: works × authors, works × keywords, works × countries, and other pairs of characteristics describing works. Besides these networks, we can also get the partition of works by their publication years, the partition of works by journals or publishers, the vector of the number of pages, and, in some cases, the (one-mode, works × works) citation network.

When constructing any of these networks, the first task is to specify the nodes and which relations link them. In short, the network boundary problem [16] has to be solved. This includes deciding whether a network is one-mode or two-mode and which node properties are important for the intended analyses. For specifying links, this amounts to answering a series of questions:
(1) Are the links directed?
(2) Are there different types of links (relations) to include?
(3) Can a pair of nodes be linked with multiple links?
(4) What are the weights on the links?
(5) Is the network static, or is it changing through time?

Another problem that often occurs when defining the set of nodes is the identification of nodes. The unit corresponding to a node can have different names (synonymy), or the same name can denote different units (homonymy or ambiguity). For example, in the BibTeX bibliography from the Computational Geometry Database [14] the same author appears under 7 different names: R.S. Drysdale, Robert L. Drysdale, Robert L. Scot Drysdale, R.L. Drysdale, S. Drysdale, R. Drysdale, and R.L.S. Drysdale. Insider information is needed to decide that Otfried Schwarzkopf and Otfried Cheong are the same person. At the other extreme, there are at least 57 different mathematicians with the name Wang, Li in the MathSciNet Database [20]. Its editors have tried hard, since 1985, to resolve the author identification problem during the data-entry phase. The significant growth of contributions by Chinese scientists and their full-name similarity in Roman transcriptions add additional complexity to the problem. In the future, the problem could be eliminated by implementing initiatives such as using ORCID, or by resolving the identification problem in the bibliographic databases themselves (Scopus, OpenAlex).

2 Higher-Order Bibliographic Services

The data collected in different bibliographic databases can be used to provide higher-order bibliographic and bibliometric services such as: what to read (contact/visit)? – a list of relevant articles/books (authors, institutions) on a selected topic; where to publish? – a list of journals suitable for the publication of an article, automatic suggestion of keywords; reviewer selection – a list of reviewers suitable for a submitted article; possible partners for research collaboration; a career application – a candidate's activity report draft; etc. These services target different types of users (students, researchers, teachers, decision-makers, funding agencies, research institutions, database managers, etc.).
To support this goal, we have to use high-quality data, often obtained by combining data from different databases. For the development of higher-order bibliographic and bibliometric services, open bibliographic databases such as OpenAlex are particularly welcome, as the developed services can remain open.

3 OpenAlex

The basic type of unit in a bibliographic database is the work. A user searching the database gets a list of works satisfying the query. Usually, some operations with such lists (inspection, filtering, merging, intersection, statistics, etc.) are supported. Only basic services are provided to users. Some web services also supporting other types of units (authors, institutions, research fields, conferences, etc.) were developed, such as Google Scholar [19], ScholarGPS [12], and DBLP – the computer science bibliography [10].

Our approach is based on OpenAlex [18, 9], but this information can be obtained from most bibliographic databases [13, 11]. OpenAlex indexes more than twice as many scholarly works as the leading proprietary products, and the entirety of the knowledge graph and its source code are openly licensed and freely available through data snapshots, an easy-to-use API, and a nascent user interface.

OpenAlex is based on 7 types of units (entities): W(ork), A(uthor), S(ource), I(nstitution), C(oncept), P(ublisher), and F(under) (and some additional ones such as topics, keywords, countries, continents, languages, etc.). Each unit gets its OpenAlex ID – we assume that the identification problem is solved by the database.

The simplest use of OpenAlex is through its web interface (service) https://openalex.org/ or using a direct URL request in the browser URL line. For example:
• Author's name: search the OpenAlex web service
• Known author ID: URL https://openalex.org/A5001676164
• Work with DOI: URL https://api.openalex.org/works/https://doi.org/10.1007/s11192-012-0940-1
• Known work ID: URL https://openalex.org/W2083084326
• Name of the institution: search the OpenAlex web service
• Known institution ID: URL https://openalex.org/institutions/I4210106342

This way, the OpenAlex web interface provides basic inspections of the selected unit. For example, by including a link with our OpenAlex author ID on our web page, we get a report on our publications. Similarly, we get a report on the publication activity of a selected institution.

3.1 API

An application programming interface (API) is a way for two or more computer programs or components to communicate with each other. It is a type of software interface, offering a service to other pieces of software [21]. In our case, the API enables us to use the database data from our programs. An R package supporting the use of OpenAlex is openalexR [1].

The OpenAlex API is available at https://api.openalex.org. Its response is returned in JSON format. Here is R code using the OpenAlex API to search for the IMFM institution:

setwd(wdir <- "C:/work/OpenAlex/API")
library(httr); library(jsonlite)
res <- GET("https://api.openalex.org/institutions",
  query = list(search="imfm"))
str(res)
cont <- fromJSON(rawToChar(res$content))
names(cont); str(cont)

The response data are available in the variable cont. Similarly, the API can also be used from other programming languages.

The OpenAlex query can be composed of different components. Using search, we can search for a given text across titles, abstracts, and full text. Using filter, we can limit our search to units satisfying given conditions. Using select, we can choose the data fields that will appear in the results. The query can be further controlled by some parameters. For example,

wd <- GET("https://api.openalex.org/works",
  query = list(
    search="handball",
    filter="publication_year:2015",
    select="id,title",
    page="2", per_page="200"))
names(wd)
wc <- fromJSON(rawToChar(wd$content)); names(wc)
names(wc$meta); wc$meta$count; str(wc$results)

returns the second page (with up to 200 entries) of works on handball published in the year 2015. Only information about the works' ID and title is returned.

The OpenAlex API uses paging – the list data are provided by pages. The basic paging (up to 10,000 units) is based on two parameters (page and per_page). Cursor paging is a bit more complicated than basic paging, but it allows us to access as many records as we like.
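For completeness, here is a Python sketch of cursor paging (the paper's own examples use R/httr; the requests library, the helper name fetch_all_works, and the loop structure are illustrative assumptions based on the OpenAlex API documentation, where the first request passes cursor=* and each response carries meta.next_cursor):

import requests

def fetch_all_works(params):
    """Hypothetical helper: follow next_cursor until the list is exhausted."""
    url = "https://api.openalex.org/works"
    cursor, results = "*", []
    while cursor:
        r = requests.get(url, params={**params, "per-page": 200, "cursor": cursor})
        data = r.json()
        results.extend(data["results"])
        cursor = data["meta"].get("next_cursor")  # None/absent on the last page
    return results

works = fetch_all_works({"search": "handball",
                         "filter": "publication_year:2015",
                         "select": "id,title"})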
4 A collection of bibliographic networks

We developed an R package OpenAlex2Pajek to support the creation of bibliographic networks from OpenAlex [4]. We get a collection of bibliographic networks (citation network Cite, authorship network WA, sources network WJ, keywords network WK, countries network WC), some partitions and vectors (properties of nodes: publication year, type of publication, language of publication, cited-by count, countries distinct count, referenced works), and additionally two files containing the names of works (xyzW.nam) and the names of authors (xyzA.nam). Most acquired networks are 2-mode – they link units of two different types; an ordinary or 1-mode network links units of the same type.

Currently, OpenAlex2Pajek contains three main functions: OpenAlex2PajekCite, OpenAlex2PajekAll, and coAuthorship. We split the process of creating the collection of bibliographic networks into two parts:
• determining the set W of relevant works using the saturation approach [7, page 506],
• creation of the network collection for the works from W.
The set W is determined iteratively using the function OpenAlex2PajekCite, and the collection is finally created using the function OpenAlex2PajekAll.

The function coAuthorship creates a weighted temporal network describing the co-authorship between world countries in selected time intervals. The weight of an edge is the number of works co-authored by authors from the linked countries.

In an analysis of weighted networks, the 1-neighbor skeleton is often used to get an overall insight into the network's basic structure. In the 1-neighbor skeleton, only the strongest link is kept for each node. The resulting directed network is forest-like. Non-trivial connected components in 1-neighbor skeletons are (usually) directed trees with a pair of nodes linked in both directions with the largest weight in the tree – these two arcs are usually replaced by an edge (undirected link); a small sketch of this construction follows below. In Figure 2, the 1-neighbor skeletons for the years 1990, 1995, 2000, 2010, 2015, and 2020 are presented.

Figure 2: 1-neighbor skeletons of world co-authorship for selected years. [six Pajek-drawn maps omitted; node labels are ISO country codes]
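A minimal sketch of the 1-neighbor skeleton construction, using Python/networkx rather than the authors' Pajek/R toolchain; the helper name and the toy country data are illustrative:

import networkx as nx

def one_neighbor_skeleton(G: nx.Graph) -> nx.DiGraph:
    """For each node, keep only its strongest incident link (weight attribute).
    Mutual strongest pairs end up linked by two opposite arcs, which the
    paper draws as a single undirected edge."""
    S = nx.DiGraph()
    S.add_nodes_from(G.nodes)
    for u in G.nodes:
        nbrs = G[u]
        if nbrs:  # isolated nodes keep no links
            v = max(nbrs, key=lambda x: nbrs[x].get("weight", 1))
            S.add_edge(u, v, weight=nbrs[v].get("weight", 1))
    return S

# Toy co-authorship weights between country codes
G = nx.Graph()
G.add_weighted_edges_from([("US", "GB", 120), ("US", "CN", 95),
                           ("GB", "FR", 30), ("FR", "DE", 25)])
S = one_neighbor_skeleton(G)  # arcs: US->GB, GB->US (mutual pair), CN->US, FR->GB, DE->FR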
We see that the number of isolated nodes (countries not collaborating with other countries) is decreasing. In all analyzed years the US has a leading (hub) position. In the years 1990, 1995, 2000, and 2010 the edge in the main component links US and GB, but in the years 2015 and 2020 GB is replaced by CN. In 1990, stronger secondary hubs were GB, FR, RU, JP, and DE. In the following years, some other countries – SE, ES, AU, CN, BR, ZA, and IN (BRICS) – became secondary hubs, attracting previously non-collaborating countries or geographically or linguistically close countries.

An important property of a collection of bibliographic networks is that some of them are compatible – they share a common set (most often the set of works W). This allows us to use network multiplication (defined by the product of network matrices) to compute the corresponding derived network connecting the remaining two sets [5]. For example, in the derived network AK = WA^T · WK, the entry AK[a, k] tells us in how many works the author a used the keyword k.

In bibliometric analysis, the citation network Cite has a very important role. It collects "votes" about the relevance of previous works for a given work. It is often used for solving the network boundary problem, and also for identifying the most relevant works in the collected bibliography [2, 6]. The derived network ACiA = WA^T · Cite · WA describes the citations between authors – its entry ACiA[a, b] counts the number of times author a cited author b. Similarly, in the derived network ACiK = WA^T · Cite · WK, the entry ACiK[a, k] tells us how many times the author a cited works described by the keyword k. The co-citation network is defined as the column projection of the citation network, coCi = col(Ci) = Ci^T · Ci, and the bibliographic coupling network as its row projection, biCo = row(Ci) = Ci · Ci^T.

A 2-mode network is always compatible with its transpose (on both sets). The corresponding derived networks are called projections – the row projection row(WA) = WA · WA^T and the column projection col(WA) = WA^T · WA. Both projections are ordinary weighted 1-mode networks that can be analyzed using standard network analysis methods.

For the authorship network WA, its column projection Co = WA^T · WA is the co-authorship network. Its entry Co[a, b] counts the number of works that authors a and b co-authored. It turns out that a work with k co-authors contributes k² links to the co-authorship network – works with a large number of co-authors are overrepresented in it. To treat all authors equally, the fractional approach is used [3]. In Figure 1, the largest co-authorship groups at level 10 at the University of Primorska are presented – connected components of the link cut at level 10 in the network Co. Each pair of linked authors co-authored at least 10 works in the bibliography of works with at least one co-author from the University of Primorska.

The idea of derived networks can be extended to temporal bibliographic networks [8]. Using derived networks, we enlarge the source for different statistics. Additional insight can be gained by analyzing the structure of networks and identifying important subnetworks in them [6]. A sketch of the network multiplications described above is given below.

In the following, we present an overview of typical report ingredients [7, 15]. Because of the limited available space, we decided to put the examples on Github/bavla.
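The network multiplications above map directly onto sparse-matrix products. A minimal Python sketch with scipy.sparse on toy matrices (illustrative only; the authors' pipeline uses Pajek and R):

import numpy as np
from scipy.sparse import csr_matrix

# Toy two-mode matrices: rows = works, columns = authors / keywords.
WA = csr_matrix(np.array([[1, 1, 0],    # work 0 by authors 0 and 1
                          [0, 1, 1],    # work 1 by authors 1 and 2
                          [1, 0, 0]]))  # work 2 by author 0
WK = csr_matrix(np.array([[1, 0],       # work 0 tagged with keyword 0
                          [1, 1],
                          [0, 1]]))
Cite = csr_matrix(np.array([[0, 1, 0],  # Cite[i, j] = 1: work i cites work j
                            [0, 0, 1],
                            [0, 0, 0]]))

AK = WA.T @ WK          # AK[a, k]: in how many works author a used keyword k
Co = WA.T @ WA          # column projection col(WA): co-authorship network
biCo = Cite @ Cite.T    # row projection of Cite: bibliographic coupling
ACiA = WA.T @ Cite @ WA # citations between authors

print(AK.toarray()); print(Co.toarray())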
5 Report ingredients

Most of the ingredients of basic reports are counters, sorted lists, and (weighted) degrees and their distributions, obtained from an adequate network. Sometimes also time is considered, producing time series.

5.1 Statistics

Because the analyzed networks are often large, a complete presentation is not an option. To describe them, we use different statistical descriptors:
• sizes of sets (number of nodes, number of links) and structural network properties (number of components, size of the largest component, etc.);
• top units – ordered lists of units with the largest values of a selected property (degree, weighted degree, link weight, etc.);
• distribution of a selected property;
• time series describing temporal changes of selected properties;
• scatter plots showing a possible relationship between two selected properties.

Often, bibliometric properties of units follow laws such as the Zipf (or power) law, Bradford law, Lotka law, the lognormal distribution, the Hirsch index, etc.

5.2 Network analysis

Derived networks are weighted. To get readable results of reasonable size, we usually search for important subnetworks, often a kind of skeleton – from a given network, less important elements are removed. There are different types of skeletons (spanning forest, k closest neighbors, cuts, cores, islands, etc. [6]).

A traditional graph-based visualization is used if the obtained result network is not dense. For denser networks, the matrix display is much more readable. In a matrix display, the permutation of nodes (usually obtained by clustering) can create patterns that reveal the network's internal structure. Figure 3 presents a matrix display of Balassa co-authorship indices between European countries in 2022 (yellow cell – no link; red/blue cell – above/below expectation) [17].

Figure 3: Balassa EU co-authorship for the year 2022. [matrix display omitted; rows/columns are European country codes, ordered by Ward clustering]

5.3 Special algorithms

Some properties can require special computational procedures and direct access to the bibliographic data. In such cases, open access to the bibliographic database is of crucial importance.

5.4 Reports

The results of analyses can be combined and presented to users in different forms:
• booklet report (in PDF);
• (service-generated) web pages;
• dashboards;
• dataset (JSON, CSV, etc.).

6 Conclusions

We have presented an approach to support higher-order bibliographic services based on networks. Open access to high-quality bibliographic data is crucial for the faster development of such services. The new bibliographic database OpenAlex seems to be a step in the right direction. It needs the support of science policy and also of individual scientists (checking the correctness of their data).

Acknowledgements

The computational work reported in this paper was performed using a collection of R functions OpenAlex2Pajek and the program Pajek for the analysis of large networks. Code, data, and figures are available on Github/Bavla/OpenAlex.

VB's work is partly supported by the Slovenian Research Agency ARIS (research program P1-0294, research program CogniCom (0013103) at the University of Primorska, and research projects J1-2481, J5-2557, and J5-4596), and was prepared within the framework of the COST action CA21163 (HiTEc). JP's work is partly supported by ARIS (research program P5-0361 and research projects J1-2551 and J5-4596). TP's work is partly supported by ARIS (research program P1-0294 and research projects N1-0140, J1-2481, J5-4596).

References

[1] Massimo Aria, Trang Le, Corrado Cuccurullo, Alessandra Belfiore, and June Choe. 2024. openalexR: An R-tool for collecting bibliometric data from OpenAlex. The R Journal, 15(4), 167–180.
[2] Vladimir Batagelj. 2003. Efficient algorithms for citation network analysis. arXiv preprint cs/0309023.
[3] Vladimir Batagelj. 2020. On fractional approach to analysis of linked networks. Scientometrics, 123(2), 621–633. doi: 10.1007/s11192-020-03383-y.
[4] Vladimir Batagelj. 2024. OpenAlex2Pajek. Version 4, June 18, 2024. https://github.com/bavla/OpenAlex/tree/main/code.
[5] Vladimir Batagelj and Monika Cerinšek. 2013. On bibliographic networks. Scientometrics, 96(3), 845–864. doi: 10.1007/s11192-012-0940-1.
[6] Vladimir Batagelj, Patrick Doreian, Anuška Ferligoj, and Nataša Kejžar. 2014. Understanding Large Temporal Networks and Spatial Networks: Exploration, Pattern Searching, Visualization and Network Evolution. Wiley Series in Computational and Quantitative Social Science. Wiley, Chichester. ISBN: 978-1-118-91537-0, 978-0-470-71452-2. doi: 10.1002/9781118915370.
[7] Vladimir Batagelj, Anuška Ferligoj, and Flaminio Squazzoni. 2017. The emergence of a field: A network analysis of research on peer review. Scientometrics, 113(1), 503–532. doi: 10.1007/s11192-017-2522-8.
[8] Vladimir Batagelj and Daria Maltseva. 2020. Temporal bibliographic networks. Journal of Informetrics, 14(1), Article 101006. doi: 10.1016/j.joi.2020.101006.
[9] Dalmeet Singh Chawla. 2022. Massive open index of scholarly papers launches. Nature.
[10] DBLP – computer science bibliography. 2024. https://dblp.org/.
[11] Lorena Delgado-Quirós and José Luis Ortega. 2024. Completeness degree of publication metadata in eight free-access scholarly databases. Quantitative Science Studies, 5(1), 31–49.
[12] ScholarGPS. 2024. https://scholargps.com/.
[13] Chenyue Jiao, Kai Li, and Zhichao Fang. 2023. How are exclusively data journals indexed in major scholarly databases? An examination of four databases. Scientific Data, 10(1), 737.
[14] Bill Jones. 2002. Computational geometry database. ftp://ftp.cs.usask.ca/pub/geometry/.
[15] Daria Maltseva and Vladimir Batagelj. 2019. Social network analysis as a field of invasions: Bibliographic approach to study SNA development. Scientometrics, 121(2), 1085–1128. doi: 10.1007/s11192-019-03193-x.
[16] Peter V. Marsden. 1990. Network data and measurement. Annual Review of Sociology, 16, 435–463. doi: 10.1146/annurev.so.16.080190.002251.
[17] Nataliya Matveeva, Vladimir Batagelj, and Anuška Ferligoj. 2023. Scientific collaboration of post-Soviet countries: The effects of different network normalizations. Scientometrics, 128(8), 4219–4242.
[18] Jason Priem, Heather Piwowar, and Richard Orr. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833.
[19] Google Scholar. 2024. https://scholar.google.com/.
[20] Bert TePaske-King and Norman Richert. 2001. The identification of authors in the Mathematical Reviews database. Issues in Science and Technology Librarianship, 31. doi: 10.5062/f4kh0k9m.
[21] Wikipedia. 2024. API. August 22, 2024. https://en.wikipedia.org/wiki/API.


Are papers all that counts? A bibliometric analysis of the Slovenian scientific community

Aymeric Dupuis (Jožef Stefan Institute, Ljubljana, Slovenia; aymeric.dupuis@etu.univ-nantes.fr), Sašo Džeroski (Jožef Stefan Institute, Ljubljana, Slovenia; saso.dzeroski@ijs.si), Boshko Koloski (Jožef Stefan Institute, Ljubljana, Slovenia; boshko.koloski@ijs.si), Matej Martinc (Jožef Stefan Institute, Ljubljana, Slovenia; matej.martinc@ijs.si)

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.11

Abstract

We conduct a bibliometric analysis of Slovenian science by scraping the data from the Slovenian Current Research Information System (SICRIS) and using it to build a knowledge graph representing a network of all Slovenian scientific fields and a large majority of Slovenian researchers. By analyzing this network using different graph measures, we obtain valuable insights into the connections between different scientific fields and researchers in Slovenian science. Additionally, we show the importance of graph measures as measures of scientific excellence, since they capture very different aspects of scientific success than the traditional citation metrics.

Keywords: bibliometrics, Slovenian scientific community, knowledge graphs

1 Introduction

With the growth and diversification of the scientific enterprise, obtaining empirical evidence on the research process is crucial for enhancing its efficiency and reliability. Meta-research and bibliometrics are developing scientific disciplines seeking to analyse, evaluate and refine research practices, and several studies have focused on the analysis of the global scientific endeavour, e.g., identifying the most prominent scientists and fields [7]. These studies also address the problem of how to properly rank scientific excellence and scientific outputs in general, warning that one should not rely on just a few metrics to obtain a comprehensive picture of the actual impact a specific scientist has [8].
Until now, very few studies have tackled the analysis of scientific ventures at the national level, and to our knowledge, there has been no study covering the Slovenian scientific landscape specifically. This kind of research is nevertheless important and could potentially influence policies that would improve scientific production and enable effective distribution of research funds and resources.

In this study, we try to address the identified research gaps by 1.) drawing a map of Slovenian scientific research that would enable proper decision making and policy formulation, and 2.) proposing new metrics of scientific excellence that would allow us to obtain a more complete view of the impact a scientist or a discipline as a whole has. More specifically, our contributions are the following:

• Using the collected data about Slovenian scientists and their projects, covering different scientific fields and a large majority of researchers working in Slovenian science, we conduct a graph analysis of connections between different fields and researchers. By drawing a comprehensive map of connections between actors and fields, we identify the most important researchers and scientific fields that connect others and play a vital role in the Slovenian scientific ecosystem.
• We created a new ranked list of Slovenian scientists according to graph-based metrics, which were not available in any of the previous analyses or databases. We argue that these metrics measure the importance of the role that a specific scientist has in a research community, i.e., their influence, which allows them to act as a bridge or a hub connecting scientists from different fields.

2 Related work

Studies in bibliometrics (see [4] for a comprehensive survey of techniques used for measuring scientific excellence) have recently gained traction in parallel with the success of the scientific enterprise, which has grown in both size and diversity, and with the availability of data. According to Ioannidis et al. [7], research on research is becoming important due to the mounting evidence suggesting an alarming drop in the reproducibility of research findings, the growing inefficiency of the scientific process, and the fact that the number of false positives in the literature is exceedingly high. To address these problems, they propose meta-research divided into five main categories that should be studied: methods, reporting, reproducibility, evaluation, and incentives. Studying these five areas would correspondingly allow for five distinct insights into how to perform, communicate, verify, evaluate, and reward research.

Recently, several studies have also tackled the problem of how to properly rank scientists and scientific outputs in general. For example, Ioannidis et al. [8] addressed the increasing prevalence of multiauthorship observed in several fields and how this phenomenon affects the informativeness of citation metrics. They also explored how sensitive the indicators are to self-citation and the alphabetic ordering of authors. They concluded that multiple indicators should be used for ranking, as a composite of different metrics gives a more comprehensive picture of the actual impact that a specific scientist has. They also acknowledged that no single or composite citation indicator can be expected to select all the best scientists.
Several studies have employed graph-based metrics to enrich bibliometric analysis [4, 1]. Network metrics such as degree centrality, betweenness centrality, eigenvector centrality, closeness centrality, and PageRank were used to pinpoint the relative importance of research constituents (i.e., researchers and institutions), which may not necessarily be reflected just through publications. In the large majority of cases, these metrics were calculated on co-authorship graphs.

Studies covering the Slovenian scientific environment are very scarce. In fact, we are aware of just one, the study by [2], which claims that research performance is highly dependent on the conditions of (national) research environments. The authors focus on analyzing research activity in six eastern European countries, namely Croatia, Estonia, Hungary, Latvia, Lithuania, and Slovenia, and try to determine and compare the effectiveness of research in a specific country by obtaining the number of articles belonging to the most cited 10% and the most cited 1% of articles in the corresponding subject area and publication year for each country. Their empirical analysis addresses three levels: cross-country, cross-institution, and cross-researcher comparison. The study concludes that Hungary is the country with the highest output, followed by Croatia and then Slovenia, when it comes to the number of influential articles published.

3 Methodology

In this section, we describe our methodology, namely 1.) how we gather the data and 2.) how we analyze these data to obtain a map of the Slovenian scientific community.

3.1 Data Retrieval

Data were retrieved from the Slovenian Current Research Information System (SICRIS) website,(1) which lists more than 35,000 researchers working in Slovenian research institutions. Data collection from the SICRIS website proved challenging, as information about a specific researcher can only be obtained by scraping his/her web page on SICRIS. This required finding a solution to quickly retrieve data from more than 35,000 different pages; to achieve this, we used the Python Asyncio(2) and BeautifulSoup(3) libraries, which allow asynchronous connection to several dozen pages simultaneously and extraction of the required data.

Since the script sometimes took several seconds to connect to a specific page, which could quickly accumulate into considerable overall slowdowns, we optimized the procedure and identified potential bottlenecks. Our solution was to implement a strategy that involved canceling the connection and adding the URL to a list whenever a page failed to connect within a 0.5-second time frame. This timeframe was chosen after several trials and was found to be the best compromise. Once all pages had been visited, we repeatedly tried to reconnect to the URLs on this list until it was empty. This change significantly reduced the time required to retrieve all our data. Once all the data was retrieved, we used the Pandas(4) library for data manipulation, which allowed us to export the results into Excel spreadsheets, appropriate for further processing. A sketch of this retrieval strategy is given below.

From SICRIS, we extracted the research areas for each scientist and various bibliometric indicators of their impact, namely A'', A', and A1/2, citation metrics based on a quantitative assessment of publications in exceptional, high-quality, and important venues, respectively. We also extracted the A1 metric, which represents a weighted sum of these three metrics; the CI10 metric, measuring the number of pure citations of scientific work in the last 10 years; the CImax metric, measuring the number of citations of the most cited work; and the h10 metric, representing the h-index in the last ten years. Furthermore, we extracted the SICRIS points, a conglomerate metric combining several distinct metrics mentioned above, and the A3 metric, which measures the amount of funds a specific researcher received for their research activity outside of the Slovenian National Research Agency (ARIS).

Finally, the SICRIS database also contains information on projects financed by the Slovenian national research agency in which a specific researcher participated. Scraping this information provided us with an important insight into collaborations between different scientists and fields, allowing us to build collaboration graphs, calculate several graph-based ranking criteria, and draw a map of the Slovenian scientific community.

Footnotes:
(1) https://cris.cobiss.net/ecris/si/en
(2) https://docs.python.org/3/library/asyncio.html
(3) https://www.crummy.com/software/BeautifulSoup
(4) https://pandas.pydata.org/
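A rough Python sketch of the retrieval strategy described above. The paper names only Asyncio and BeautifulSoup; the HTTP client (aiohttp), the helper names, and the example URL are assumptions.

import asyncio
import aiohttp  # assumption: the HTTP client is not named in the paper
from bs4 import BeautifulSoup

TIMEOUT = 0.5  # the 0.5-second cutoff described above

async def fetch(session, url, retry_queue):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=TIMEOUT)) as resp:
            html = await resp.text()
            return url, BeautifulSoup(html, "html.parser")
    except asyncio.TimeoutError:
        retry_queue.append(url)  # postpone slow pages instead of waiting
        return url, None

async def scrape(urls):
    results, retry_queue = {}, list(urls)
    async with aiohttp.ClientSession() as session:
        # Like the paper's procedure, retry postponed URLs until none remain
        # (assumes every page eventually responds within the cutoff).
        while retry_queue:
            batch, retry_queue = retry_queue, []
            pages = await asyncio.gather(*(fetch(session, u, retry_queue) for u in batch))
            results.update({u: soup for u, soup in pages if soup is not None})
    return results

# results = asyncio.run(scrape(researcher_urls))  # researcher_urls: list of SICRIS pages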
3.2 Methods

Once the data were obtained, we conducted two distinct analysis steps, namely 1.) graph construction and analysis, and 2.) correlation analysis.

3.2.1 Graph construction and analysis. To construct the necessary graphs, we used the Python NetworkX library [6].(5) Using the data from SICRIS, which contain information about project collaboration, we created an undirected graph as follows: all researchers who participated in at least one project are represented by a node, and the nodes of researchers who worked together on a project are connected by weighted edges, in which the weights represent the number of shared projects. By removing the isolated nodes, we ended up with a graph with a total of 20,012 nodes and 618,871 edges.

In the next step, we apply several graph statistics and measures in order to obtain several node rankings, each of them measuring a different aspect of the importance a specific node (i.e., a researcher) has in the graph. More specifically, we calculate the PageRank (PR), Betweenness centrality (BC), and Eigenvector centrality (EC) measures.

In the context of our graph, the PageRank [3] algorithm is applied to evaluate the influence of researchers within the collaboration network. Researchers who are strongly connected to other researchers who themselves have many connections (i.e., the so-called hubs in the graph) will have a higher PR score, reflecting their importance and influence in the Slovenian research community. The Betweenness centrality [5] measure, on the other hand, evaluates the role of each researcher as an intermediary or a bridge between other researchers. This measure is based on the idea that researchers who lie on many collaboration paths between other researchers are considered central and influential in the network. In our context, it helps to better understand the structure of the collaboration network among researchers: researchers with high BC are those who play a crucial role in creating links between different subgroups of researchers and in interdisciplinary connections. In practical terms, BC evaluates the number of times a researcher is traversed by the shortest paths connecting other researchers in the network. Thus, researchers who are frequently used as pathways for collaboration among their peers obtain higher BC scores.

Another graph centrality measure that we applied to the created graph is Eigenvector centrality [9]. This measure evaluates the influence of a researcher taking into account both the quality and the quantity of connections. EC assigns more weight to connections that involve influential researchers. Thus, a researcher connected to influential researchers will be assigned a high score, reflecting potentially greater influence within the network. This measure helps to detect researchers who, even with fewer direct connections, occupy strategic positions in the collaboration network. While this may seem similar to the PR algorithm, there are some differences: unlike PR, which primarily focuses on the popularity of links, Eigenvector centrality also takes into account the quality of connections. This means that even if a researcher does not have a large number of direct connections, their Eigenvector centrality score can be high if they are connected to influential researchers. In summary, while these measures all aim to evaluate the influence of researchers in a network, they do so through slightly different approaches, thus offering complementary perspectives for analyzing the structure and importance of actors within the collaboration network. The construction and the centrality computations are sketched below.

The second important area of focus in our research is the collaboration between different fields. To build a graph representing interdisciplinary collaboration between fields, we grouped all researchers from the same field into a single node representing the entire field, i.e., we obtain a node for each scientific field found on SICRIS. Similar to the previous graph, edges and their weights represent collaborations on a project between researchers in the linked fields.

Footnote:
(5) https://networkx.org/
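A minimal NetworkX sketch of the construction and rankings just described (NetworkX is confirmed by the paper; the toy edge list, the helper, and the exact weighting settings are assumptions — in particular, the paper does not say how edge weights enter each centrality, and NetworkX interprets the weight attribute as a distance in shortest-path-based measures such as betweenness):

import networkx as nx

collaborations = [  # (researcher_a, researcher_b, shared_projects) - toy data
    ("R1", "R2", 3), ("R2", "R3", 1), ("R1", "R3", 2), ("R3", "R4", 5),
]
G = nx.Graph()
G.add_weighted_edges_from(collaborations)
G.remove_nodes_from(list(nx.isolates(G)))  # drop researchers without collaborations

pr = nx.pagerank(G, weight="weight")
bc = nx.betweenness_centrality(G, weight="weight")
ec = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

def ranks(scores):
    """Convert scores to ranks (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {node: i + 1 for i, node in enumerate(ordered)}

# Average of the three ranks, as used for the ordering in Table 1.
pr_r, bc_r, ec_r = ranks(pr), ranks(bc), ranks(ec)
avg_rank = {n: (pr_r[n] + bc_r[n] + ec_r[n]) / 3 for n in G}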
Another graph centrality measure that we applied to the created graph is the Eigenvector centrality [9]. This measure evaluates the influence of a researcher taking into account both the quality and the quantity of connections: EC assigns more weight to connections that involve influential researchers. Thus, a researcher connected to influential researchers will be assigned a high score, reflecting potentially greater influence within the network. This measure helps to detect researchers who, even with fewer direct connections, occupy strategic positions in the collaboration network. While this may seem similar to the PR algorithm, there are some differences. Unlike PR, which primarily focuses on the popularity of links, Eigenvector centrality also takes into account the quality of connections: even if a researcher does not have a large number of direct connections, their Eigenvector centrality score can be high if they are connected to influential researchers. In summary, while both measures aim to evaluate the influence of researchers in a network, they do so through slightly different approaches, thus offering complementary perspectives for analyzing the structure and importance of actors within the collaboration network.

The second important area of focus in our research is the collaboration between different fields. To build a graph representing interdisciplinary collaboration between fields, we grouped all researchers from the same field into a single node representing the entire field, i.e., we obtained one node for each scientific field found on SICRIS. As in the previous graph, edges and their weights represent project collaborations between researchers in the linked fields.

3.2.2 Correlation analysis. In order to better understand the metrics from SICRIS and to evaluate the relevance of our scores, we explored the correlations across all our data. This analysis has two main purposes. First, we aim to test hypothesis 1: that the new graph rankings we presented measure different aspects of scientific excellence than the more established measures, based on numbers of citations or publications, available on the SICRIS web page. This hypothesis would be deemed correct if the one-on-one correlation scores between the newly proposed graph measures and the other measures were low, and incorrect if the correlations were high.

Additionally, we wish to explore the correlations among the established measures available on the SICRIS web page. More specifically, we wish to test hypothesis 2: that these measures are strongly correlated, which would indicate that they essentially all measure a very similar aspect of scientific excellence, which is problematic. In order to obtain one-on-one correlations between all measures, we calculate the Spearman correlation coefficient among all of them and then display it through a heatmap.
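A compact sketch of this step, assuming the metrics have been collected into a pandas DataFrame with one row per researcher (the column names are illustrative):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# metrics: one row per researcher; SICRIS indicators plus the graph-based scores
cols = ["A''", "A'", "A1/2", "A1", "SICRIS", "CI10", "CImax", "h10", "PR", "BC", "EC"]
corr = metrics[cols].corr(method="spearman")   # rank-based, robust to skewed scales

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Spearman correlation among metrics")
plt.tight_layout()
plt.show()
```

Spearman correlation compares ranks rather than raw values, which suits a comparison of heavy-tailed bibliometric scores against rank-based graph measures.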
4 Results

In Table 1, we present some of the results of the graph analysis conducted on the graph of nodes representing researchers, connected by edges representing project collaborations. More specifically, we present the 10 best ranked researchers in the SICRIS dataset according to the average of the ranks of the three newly proposed graph-based measures, their declared scientific fields, and their ranking (i.e., lower is better) according to the SICRIS points, BC, EC and PR measures.

Note that while the table does contain some highly ranked researchers according to the SICRIS points (e.g., Dr. Sašo Džeroski is ranked 33rd out of roughly 20K researchers according to this criterion), several researchers in the table are ranked relatively low according to SICRIS points (e.g., the best ranked researcher according to our three novel measures, Dr. Branimir Leskošek, is ranked 5731st according to the SICRIS points). This finding supports hypothesis 1, that the proposed new measures capture different aspects of scientific excellence than the more established citation measures. Another important observation is that 7 out of the 10 best ranked scientists appear to be active in two fields. This might suggest that they are (or have been) involved in several interdisciplinary projects, which could have a positive influence on the newly proposed graph-based metrics.

In Figure 1, we present the heatmap of the correlations between the different metrics extracted from the SICRIS website and the newly proposed graph-based metrics. We observe a strong correlation between PR and BC, 0.7, which might suggest that researchers who collaborate with a wide range of colleagues from different fields are more likely to work with the most important ones.

Figure 1: Heatmap of the Spearman correlation among metrics.

We also observe very strong correlations in the top left corner of the heatmap. While a strong correlation was expected, as A", A', A1/2 and A1 are all scores based on the number of publications (in venues of different quality), the almost perfect correlation between the SICRIS points and A1 (which suggests they measure exactly the same aspect of scientific impact) is surprising. This finding supports hypothesis 2, that the current SICRIS measures all capture a very similar aspect of scientific excellence. On the other hand, there is no strong correlation between any of the newly proposed graph-based metrics and the metrics extracted from the SICRIS website.

In Table 2, we present the results of our study of interdisciplinary collaboration between different scientific fields. The graph metrics were obtained from a graph of nodes representing fields and edges representing interdisciplinary project collaborations. Note that the field of Computer science and informatics ranks first according to all the criteria. On the other hand, most interdisciplinary collaborations are conducted by researchers from the field of Chemistry, which ranked third according to the average (AVG) of the ranks of the three graph-based metrics, PR, BC and EC.
Table 1: The 10 best ranked researchers in the SICRIS dataset according to the average of the ranks of the three newly proposed measures, BC, EC and PR. We do not show metric scores, but ranks according to scores (i.e., a lower value is better).

ID    | Researcher             | Field 1                              | Field 2                           | SICRIS points | BC | EC  | PR  | AVG
15355 | PhD Branimir Leskošek  | Public health (occupational safety)  | Computer science and informatics  | 5731 | 8  | 4   | 31  | 14
06013 | PhD Damjana Rozman     | Biochemistry and molecular biology   | Metabolic and hormonal disorders  | 704  | 21 | 2   | 33  | 18
11279 | PhD Nives Ogrinc       | Control and care of the environment  | Animal production                 | 182  | 7  | 50  | 3   | 20
27733 | PhD Tina Kosjek        | Control and care of the environment  | Pharmacy                          | 809  | 2  | 73  | 9   | 28
22459 | PhD Tadeja Režen       | Neurobiology                         | Microbiology and immunology       | 1837 | 61 | 3   | 49  | 37
22621 | PhD Polonca Ferk       | Metabolic and hormonal disorders     | Pharmacy                          | 5059 | 13 | 8   | 103 | 41
12688 | PhD Kristina Gruden    | Biotechnology                        | /                                 | 219  | 44 | 139 | 6   | 63
08800 | PhD Gregor Serša       | Oncology                             | /                                 | 71   | 3  | 185 | 1   | 63
12315 | PhD Ester Heath        | Control and care of the environment  | Chemistry                         | 208  | 62 | 115 | 23  | 66
11130 | PhD Sašo Džeroski      | Computer science and informatics     | /                                 | 33   | 1  | 195 | 20  | 72

Table 2: Scientific fields as defined in the SICRIS database, sorted by the average (AVG) of the ranks (lower is better) of the three graph-based metrics, PR, EC and BC.

Rank | Field | Collaborations | PR | EC | BC | AVG
1  | Computer science and informatics | 81248 | 1 | 1 | 1 | 1.0
2  | Materials science and technology | 88934 | 4 | 3 | 4 | 3.67
3  | Chemistry | 101139 | 2 | 2 | 12 | 5.33
4  | Control and care of the environment | 52648 | 5 | 8 | 9 | 7.33
5  | Physics | 50010 | 3 | 9 | 14 | 8.67
6  | Plant production | 74535 | 6 | 6 | 16 | 9.33
7  | Systems and cybernetics | 45584 | 7 | 10 | 23 | 13.33
8  | Biology | 58879 | 12 | 7 | 21 | 13.33
9  | Civil engineering | 36466 | 22 | 13 | 6 | 13.67
10 | Biochemistry and molecular biology | 79725 | 11 | 5 | 25 | 13.67
11 | Neurobiology | 45680 | 14 | 12 | 19 | 15.0
12 | Biotechnology | 87261 | 8 | 4 | 33 | 15.0
13 | Interdisciplinary research | 22946 | 9 | 33 | 5 | 15.67
14 | Public health (occupational safety) | 30400 | 10 | 25 | 13 | 16.0
15 | Educational studies | 23518 | 33 | 15 | 3 | 17.0
16 | Mathematics | 30680 | 17 | 20 | 20 | 19.0
17 | Manufacturing technologies and systems | 38874 | 18 | 14 | 26 | 19.33
18 | Forestry, wood and paper technology | 30620 | 19 | 28 | 15 | 20.67
19 | Geography | 18555 | 39 | 23 | 2 | 21.33
20 | Economics | 26891 | 31 | 16 | 18 | 21.67
21 | Microbiology and immunology | 54175 | 16 | 11 | 42 | 23.0
22 | Sociology | 19922 | 44 | 17 | 10 | 23.67
23 | Pharmacy | 41125 | 15 | 18 | 41 | 24.67
24 | Linguistics | 18176 | 49 | 19 | 7 | 25.0
25 | Chemical engineering | 33753 | 13 | 27 | 38 | 26.0
26 | Energy engineering | 32762 | 23 | 21 | 40 | 28.0
27 | Computer intensive methods and applications | 26942 | 20 | 32 | 34 | 28.67
28 | Mechanics | 26444 | 24 | 31 | 36 | 30.33
29 | Oncology | 37101 | 21 | 24 | 46 | 30.33
30 | Geology | 26961 | 37 | 26 | 28 | 30.33
31 | Electronic components and technologies | 28858 | 26 | 30 | 37 | 31.0
32 | Historiography | 12390 | 56 | 22 | 17 | 31.67
33 | Urbanism | 8669 | 50 | 40 | 8 | 32.67
34 | Mechanical design | 22352 | 25 | 38 | 35 | 32.67
35 | Administrative and organisational sciences | 18563 | 38 | 35 | 30 | 34.33
36 | Textile and leather | 21080 | 27 | 41 | 39 | 35.67
37 | Animal production | 34982 | 29 | 29 | 50 | 36.0
38 | Political science | 13598 | 46 | 37 | 27 | 36.67
39 | Anthropology | 9860 | 53 | 36 | 24 | 37.67
40 | Ethnology | 6698 | 65 | 39 | 11 | 38.33
41 | Cardiovascular system | 20793 | 28 | 43 | 45 | 38.67
42 | Telecommunications | 14068 | 41 | 45 | 31 | 39.0
43 | Veterinarian medicine | 30954 | 32 | 34 | 60 | 42.0
44 | Metabolic and hormonal disorders | 18518 | 30 | 46 | 55 | 43.67
45 | Metrology | 12978 | 34 | 52 | 47 | 44.33
46 | Law | 7480 | 54 | 49 | 32 | 45.0
47 | Psychology | 8583 | 51 | 55 | 29 | 45.0
48 | Human reproduction | 21535 | 35 | 42 | 58 | 45.0
49 | Process engineering | 15340 | 36 | 47 | 53 | 45.33
50 | Hydrology | 12396 | 40 | 53 | 44 | 45.67
51 | Architecture and Design | 4242 | 58 | 57 | 22 | 45.67
52 | Philosophy | 7380 | 57 | 44 | 43 | 48.0
53 | Sport | 10013 | 43 | 54 | 49 | 48.67
54 | Geodesy | 7760 | 45 | 56 | 51 | 50.67
55 | Electric devices | 13633 | 42 | 51 | 59 | 50.67
56 | Literary sciences | 6399 | 61 | 50 | 48 | 53.0
57 | Traffic systems | 4448 | 48 | 60 | 52 | 53.33
58 | Culturology | 7240 | 60 | 48 | 54 | 54.0
59 | Technology driven physics | 6876 | 47 | 59 | 64 | 56.67
60 | Communications technology | 4388 | 52 | 63 | 56 | 57.0
61 | Psychiatry | 2481 | 55 | 65 | 61 | 60.33
62 | Criminology and social work | 2324 | 66 | 62 | 62 | 63.33
63 | Mining and geotechnology | 2342 | 59 | 68 | 63 | 63.33
64 | Theology | 2941 | 67 | 58 | 66 | 63.67
65 | Ethnic studies | 2398 | 63 | 61 | 67 | 63.67
66 | Art history | 1408 | 70 | 64 | 57 | 63.67
67 | Archaeology | 1177 | 68 | 66 | 65 | 66.33
68 | Information science and librarianship | 792 | 62 | 70 | 70 | 67.33
69 | Stomatology | 391 | 64 | 71 | 68 | 67.67
70 | Landscape design | 1046 | 69 | 67 | 71 | 69.0
71 | Musicology | 748 | 71 | 69 | 69 | 69.67

5 Conclusions

The graph-based bibliometric analysis of the Slovenian scientific community shows that current citation-based metrics do not cover some aspects of scientific excellence, such as a researcher's role in connecting a wider research community. Our correlation analysis indicates that the existing measures of scientific excellence extracted from the SICRIS web page are strongly correlated. In the future, we plan to expand this analysis to also measure the impact of Slovenian scientists on the global scientific enterprise, and to conduct additional research to find patterns across disciplines or institutions.

6 Acknowledgments

The authors acknowledge the financial support of the Slovenian Research Agency for the research core funding of the programme Knowledge Technologies (No. P2-0103).

References

[1] Njål Andersen. 2021. Mapping the expatriate literature: a bibliometric review of the field from 1998 to 2017 and identification of current research fronts. The International Journal of Human Resource Management, 32, 22, 4687–4724.
[2] Lutz Bornmann. [n. d.] Research excellence in eastern Europe: a bibliometric study focusing on Croatia, Estonia, Hungary, Latvia, Lithuania, and Slovenia.
[3] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30, 1-7, 107–117.
[4] Naveen Donthu, Satish Kumar, Debmalya Mukherjee, Nitesh Pandey, and Weng Marc Lim. 2021. How to conduct a bibliometric analysis: an overview and guidelines. Journal of Business Research, 133, 285–296.
[5] Linton C. Freeman. 1977. A set of measures of centrality based on betweenness. Sociometry, 40, 1, 35–41. Retrieved June 27, 2024 from http://www.jstor.org/stable/3033543.
[6] Aric Hagberg, Pieter J. Swart, and Daniel A. Schult. 2008. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008). Los Alamos National Laboratory (LANL), Los Alamos, NM (United States), 11–15.
[7] John P. A. Ioannidis, Daniele Fanelli, Debbie Drake Dunne, and Steven N. Goodman. 2015. Meta-research: evaluation and improvement of research methods and practices. PLoS Biology, 13, 10, e1002264.
[8] John P. A. Ioannidis, Richard Klavans, and Kevin W. Boyack. 2016. Multiple citation indicators and their composite across scientific disciplines. PLoS Biology, 14, 7, e1002501.
[9] Paul Turán, editor. 1969. Publications of Edmund Landau. In Number Theory and Analysis: A Collection of Papers in Honor of Edmund Landau (1877–1938). Springer US, Boston, MA, 335–355. ISBN: 978-1-4615-4819-5. DOI: 10.1007/978-1-4615-4819-5_23.
Empowering Open Education Methodologies with AI-based Strategies for the Customization of Education

Tel Amiel, Universidade de Brasilia, Brasilia, Brazil, amiel@unb.br
Antônio J. Moraes Neto, Instituto Federal de Brasilia, Brasilia, Brazil, antonio.neto@ifb.edu.br
Joao Pita Costa, IRCAI, Jozef Stefan Institute, Ljubljana, Slovenia, joao.pitacosta@quintelligence.com
Mitja Jermol, Anja Poljanar, IRCAI, Jozef Stefan Institute, Ljubljana, Slovenia

ABSTRACT

The amount and heterogeneity of data generated in the context of education, together with the rapid progress of scientific research and technological development, have created vast amounts of data, much of it open data, but also significant challenges in gathering, filtering and making sense of this information. In this paper, we discuss the research outcomes of complementary Artificial Intelligence (AI)-based strategies for monitoring and enhancing Open Education, mining student–educator interaction in online forums, and empowering the mentorship of educators. Firstly, the initial results obtained from the construction of an OER Observatory focusing on Open Educational Resources (OERs) contribute to implementing the 2019 UNESCO OER Recommendation and advancing the education-focused Sustainable Development Goal (SDG) 4. Acting on five verticals, it enriches and processes multilingual data, displays meaningful information on a dashboard focused on AI and OERs, and serves as a collaboration platform built on existing partnerships within the International Research Centre on Artificial Intelligence under the auspices of UNESCO (IRCAI), the UNESCO Chair in Distance Education and the UNESCO Chair on Open Technologies for Open Educational Resources and Open Learning, mobilizing research collaboration on key AI research challenges related to generating knowledge about OER. Secondly, we discuss the recent development of an Educational Recommender System (ERS) that integrates Conversational Analysis (CA) to assess and enhance collaborative learning (CL) in Virtual Learning Environments (VLEs). This novel system was designed to identify collaboration among students and provide tailored recommendations to promote participation and interaction within discussion forums. Finally, we discuss the development and implementation of AI and OERs in alignment with the SDGs, addressing topics of significant social impact through an international online mentoring initiative.

KEYWORDS

Open Education, Machine Learning, Educational Recommender System, Conversational Analysis, Virtual Learning Environment

https://doi.org/10.70314/is.2024.sikdd.16

1 Introduction

The centralizing piece of the discussions in this paper is an AI-based observatory that allows the exploration of OER-related topics, particularly those mentioned in the OER Recommendation: promoting OER and acknowledging its contribution to advancing quality education, while providing information on advances focused on the equity and inclusion qualities of OER, as well as on research, activities, projects and news related to OER development, including new initiatives and projects, and promoting public infrastructures for education. The Observatory builds on the content made available in UNESCO's OER Dynamic Coalition Portal (oerdynamiccoalition.org), providing the user with access to any of four proposed views: media, science, policies and training. In each of the views, the user can access interactive data visualisations summarising the sourced data, configured to observe the UNESCO OER recommendations. As it is fully based on open data, it allows the user to click on the collected and summarized resources and be taken directly to the source in media, journal, policy or training.

Embracing the intersection of AI and education, which has led to the development of various tools that personalize and enhance learning experiences, we discuss complementary research based on CA, closely aligned with the objective of empowering community interaction at the SDG 4 (Education) Observatory [6]. AI applications in education often focus on providing adaptive feedback, facilitating personalized learning paths, and analyzing student data to improve outcomes. CA is a method that examines the understanding generated through interactions, offering a framework for analyzing how students collaboratively build knowledge. By combining CA with AI, this research aims to develop a system that not only assesses but also actively promotes collaboration in VLEs [10]. The ERS discussed later in this paper is an example of how IRCAI's SDG4 Observatory gains a complex capability for engaging with communities such as those in education. The discussion then expands towards the appropriate mentorship of the professionals who will change the domain's landscape. While initiatives in this context are diverse and dispersed, the authors are not aware of existing similar approaches [5].
2 AI-based strategies for the moderation of online forums on education

Entering the age of Big Data, AI is feeding the data-driven digital transformation across industries, including education. CL emphasizes the importance of group tasks and joint participation, wherein students learn by actively engaging in dialogues that facilitate the sharing of ideas and information. Even in remote settings, CL enables students to learn together through virtual platforms. AI offers new opportunities as a pedagogical tool, providing adaptive and personalized environments that can support CL. This research explores the integration of AI into educational contexts, particularly through the development of an Educational Recommender System (ERS) that uses CA to identify and promote collaboration among students in VLEs [1] (see Figure 1).

Figure 1: The ERS forum analysis screen.

The research methodology is divided into three key stages: Conversational Analysis, applying CA to monitor discussion forums within the Moodle platform, focusing on interactions among students and identifying collaborative behaviors and interaction patterns; Collaboration Assessment, evaluating the level of collaboration among students based on the identified interaction patterns; and Development of the ERS, building a mechanism that provides recommendations to students, teachers, and tutors. These recommendations are aimed at enhancing collaboration and are based on the analysis of forum interactions [15]. The initial dataset comprises 20,976 messages from Moodle discussion forums, 15,703 of them posted by students from a vocational education school. The analysis focuses on these messages to develop and validate the ERS's recommendations. The quality of collaboration is measured through various indicators, which are extracted during the different stages of CA. Preprocessing applies Natural Language Processing (NLP) techniques to ensure the accuracy of the analysis, preparing data for the Resource Processing stage, which uses Social Network Analysis (SNA) to characterize the social dynamics and interactions among students. Moreover, Message Attribute Identification is the CA stage that identifies characteristics of students' messages, specifically their questions; Topic Modeling is then employed to identify the key terms discussed in the forums [12], using the Tomotopy library (bab2min.github.io/tomotopy).
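As an illustration of the topic-modeling stage, a minimal Tomotopy LDA run over tokenised forum messages follows; the number of topics and the preprocessing are assumptions, not the paper's exact configuration:

```python
import tomotopy as tp

# messages: forum posts already tokenised and stop-word filtered
mdl = tp.LDAModel(k=10, seed=42)      # number of topics chosen for illustration
for tokens in messages:
    if tokens:                        # tomotopy rejects empty documents
        mdl.add_doc(tokens)

mdl.train(iter=1000)                  # Gibbs-sampling iterations
for t in range(mdl.k):
    top = [word for word, _ in mdl.get_topic_words(t, top_n=5)]
    print(f"topic {t}: {', '.join(top)}")
```

The top words per topic are what a dashboard like the one described below would surface as the "main terms" discussed in each forum analysis.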
The ERS was tested across five experimental cycles in different classes at two Brazilian Federal Institutes, in a Portuguese-language context. The results indicated a positive impact on student learning, with 82% of participants acknowledging the relevance of the recommendations. The system motivated increased participation and collaboration, with a notable trend of students writing more and organizing their ideas more systematically in forum posts. Additionally, 90% of students engaged in other activities proposed by their teachers, demonstrating the effectiveness of the recommendations. The results also demonstrate the system's effectiveness in fostering collaboration, with positive feedback from students and educators. A dashboard was developed for teachers, containing several graphs, including one that shows the main terms discussed in the forum per analysis, in which each edge represents a message from a student containing two of these terms, and the nodes in blue highlight the new terms that emerged relative to the previous analysis (see Figure 2).

Figure 2: Visual analysis of students' collaboration in a discussion forum, where nodes represent actors in the discussion (students/educators) and edges represent interactions.

The development of the ERS represents a significant advancement in promoting collaborative learning in educational settings [6, 7]. By integrating CA into the system, the ERS effectively identifies and enhances collaboration among students. The current implementation of the ERS aims to provide personalized recommendations to students, teachers, and tutors, fostering a more interactive and collaborative learning environment [6]. Future work will explore the integration of additional features, such as wikification and visualization tools, to further enhance the system's capabilities. Furthermore, the research will benefit from the semi-automatic categorization of educational resources in a range of formats, including videos, as in [3].

3 An AI-based Observatory to Assess the Impact of OER Worldwide

Despite the abundance of information available online, some of which is labeled as education-related, it is increasingly hard to find appropriate resources that can serve education, whether at the undergraduate or the professional training level. IRCAI's Open Education Observatory is an initiative dedicated to monitoring, analyzing, and promoting the use of OERs globally. It serves as a hub for research insight and for fostering collaboration, providing valuable insights and data on the adoption, impact, and trends of OER in education systems worldwide. The Observatory supports educators, policymakers, and institutions in leveraging open resources to enhance teaching and learning.
It is designed to support government and institutional decision-makers dedicated to promoting the goals of the 2019 UNESCO OER Recommendation, which is centred on OER but more generally promotes the ideals of Open Education (see Figure 3).

Figure 3: Dashboard of visual modules to analyse the most relevant topics under a certain domain or SDG, and the trends that can direct the preparedness of education actors.

The Open Education Observatory ingests a range of data sources of heterogeneous nature and varying quality and frequency: (i) worldwide news in almost real time, providing information from a vast catalogue of multilingual world news captured in more than 60 languages and based on a variety of Wikidata concepts; (ii) published scientific articles, including journal and conference papers, mostly peer-reviewed, covering more than 126 million articles with yearly updates; (iii) OER policies from the OER Policy Hub (www.oepolicyhub.org), which are input into the OER DC Portal, with subsequent extraction and enrichment of metadata and preparation of dashboards based on filters over the metadata, as well as OECD policy data and metadata on AI and education with yearly updates; (iv) lectures and videos selected and filtered by content from Videolectures.net [10]; (v) a snapshot of worldwide public and private initiatives related to AI and SDG 4 captured by IRCAI's Top100 and related actions; and (vi) a range of worldwide indices with yearly updates on education-related topics, such as the percentage of children out of school or the literacy rate of youth and adults (see Figure 4).

Figure 4: The architecture of the OER Observatory as an Elasticsearch-based system that enables the visualization of heterogeneous data on OERs.
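Figure 4 describes the Observatory as an Elasticsearch-based system. As a sketch of how a focus-area filter over the ingested sources could be expressed (the endpoint, index name, and field names are hypothetical, not the Observatory's actual schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

# Retrieve recent news items tagged with one of the five focus areas;
# index and field names below are illustrative assumptions
resp = es.search(
    index="oer-news",
    query={
        "bool": {
            "must": [{"match": {"body": "open educational resources"}}],
            "filter": [{"term": {"focus_area": "capacity building"}}],
        }
    },
    sort=[{"published": {"order": "desc"}}],
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("title"))
```

A query of this shape combines full-text relevance (the `match` clause) with exact categorical filtering (the `term` clause), which is how keyword-categorized heterogeneous sources can be served per focus area.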
To ensure that content is readily available for each focus area, materials from the mentioned sources are categorized by relevant keywords and concepts closely associated with the five key areas of the Recommendation. This organization allows users to easily filter and access content based on their specific interests within these areas. By doing so, users can tailor their exploration of resources to match their focus, whether it is capacity building, supportive policy development, leveraging equitable access provided by OER, sustainability models, or international cooperation.

For each area, users can filter and find content specific to their domain of interest: up-to-date news and research on OER developments; academic studies related to professional development and relevant lectures for capacity building; information on OER policy development; resources and research focused on effective, inclusive, and equitable access to OER; strategies for developing sustainable OER models; and opportunities for fostering international cooperation through potential new partnerships and shared goals. This organized approach enhances the ability to pinpoint and utilize the most relevant information in each domain. Information generated by the Observatory can be used to aid in the resolution of problems related to the promotion of OER, by identifying trends and major areas of discussion, and to explore successful scenarios through similar challenges and cases. The Observatory provides benefits to a range of stakeholders, including: national governments, by providing access to a variety of perspectives on OER trends for decision-making; educational and research institutions, by facilitating access to resources and data; civil society, by allowing access to information and training materials that explore the knowledge available towards the implementation of the UNESCO recommendations; and the general population, by empowering open education.

4 Open Education for a Better World

The Open Education for a Better World (OE4BW) program is an international online mentoring initiative aimed at advancing the development and implementation of open educational resources (OER) that address topics of significant social impact, in alignment with the United Nations Sustainable Development Goals (SDGs) [2, 14]. As part of the Slo2Svet project, the program received 70 project applications and 87 mentor applications from six continents and 25 different countries (see Figure 5). The program's activities are structured into thematic clusters focusing on areas such as Artificial Intelligence, Displaced Persons, Sustainability, Health and Well-being, Renewable Energy, Education, and Youth (specifically targeting developers aged 12–24). Throughout the project development process, progress was closely monitored by a network of mentors and hub coordinators, providing essential guidance and support to the OER developers. Additionally, within the scope of the Slo2Svet project, evaluation rubrics for the OER projects were developed and will be utilized during the final conference, where developers will present their completed work.

Figure 5: Participants of the OE4BW mentorship in 2023/24.
5 Conclusions and further work

In this paper we discussed research results and opportunities in Open Education, building on an overall perspective of the OER landscape, AI-enhanced student–educator interaction, and mentorship for further progress. We will further explore the potential of the OER Observatory, particularly regarding the appropriate use of LLMs in analyzing compliance with AI policies in education. Regarding the future development of EduColab, in alignment with IRCAI's SDG 4 Observatory, the Videolectures.net research agenda, and the potential for institutional collaboration, we will focus on: (i) appropriate wikification, incorporating suggestions of Wikipedia concepts identified by Wikifier and related to the main discussion topics; (ii) the integration of interactive data visualization presenting graphical representations of collaboration trajectories, topic evolution, and other key indicators; (iii) extending the system by applying the ERS to other datasets, including public and private message exchange logs, to validate and enhance its applicability; and (iv) personalized recommendations, developing a user-based collaborative filtering technique to tailor recommendations more specifically to individual student groups. Moreover, we will explore the pathways of AI-based citizen science in the context of Open Education and how it can be integrated into the wider scope of the SDG4 Observatory.

In the context of the Slo2Svet project, we are conducting a comprehensive analysis of the Open Education for a Better World (OE4BW) mentoring program since its inception, examining outcomes and connections to other initiatives [see, for example, 12]. Additionally, we will develop an evaluation framework to assess the impact of the projects produced through the program, mapping project outputs to the five action areas of the 2019 UNESCO OER Recommendation, using insights provided by automatic text analysis and other AI tools. This will allow us to connect the projects produced by OE4BW to the concrete objectives of the Recommendation, providing examples of practice that can be leveraged to advance its goals.

ACKNOWLEDGMENTS

We thank the support of the Slovenian Research Agency (ARIS) and the Ministry of Foreign and European Affairs (MZEZ) on the project Slo2Svet – Connecting cultures, informing and learning through Open Educational Resources and AI (V2-2363).

REFERENCES

[1] Ahmadian Yazdi, H., Seyyed Mahdavi Chabok, S. J., and Kheirabadi, M. (2022). Dynamic Educational Recommender System Based on Improved Recurrent Neural Networks Using Attention Technique. Applied Artificial Intelligence, 36(1), 2005298.
[2] Drevensek, M., and Urbancic, T. (2022). The Role of Teamwork in the Creation of Open Educational Resources for Closing SDG-Related Knowledge Gaps. Open Praxis, 14(2).
[3] Grcar, M., Mladenic, D., and Kese, P. (2009). Semi-automatic categorization of videos on videolectures.net. In Proceedings of Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2009, Bled, Slovenia, September 7–11, 2009. Springer, pp. 730–733.
[4] Koschmann, T. (2013). Conversation Analysis and Collaborative Learning. In C. Hmelo-Silver, C. Chinn, C. Chan, and A. O'Donnell (Eds.), The International Handbook of Collaborative Learning. Routledge Handbooks, pp. 149–167.
[5] Liu, Q., Huang, J., Wu, L., Zhu, K., and Ba, S. (2019). CBET: Design and evaluation of a domain-specific chatbot for mobile learning. Universal Access in the Information Society.
[6] Moraes Neto, A. J., and Fernandes, M. A. (2019). Chatbot and Conversational Analysis to Promote Collaborative Learning in Distance Education. 2019 IEEE 19th International Conference on Advanced Learning Technologies (ICALT), pp. 324–326.
[7] Moraes Neto, A. J., Fernandes, M. A., and Amiel, T. (2022). Conversational Analysis to Recommend Collaborative Learning in Distance Education. 14th International Conference on Computer Supported Education, pp. 196–203.
[8] Moraes Neto, A. (2024). Sistema de Recomendação Educacional para Diagnosticar e Promover a Colaboração em Ambientes Virtuais de Aprendizagem. Doctoral thesis. Federal University of Uberlândia.
[9] Novak, E., and Novalija, I. (2016). Visual and Statistical Analysis of VideoLectures.NET. Proceedings of SiKDD'16.
[10] Urbančič, T., Polajnar, A., and Jermol, M. (2019). Open education for a better world: a mentoring programme fostering design and reuse of open educational resources for sustainable development goals. Open Praxis, 11(4), pp. 1–18. ISSN 1369-9997.
[11] Urbančič, T., Polajnar, A., and Jermol, M. (2019). Open Education for a Better World: A Mentoring Programme Fostering Design and Reuse of Open Educational Resources for Sustainable Development Goals. Open Praxis, 11(4).
[12] Urbančič, T., et al. (2023). Developing supportive policies and strategies for their implementation: student experience with real-world cases. In Open Educational Resources in Higher Education: A Global Perspective. Singapore: Springer Nature Singapore, pp. 35–53.
[13] Uthus, D. C., and Aha, D. W. (2013). Multiparticipant chat analysis: A survey. Artificial Intelligence, 199–200, 106–121.
[14] Vayansky, I., and Kumar, S. A. P. (2020). A review of topic modeling methods. Information Systems, 94.
[15] Zawacki-Richter, O., Marín, V. I., Bond, M., and Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education – where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 39.
Addressing Water Sustainability Challenges in North Africa with Artificial Intelligence

Mustafa Zaouini, Maurizio Santamicone, Lee Chana – AI in Africa, Johannesburg, South Africa – mus@fliptin.com
Joao Pita Costa*, Davor Orlic, Mihajela Črnko – IRCAI, Quintelligence, Ljubljana, Slovenia – joao.pitacosta@quintelligence.com
Manal Cherkaoui, Anas Ait Aomar, Ikram Chairi, Karima Echihabi – UM6P, Ben Guerir, Morocco – candia@usp.br
Hanaa Hachimi, Y. Kaddouri, I. Lirmaqui, A. H. Alaoui, O. Ignammas, H. Rahhou – Ibn Tofail University, Kénitra, Morocco
M. Wahib Abkari, R. Rachidi, W. Laaleg, Z. Hidila, M. Tabaa – Moroccan School of Engineering Sciences, Casablanca, Morocco
K. Gourari, I. Annaki, B. Jearani, J. S. Trabi, T. Zennouhi, M. Sbaa – UMP University, Oujda, Morocco
T. El Azzoiani, M. Ait Essibaa, A. Hamidine, H. Lachheb – Al Akhawayn University, Ifrane, Morocco

ABSTRACT

The topic of water sustainability has been leading priorities worldwide, and Artificial Intelligence (AI) can position research institutions, public and private companies, and governments towards evidence-based decision-making with regard to water resources. In this particular domain, the amount and heterogeneity of the data generated, together with the rapid progress of scientific research and technological development, have created vast amounts of data, but also significant challenges in gathering, filtering, and making sense of this information. This paper presents the research outcomes of a collaborative effort engaging a total of 51 students mentored by 15 professors across 11 research institutions in North Africa, distributed over 14 selected projects focusing on the appropriate application of machine learning methods to local and national water sustainability problems. These outcomes were motivated by a youth challenge co-organized in May 2024 by AI in Africa and IRCAI with the support of GITEX.
KEYWORDS

Machine learning, text mining, large language models, community engagement, water sustainability, competition

https://doi.org/10.70314/is.2024.sikdd.17

1 Introduction

Building upon common interests, exciting initiatives and existing projects developed by IRCAI and AI in Africa (aiinafrica.org) focused on AI and sustainability, this activity aimed to build capacity within African youth to advance the Sustainable Development Goals (SDGs) through AI, addressing challenges within their own communities and in the region. The AI Youth Challenge originated in discussions started at GITEX Dubai in 2023 and led to a concrete event in the AI Everything section of GITEX Africa at the end of May 2024. It was mostly directed at PhD/MSc students and young entrepreneurs working on AI to solve problems for the good of their communities, exploring a wide range of machine learning methodologies (from image recognition on satellite imagery, to text mining on social media, gamification strategies optimizing water consumption, and the application of LLM frameworks for RAG and AI Agents in the context of water sustainability), and engaging experts from global agencies such as UNESCO, the AI Movement, and UNESCO's Water Education Institute, as well as national companies, research institutions and government. The global challenge of this action, "Water, AI and Sustainability", is one of the MENA priorities; it takes into consideration the UN Water Programme for 2024–25 [12] and follows the work done by IRCAI with the European Commission (EC) on the NAIADES Water Observatory [9], as well as the recently opened IRCAI Committee on AI and Water Resource Management [4], focusing on the impact of AI on SDG 6 [11]. This work aligns with UNESCO's interest in taking action to capacitate the youth towards AI, with a focus on the recent activities based in Morocco but with a global scope, including the opening of the new UNESCO AI Centre, the AI Movement (aim.um6p.ma).

Figure 1: Winner of the AI4Water challenge, designed and developed by UM6P students, exposing a water map that pinpoints remote villages with assigned water scores based on satellite imagery and crowdsourced data.

2 Finalist innovative ideas on water sustainability

Attracting the participation of more than 50 PhD and MSc students across 20 teams based in research institutions in Morocco, this initiative was designed to encourage a conversation between communities, corporate thought leaders, education visionaries, and ecosystem builders around the shifts and needs of the changing future landscape. The discussions brought together researchers, start-up communities, technologists, and government representatives to unite and define the future of water sustainability as they see it. The selected AI technologies and methodologies ranged from the use of satellite imagery to the analysis of news and social media, input from water-related sensors, and the application of Large Language Models (LLMs) to describe good practices.
We proceed by describing the problems addressed by the finalists of the AI4Water challenge, their prototypes, and the value of the innovation they brought with them.

AquaScore. Rural communities in Morocco's High Atlas Mountains struggle with water management due to limited resources and visibility. Despite needing only modest funds, these villages face significant hurdles in accessing support. The challenge lies in objectively quantifying water issues and connecting these communities with potential supporters. AquaScore creates a water map that pinpoints remote villages and assigns them water scores based on satellite imagery and crowdsourced data. This enables ranking villages by water criticality, helping funders and supporters identify where to direct their assistance effectively. The prototype (described in Figure 1) also offers a platform for discussing water solutions, fostering community engagement through gamification features. By increasing the visibility of rural Moroccan villages and providing objective water criticality assessments, AquaScore facilitates efficient resource allocation for donors and experts. This AI-driven approach ensures fair and unbiased assistance to communities in need, promoting water sustainability and improved water management in constrained environments.

AquaScore employs a hybrid approach combining Computer Vision (CV) and Natural Language Processing (NLP). CV algorithms segment satellite images to generate automated baseline water scores, while NLP algorithms extract insights from textual data to enhance score accuracy. This combination allows for objective assessment and continuous improvement of the water criticality rankings. The team has already aggregated data on 1,322 High Atlas villages, extracted satellite images, and segmented them using Facebook's Segment Anything model; this process was completed on UM6P servers using 500 GB of storage and 80 CPU cores. The system will incorporate user-submitted reports and internet-scraped data to further refine the water scores. The uniqueness of AquaScore lies in its data generation and refinement approach: it creates datasets in areas with data scarcity, starting from an automated baseline derived from satellite imagery and then enriching it through user-generated content. This closed-loop system employs active learning, progressively enhancing the accuracy and relevance of the water scores.

AquaSense. Water management is a critical issue in many countries, including Morocco. Severe droughts, poor water distribution, and recent natural disasters raise the urgent need for better solutions to manage water resources effectively. AquaSense's prototype (see Figure 2) offers a smart way to handle water resources by predicting future water situations, visualizing key data, and engaging citizens and communities. This helps decision-makers plan better, save resources, and respond quickly to local water issues. AquaSense provides accurate forecasting of water parameters for informed management and answers water-related questions with detailed analysis using the latest data and news. It offers transparent data visualization through interactive charts, allowing users to view and upload data easily. The community and citizens' space features real-time news updates, a water-levels map to locate and help regions in need of water, and a tool to easily report local water issues.

Figure 2: Screenshot of the AquaSense prototype, defining parameters, visualizing data and monitoring engagement.

AquaSense combines two distinct branches of AI: deep learning (LSTM) and generative AI (RAG and AI Agents). It uses multivariate, multistep LSTMs to predict water parameter levels for the coming years, and Retrieval-Augmented Generation and AI Agents to answer water-related queries with detailed analysis, using the latest data, news, predicted parameters, and documents from sources such as the UN, UNCCD, and EPA. AquaSense uses TensorFlow and Keras (LSTM model), Pandas and NumPy (data preparation and management), LangChain (LLM framework for RAG and AI Agents), Chroma (vector DB), Nomic embeddings (open-source embeddings), GPT-3.5-Turbo (LLM), and Streamlit (web app). AquaSense improves water management by helping stakeholders make informed decisions, enhancing resource allocation, and promoting sustainable practices. Through its innovative features, it bridges the gap between citizens and authorities, fostering collaboration and reducing water crises over time. AquaSense also aligns with several UN Sustainable Development Goals (SDGs), such as SDG 6 (Clean Water and Sanitation), SDG 13 (Climate Action), and SDG 11 (Sustainable Cities and Communities).
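A minimal Keras sketch of the multivariate, multistep LSTM component described above follows; the window sizes, number of features, and layer sizes are illustrative assumptions, not AquaSense's published architecture:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

LOOKBACK, HORIZON, N_FEATURES = 24, 12, 4   # illustrative window sizes

def make_windows(series, lookback=LOOKBACK, horizon=HORIZON):
    """Slice a (time, features) array into supervised (X, y) windows."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        # predict the first feature (e.g., a water level) over the horizon
        y.append(series[i + lookback:i + lookback + horizon, 0])
    return np.array(X), np.array(y)

model = models.Sequential([
    layers.Input(shape=(LOOKBACK, N_FEATURES)),
    layers.LSTM(64),
    layers.Dense(HORIZON),        # direct multistep output
])
model.compile(optimizer="adam", loss="mse")
# X_train, y_train = make_windows(scaled_history)
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```

The `Dense(HORIZON)` head emits all future steps at once (direct multistep forecasting); an alternative design would feed predictions back recursively, trading error accumulation against a simpler output layer.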
Water Consumption Tracker. This prototype addresses the global problem of water-use optimization in light of the already visible consequences of climate change, namely the large amount of water wasted by households through irresponsible use. Its added value lies in the behavioral approach: the application is designed to make users more aware of their attitude toward water consumption and to make water conservation a pleasure rather than a responsibility. Introducing gamification as a new strategy should help make water conservation more appealing. The prototype is based on an app that tracks real-time water usage, provides personalized recommendations, and motivates users through a gamified environment, fostering a community focused on sustainable water use.

The team uses machine learning models such as a Random Forest Regressor to find patterns between household characteristics and their water-usage behavior, and plans to add generative AI in the form of an LLM-based chatbot that provides custom tips to optimize water usage. The approach is fundamentally based on: (1) collecting data about the households through the application UI; (2) providing an optimal water consumption level via the ML model, based on the collected data; and (3) monitoring water usage through IoT sensors and the app's notification system. The collected data is used to optimize the ML model's performance. The approach can potentially reduce household water waste by 20–50% by educating users about their consumption habits through notifications, ranking systems, and feedback mechanisms.
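A sketch of the Random Forest step, assuming household characteristics and metered usage are available as a DataFrame; the column names and feature set are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# households: illustrative features (size, garden area, appliances, ...)
# plus observed daily water use in litres; column names are assumptions
X = households.drop(columns=["daily_litres"])
y = households["daily_litres"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, rf.predict(X_te)))

# Compare each household's actual use against the model's estimate for
# similar households, to drive notifications and gamified rankings
households["expected"] = rf.predict(X)
households["excess"] = households["daily_litres"] - households["expected"]
```

The `excess` column is one plausible way to turn the regression into behavioral feedback: households consuming well above the prediction for comparable homes are the natural targets for conservation tips.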
Aquatic Biodiversity. The introduction of non-native species into marine ecosystems presents a significant threat to the fragile equilibrium of these vital environments. Invasive species, often aggressive, can outcompete native organisms, leading to disrupted food chains, altered habitats, and potentially irreversible ecological harm. From coastal areas to the open sea, the swift proliferation of invasive plants, animals, and microorganisms endangers the biodiversity, productivity, and resilience of marine life. Addressing this escalating global issue requires immediate and decisive action. AI-powered early-detection algorithms were prepared to constantly monitor for signs of invasive species, triggering immediate alerts to enable a rapid response. Based on species-specific data, the system can precisely deploy the most effective eradication methods, from underwater drones to selective biocides. As invasive species evolve, the AI-driven platform continuously adapts its strategies, ensuring that the interventions remain effective and environmentally responsible.

YAZ. High unemployment rates in North Africa often translate into many individuals being employed in low-wage jobs, particularly youth from low-income households. Severe water scarcity is decreasing exports and raising the prices of vegetables and fruits, creating challenges in meeting the needs of Morocco's population while remaining a major exporter of produce to global markets. This AI-based agricultural solution is built on smart hydroponic towers designed to grow crops efficiently and vertically, indoors and outdoors, making optimal use of the available space. The adoption of hydroponics in Africa has the potential to create millions of new jobs in the coming years. Integrated with a GPT architecture, the technology allows real-time monitoring, pest detection, and yield estimation. YAZ hydroponics represent a shift towards resilient and sustainable Moroccan agriculture.

The tools and technologies presented in this paper that are open source are available in IRCAI's SDG Observatory GitHub repository (github.com/IRCAI-SDGobservatory).

3 From concept to prototype in a month

AI in Africa, in collaboration with IRCAI, conducted a gathering of minds which culminated in a one-day summit around the technologies and shifts of the future, hosted by GITEX in the AI Everything section of GITEX Africa 2024. Between 26 April and 31 May, 55 PhD and MSc students from 11 research institutions took part in a complete program including expert sessions kicked off at the AI Movement, UNESCO's new centre for AI in Africa, and engaging experts in water-related topics such as Matjaž Mikoš, UNESCO Chair for landslide risk reduction, droughts and floods, discussing our recent research on news mining for extreme weather events [5, 6]; Gerald Corzo Perez, senior researcher at the UN Water Education Institute IHE Delft, discussing our ongoing research on water, AI and Twitter [7]; and Ignacio Casals, R&D manager at Aguas de Alicante, Spain, providing an industrial perspective on the use of AI to tackle the challenges of wastewater management [8].

The students were followed across eight stages, including conceptualization; data collection, analysis and visualization; methodology and implementation; prototype building; and the pitch (see Figure 3).

Figure 3: The pitch of one of the top 3 teams – Ghayt – presenting the Water Consumption Tracker at the AI stage of GITEX Africa.

In order to maximize the impact of the programme, the content from the abovementioned opportunities will be organized across the five areas most relevant to UNESCO: (1) capacity building; (2) developing supportive policy; (3) effective, inclusive and equitable access to quality education; (4) nurturing and creating sustainability models for water sustainability; and (5) fostering and facilitating international cooperation.
The training curriculum included weekly seminars open to the public, training workshops for participants, showcases, and mentoring sessions (see Figure 4).

Figure 4: The phases of the training curriculum across 5 weeks.

The discussions forming the base concepts of the participants' projects were held in the light of IRCAI's research and research achievements (see Figure 5), aiming at building research collaboration bridges.

Figure 5: Selected topics from IRCAI's research to motivate challengers in AI and water research.

The data and methods generated by the participants of the programme can be used by companies, government and research institutions to aid in the resolution of problems related to water sustainability, by identifying trends and major areas of discussion, and to explore successful scenarios through similar challenges and cases. IRCAI's SDG 6 Observatory [10] is being built to properly address the challenges of decision-makers using AI. It benefits: (i) national governments, providing access to a variety of perspectives (including trend and comparative views) on a data-driven dashboard with information on water sustainability trends for decision-making, and access to local (e.g., country-level) progress on SDG 6; (ii) educational institutions, offering access to information on current trends in water sustainability research and development; (iii) research institutions, sourcing open data through interactive visualisation and research; (iv) the NGO community, easing access to information directly linked to community priorities, including citizen-science activities; and (v) the general population, empowering water education for all.

4 Conclusions and further work

Capacity building to enhance opportunities can benefit from engaging the youth in AI-driven challenges that start from research problems deriving from issues in their own communities — problems they know well, and data to which they often have privileged access — with promising impact that can ensure the sustainability of the innovation offered. The initiative also served to collaboratively discuss sustainable solutions that help large-scale recovery and define a better, more hopeful and inclusive Africa. The winning outcomes of this challenge will join a vibrant worldwide community of researchers and entrepreneurs focusing on AI and the SDGs, starting with SDG 6, and supported by initiatives such as IRCAI's Top 100 or the SDG Observatory. Ethical considerations are being addressed in the context of the EC project AI4GOV.

ACKNOWLEDGMENTS

This research was partially funded by the European Commission's Horizon research and innovation programme under grant agreements 820985 (NAIADES) and 101120237 (ELIAS).

REFERENCES

[1] Blazhevska, V. (2020). United Nations launches framework to speed up progress on water and sanitation goal. United Nations Sustainable Development.
[2] Casale, G., and Cordeiro Ortigara, A. R. (2019). Water in the 2030 Agenda for Sustainable Development: How can Europe act? Water Europe, Brussels. ISBN 978-90-8277064-3, 36 p. https://unesdoc.unesco.org/ark:/48223/pf0000372496
[3] International Water Association and Xylem Inc. (2019). Digital Water: Industry leaders chart the transformation journey. [Online] https://iwa-network.org/wp-content/uploads/2015/12/IWA_2019_Digital_Water_Report.pdf
[4] IRCAI Committee Chair on AI and Water Resource Management. [Online] ircai.org/project/ai-and-water-resources-management/
[5] Mikoš, M., Bezak, N., Pita Costa, J., Nassri, M. B., Jermol, M., and Grobelnik, M. (2022). Natural-hazard-related web observatories as a sustainable development tool. In Progress in Landslide Research and Technology, Vol. 1, No. 1, Springer (in print).
[6] Pita Costa, J., Rei, L., Bezak, N., Mikoš, M., Massri, M. B., Novalija, I., and Leban, G. (2024). Towards improved knowledge about water-related extremes based on news media information captured using artificial intelligence. International Journal of Disaster Risk Reduction, 100, p. 104172.
[7] Perez, G., Pita Costa, J., Novalija, I., Rei, L., Senožetnik, M., and Casals del Busto, I. (2024). Integrating Social Media, News and Machine Learning for Enhanced Hydrological Event Detection and Management. In 15th International Conference on Hydroinformatics (p. 278).
[8] Pita Costa, J., Massri, M. B., Novalija, I., Casals del Busto, I., et al. (2021). Observing Water-Related Events for Evidence-Based Decision-Making. In Slovenian Data Mining and Data Warehouses conference (SiKDD 2021).
[9] Pita Costa, J. (2022). Water Intelligence to Support Decision Making, Operation Management and Water Education – the NAIADES Report. IRCAI Library. [Online] https://ircai.org/project/ircais-project-report-on-naiades/
[10] Pita Costa, J., Zaouini, M., Crnko, M., Polzer, M., Corzo Perez, G., Mikoš, M., Orlic, D., and Jermol, M. (2024). Challenging Water Sustainability in Africa Through AI. Proceedings of the HHAI 2024 workshop on AI in Africa and SDGs.
[11] UN Sustainable Development. The IRCAI Water Observatory – AI in the service of SDG 6. [Online] https://sdgs.un.org/partnerships/ircai-water-observatory-ai-service-sdg-6
[12] UN-Water Work Programme 2024–2025. [Online] https://www.unwater.org/publications/un-water-work-programme-2024-2025
Predicting poverty using regression

Luka Urbanč, Jožef Stefan Institute, Ljubljana, Slovenia, urbancluka3@gmail.com
Marko Grobelnik, Jožef Stefan Institute, Ljubljana, Slovenia, marko.grobelnik@ijs.si
Joao Pita Costa, IRCAI, Quintelligence, Ljubljana, Slovenia, joao.pitacosta@quintelligence.com
Luis Rei, Jožef Stefan Institute, Ljubljana, Slovenia, luis.rei@ijs.si

Abstract

Poverty reduction is the first Sustainable Development Goal set by the United Nations to be achieved by 2030, but current data indicates that progress is insufficient. The diverse factors influencing poverty across different nations pose a challenge to developing effective predictive models. This paper evaluates the use of various regression models to predict poverty rates using a comprehensive dataset of 111 variables from sources such as the UN and the World Bank. The data, spanning multiple domains such as political stability, education, and economic conditions, was preprocessed and transformed to create auxiliary features and interactions. Among the models, Ridge regression yielded the best results, achieving a Root Mean Square Error (RMSE) of 3.6, indicating high predictive accuracy on a global scale. This study highlights the importance of addressing multicollinearity and incorporating a wide range of features to improve the generalizability of poverty prediction models. Future research should explore more complex methods, such as neural networks, and refine model hyperparameters for enhanced performance.

Keywords

poverty, linear regression, lasso regression, ridge regression, elastic net regression, sustainable development goals

https://doi.org/10.70314/is.2024.sikdd.20

1 Introduction

The need to eradicate poverty has been a long-standing issue, globally recognized numerous times, most prominently in the United Nations (UN) Sustainable Development Goals (SDGs), where it occupies the number one spot as SDG1: "End poverty in all its forms everywhere", to be achieved by 2030. The latest UN report on the progress made towards SDG1 indicates that poverty has returned to pre-pandemic levels in middle- and high-income countries, with poverty in low-income countries still a fraction above that reported in 2019. While the trends seem to be going in the right direction, the UN warns that the current pace of improvement is insufficient to reach the agreed goals before 2030. This raises the question of what impacts poverty rates the most and how countries can most effectively reduce poverty levels.

To fully understand and address the issue of poverty, one must navigate several definitions, which can often lead to confusion. The baseline definition used in this paper is the poverty line as defined by each country individually, recognizing that different countries have different measures of, e.g., what life conditions and how much income make an individual reach "poor" status, as well as how this can be normalised to better compare such relative indicators between countries. We are still missing a clear theory in poverty research, despite the issue having existed for decades [2]. That being said, some authors have already explored the causes of poverty. For instance, corruption, political instability, ineffective local governance, government policies, gender inequality, and short-term wage replacement policies, such as maternity leave benefits and sickness pay, impact relative poverty [6, 7]. When assessing what people believe causes poverty, some geographical differences emerge: for example, people in the United States mostly hold the view that an individual's traits are responsible for poverty, while countries in Europe show a blend of individualistic, fatalistic and structural beliefs, such as lack of will, bad luck and social injustice, respectively [4].
Although a number of papers have already been published on the use of ML to predict poverty [1, 10, 12, 5, 3, 8] (for more see [11]), including work combining satellite images and neural networks to help predict poverty in five African countries [5], most take a limited number of variables into account. Usmanova's literature review found 22 papers published between 2016 and March 2022, with a total of 57 AI methods applied, the most popular being random forest, used in more than half of all papers reviewed. It also found that most papers focus only on African and South Asian countries, a finding consistent with our own [11].

In this paper we focus on the following research questions: (i) can regression be used to identify the most influential features from a large number of global indicators; and (ii) can direct and indirect causality relations be identified that signal new indicators relevant to poverty-related issues?

2 Data
To address the research questions, we utilized 111 primary variables from sources such as the UN and the World Bank, aggregated through the Our World in Data portal. These variables span diverse domains, including political stability, policies, education, healthcare, economic conditions, and inequality. We prioritized features that prior research has identified as significant, while also incorporating some factors that are less intuitively linked to poverty. The dataset was then used to train various models aimed at predicting poverty rates across countries. This task is particularly challenging because countries respond differently to the same variables. For instance, GDP growth tends to have a more significant impact on poverty reduction in developing nations compared to developed ones.
Additionally, many variables are strongly correlated, making it difficult for linear regression models to capture their relationships accurately.

As previously mentioned, most of the data used in this paper was sourced from ourworldindata.com (OWiD), with some additional data coming from fao.org, including variables such as foreign direct investment inflows and outflows, and the added value of agriculture, among others. Data on the transatlantic slave trade and colonial rule was obtained from www.slavevoyages.org. All datasets were preprocessed before being merged, following a series of steps.

The first preprocessing step involved light modifications, such as removing irrelevant columns, renaming columns, and excluding data from before 1987 and after 2023 due to gaps and incomplete data. Despite increased reporting in recent years, many countries still omit certain indicators, complicating model training. To address this, missing features with more than n data points for a given country were interpolated, with the edges filled using backward fill (bfill) and forward fill (ffill). Those with fewer than n data points used the mean of the country's income group for the given year as a filler value. The number n was intuitively chosen to be five, and the methods bfill and ffill were chosen to prevent the use of unrealistic data. The World Bank classifies countries into income groups by gross national income per capita: low (less than 1,045 USD), lower-middle (1,046 USD to 4,095 USD), upper-middle (4,096 USD to 12,695 USD), and high income (12,696 USD or more). However, it is important to note that the data generated using the aforementioned methods somewhat reduces overall robustness.
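A minimal sketch of this two-tier gap-filling strategy, assuming a long-format pandas DataFrame with country, year and income_group columns (the column names are our assumptions; the paper's actual implementation may differ):

```python
import pandas as pd

N_MIN = 5  # minimum data points per country required for interpolation

def fill_indicator(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Fill gaps in one indicator column, country by country."""
    # Mean of the country's income group for each year, used as a fallback.
    group_mean = df.groupby(["income_group", "year"])[col].transform("mean")

    def fill_country(s: pd.Series) -> pd.Series:
        if s.count() >= N_MIN:
            # Enough observations: interpolate interior gaps, then
            # extend the edges with backward and forward fill.
            return s.interpolate().bfill().ffill()
        # Too sparse: fall back to the income-group mean for that year.
        return s.fillna(group_mean[s.index])

    df[col] = df.groupby("country", group_keys=False)[col].apply(fill_country)
    return df
```

Restricting interpolation to countries with at least five observations keeps the bfill/ffill edges anchored to real measurements, which is the stated rationale for preferring these methods over purely synthetic fills.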
The next step involved generating auxiliary columns, specifically lagged columns and changes in value for relevant parameters. For instance, the row corresponding to Niger in 2013 would also include the GDP per capita for 2012, 2011, and earlier years, in addition to the value for 2013. This approach reflects the fact that poverty trends often manifest in response to changes over time, rather than immediately. The default number of years for lagged data was set to five. Similarly, we incorporated changes in value over the same five-year period to capture more explicit data on unusual events, such as the onset of wars or significant political changes.

Next, each primary parameter was also used as an argument for a number of mathematical functions in an effort to see whether any correlations are not linear but perhaps quadratic, cubic or another elementary function. The functions used were x^2, x^3, ln x, sin x, cos x, tan x, arcsin x, arccos x and arctan x, to try to capture any elementary nonlinear dependence within the model.

The last step was to create all possible products of the available primary parameters, as creating all possible products with all auxiliary parameters included would have been computationally inefficient. After all these steps, the individual columns were fused together. This method of preprocessing increases the number of possible variables included, making the model even more general while retaining as many rows of data as possible.

The function responsible for preprocessing, generating and merging the data has a few parameters: basic_parameters_only, combinations and math. basic_parameters_only determines whether the model will only contain data obtained from various online databases, or whether it should also include generated data: the changes in value and the values for previous years. combinations determines whether the model should create all possible combinations of the primary parameters, and math determines whether mathematical columns are included in an attempt to gain a deeper insight into the features' relationships. The parameters are marked with B, C and M. For instance, B+M would mean the file contains all the basic parameters in addition to the mathematically derived columns.

Figure 1: Scheme of adopted methodology.

3 Methodology
In order to predict worldwide poverty levels, we have used different linear regression models and compared their accuracies. With this we aimed to ease the interpretability of the models, which is harder to obtain with more complex methods such as neural networks. To perform the research work that is the base of this paper, we have selected ordinary linear regression, lasso regression, ridge regression and elastic net regression as the models to compare. OLS regression struggles with multicollinearity, where predictor variables are highly correlated, leading to unstable estimates of the coefficients. Ridge regression addresses this by adding an L2 regularization term, which penalizes large coefficients and helps to stabilize the estimates in the presence of multicollinearity. By shrinking the coefficients, ridge regression reduces the sensitivity of the model to collinear predictors, ensuring more reliable and generalizable results. Unlike lasso, ridge retains all predictors, making it particularly useful when multicollinearity is a key concern but feature selection is not the goal. We use the implementation of these linear regression algorithms in scikit-learn [9].

The datasets were split into training and test sets using the sklearn function train_test_split, with 80% for training and 20% for testing. The training set was used to train four regression variants (LinearRegression, Lasso, Ridge, ElasticNet) with a random state seed of 42, while the test set was used to determine the mean squared error (MSE) and the R² value using the functions mean_squared_error and r2_score from [9], both common metrics used to assess a model's accuracy. All models except OLS regression also had the data standardized before training. The hyperparameter α for the regularized models was sensibly chosen as 0.1. The results, seen in Table 1, are color coded: red for poor performance, yellow for intermediate, and green for the best. The variation in the number of rows is due to the exclusion of rows with insufficient yearly data, which were dropped when calculating differences from previous years.
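The training and evaluation loop described above is straightforward to express with scikit-learn; this sketch assumes a preprocessed feature matrix X and a poverty-rate target y (both assumptions, produced by the pipeline of Section 2):

```python
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# 80/20 split with the fixed random seed used in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the features for the regularized models (all except OLS).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "linear": (LinearRegression(), X_train, X_test),
    "lasso": (Lasso(alpha=0.1), X_train_s, X_test_s),
    "ridge": (Ridge(alpha=0.1), X_train_s, X_test_s),
    "elastic_net": (ElasticNet(alpha=0.1), X_train_s, X_test_s),
}

for name, (model, X_tr, X_te) in models.items():
    model.fit(X_tr, y_train)
    pred = model.predict(X_te)
    print(name, mean_squared_error(y_test, pred), r2_score(y_test, pred))
```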
After identifying the most successful model, we proceeded to compare its performance between high-income and low-income countries. This comparison aimed to assess how the accuracy and frequency of reported data influence the model's performance. These two income groups were chosen because low-income countries typically report less data with lower accuracy, while high-income countries provide more precise reports. We selected all high- and low-income countries from the dataset that were not used during the model's training. From the 20% of data reserved for evaluation, 444 rows (30%) belonged to high-income countries, and 368 rows (24%) belonged to low-income countries.

We used the trained model to predict poverty levels for these groups and evaluated its performance using the MSE metric to analyze differences between income groups. Additionally, we calculated the maximum error to determine whether the average performance was skewed by outliers. A similar evaluation was conducted on the data from Slovenia and Somalia, which were part of the split. Slovenia had 8 rows of data and Somalia had 6, allowing us to explore how missing data impacts the model's performance, as Somalia had significantly fewer data points overall.

4 Main Results
The file configuration plays a critical role in the model's performance. The results show that C+M, C, and B+C are the best configurations. The C+M file includes all basic features, lagged values, changes in value, mathematical columns, and all possible combinations of basic parameters, totaling 8,236 parameters. Configuration C contains all basic features, combinations, and lagged and difference columns. Lastly, B+C includes only the basic parameters and their combinations. All top-performing models were trained on these datasets.

The results in Table 1 show considerable variation. Models trained with ordinary least squares regression performed poorly, with the best such model reaching an RMSE just under 10.15 and an R² of 0.50. In contrast, lasso and elastic net regression achieved better results, with RMSEs around 7 and R² values close to 0.80. Ridge regression also struggled, except for configuration B+C, which provided the best overall results with an RMSE of 3.6 and an R² of 0.94.
However, caution is advised when interpreting models using configuration C+M or C, due to the high number of features relative to the dataset size, which could affect their real-world reliability.

Table 1: MSE and R² values for different regression models and dataset configurations. The presence of B, C or M signals the presence of basic parameters only (B), combinations (C) and mathematically derived columns (M) in the dataset. A dash is used to label non-converging models with a negative R² value.

Structure | Linear MSE | Linear R² | Lasso MSE | Lasso R² | Ridge MSE | Ridge R² | Elastic net MSE | Elastic net R² | Shape of X
M         | 203        | 0.031     | 74        | 0.65     | -         | -        | -               | -              | (7653, 2131)
None      | -          | -         | 109       | 0.48     | 163       | 0.22     | 108             | 0.49           | (7653, 1221)
C+M       | 198        | 0.054     | 45        | 0.78     | -         | -        | 40              | 0.81           | (7653, 8236)
C         | -          | -         | 50        | 0.76     | -         | -        | 45              | 0.79           | (7653, 7326)
B         | 103        | 0.50      | 110       | 0.47     | 103       | 0.50     | 111             | 0.46           | (7661, 111)
B+C       | -          | -         | 48        | 0.77     | 13.3      | 0.94     | 43              | 0.79           | (7661, 6216)

The model weights reveal that only products are present among the top ten most important factors. These products include data on population, population density, agriculture, equality, healthcare, and education. The largest weights show the biggest differences, gradually decreasing in magnitude. The top ten weights range from just over 10 to 7, with the highest weights involving combinations such as population and population density, meadows and pastures with the global peace index, and population with urban and rural population share. Other notable combinations include secondary school completion with women's civil liberties, internet usage with sanitation access, and military spending with wealth distribution. The weights also reflect factors like infant mortality, years colonized, and agricultural employment. Figure 2 further illustrates the decline in the absolute value of these weights.

Figure 2: Visual representation of model weights.

The model performed better on high-income countries, with an MSE of 6.60, significantly below the overall MSE. In contrast, the MSE for low-income countries was 20.68. The maximum error was also lower for high-income countries (22.1) compared to low-income ones (34.4).

The difference in the model's performance on Slovenia and Somalia was notable. For Slovenia, the MSE was 0.78 with a maximum error of 1.54, far below the overall metrics. Somalia, however, had a much higher MSE of 95.7 and a maximum error of 18.7, likely due to less reliable and more extreme poverty data, which skews the model's performance on extreme cases.

5 Discussion
Firstly, the fact that ordinary least squares linear regression could not produce an accurate model confirms that the parameters are indeed correlated. This is probably also the reason why the ridge regression model performed the best: ridge regression is designed to address the issue of multicollinearity, and the features included are mostly strongly correlated, as stated in the introduction. Furthermore, the correlation between parameters is obviously drastically increased by generating all possible products of basic parameters.

Secondly, the impact of mathematical columns needs to be considered. Of the first four models, two have mathematical columns and two do not. Of the eight models generated, three perform worse if mathematical data is present, while five performed better with mathematical data included. This might indicate some deeper connection, which would be interesting to try to understand. Furthermore, lasso regression handles mathematical columns much better than the other models used, due to its ability to exclude features.

The impact of product combinations of basic features stands out, with all better-performing models having the combinations parameter set to True, suggesting deeper relationships between variables. Exploring these connections further, perhaps by training a neural network on the basic parameters and comparing it to the linear regression models, could be insightful. If the neural network performs better, further investigation into these correlations would be needed.

The dataset used spans from 1987 to 2023, which is relatively short, given that poverty often has deep historical roots. Although data becomes scarcer in earlier years, those points could still be crucial for improving model accuracy. Moreover, most hyperparameters in this paper were chosen sensibly due to time and computational constraints. Different values for the number of lagged years, years of differences, hyperparameters in the training of models and the minimum number of data points required to interpolate missing data could all lead to interesting discoveries and improvements of the generated models.
We will be addressing this in further research.

As stated in [11], the recent literature mostly uses the random forest model and, in fact, ordinary linear regression was not even among the top ten most common methods. An interesting thing to explore would therefore also be the performance of random forest using the best configuration, B+C. The models may struggle to capture correlations between variables due to differing impacts across countries, as mentioned in the introduction. A potential solution is to split the countries into k groups and train separate models for each group (a sketch of this idea is given below). While this could improve predictions, it raises two challenges: how to split countries without bias and how to ensure enough data for training.

The weights in the model further emphasize the issue of multicollinearity among the parameters, with only product terms emerging as the most influential. However, this does not reveal the true importance of individual parameters, as they may enhance the impact of another factor within the product term. Additional research is needed to better determine the true significance of these parameters and gain a clearer understanding of what drives poverty rates up or down. As can be seen in Figure 2, the model's weights occupy a wide range. It is clear that some features are more important, based on their weights, and further work is being done to understand which features stand out and why.

The model also performed better in predicting poverty levels in high-income countries compared to low-income countries. This discrepancy can likely be attributed to the fact that high-income countries report more data with greater accuracy, allowing the model to identify underlying patterns more effectively. In contrast, much of the data for low-income countries had to be interpolated, which reduced variability between countries and negatively impacted the model's performance.
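To make the k-group idea mentioned above concrete, one possible realization is to cluster countries on their average indicator profiles and fit one ridge model per cluster. This is only an illustrative sketch: the use of KMeans, the number of groups and the column names are our assumptions, not part of the paper's experiments.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

K = 4  # number of country groups; an arbitrary illustrative choice

# Cluster countries on per-country mean indicator values (assumes df,
# feature_cols and a "poverty_rate" target column exist and contain
# no missing values after the preprocessing of Section 2).
profiles = df.groupby("country")[feature_cols].mean()
groups = KMeans(n_clusters=K, random_state=42).fit_predict(profiles)
country_to_group = dict(zip(profiles.index, groups))

# Fit one ridge model per group of countries.
group_models = {}
for g in range(K):
    members = [c for c, grp in country_to_group.items() if grp == g]
    rows = df["country"].isin(members)
    group_models[g] = Ridge(alpha=0.1).fit(
        df.loc[rows, feature_cols], df.loc[rows, "poverty_rate"])
```

Note that clustering on the data itself does not resolve the bias concern raised above; it merely replaces a manual split with a data-driven one.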
6 Conclusion
In this paper, we have shown that a general model exists, based on linear regression methodologies, which can predict poverty with a relatively high accuracy (RMSE of 3.6). This was achieved through the testing of numerous linear regression models using open data, with the best model created by ridge linear regression trained on data which also included all possible combinations of the basic features in the dataset. The basic parameters consist of 111 different indicators describing countries across 36 years. Better models could possibly be generated using more complex methods such as neural networks or random forest, gaining in accuracy but compromising the explainability of the model. The models could also benefit from hyperparameter tuning throughout the whole process to improve results and find the optimal values. Our result shows it is possible to achieve this degree of accuracy, but it does not limit what the best model could be. The elastic net, especially, should benefit from such tuning.

7 Acknowledgements
This research was partially funded by the Future of Life Institute under the project "An AI-driven Observatory Against Poverty", and by the European Commission's projects under grant agreements 101135800 (RAIDO) and 101120237 (ELIAS).

References
[1] Gianni Betti, Antonella D'Agostino, and Laura Neri. 2002. Panel regression models for measuring multidimensional poverty dynamics. Statistical Methods and Applications, 11, 359–369.
[2] David Brady. 2019. Theories of the causes of poverty. Annual Review of Sociology, 45, 1, 155–175.
[3] Muse A.H., Hassan A.A., and Chesneau C. 2024. Machine learning study using 2020 SDHS data to determine poverty determinants in Somalia. Scientific Reports, 14, 5956.
[4] Dariush Hayati and Ezatollah Karami. 2005. Typology of causes of poverty: the perception of Iranian farmers. Journal of Economic Psychology, 26, 6, 884–901.
[5] Neal Jean, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, and Stefano Ermon. 2016. Combining satellite imagery and machine learning to predict poverty. Science, 353, 6301, 790–794.
[6] A.H. Ng, Abdul Ghani Farinda, Fock Kui Kan, Ai Ling Lim, and Teo Ming Ting. 2013. Poverty: its causes and solutions. International Journal of Humanities and Social Sciences, 7, 8, 2471–2479.
[7] Rense Nieuwenhuis, Teresa Munzi, Jörg Neugschwender, Heba Omar, and Flaviana Palmisano. 2019. Gender equality and poverty are intrinsically linked: a contribution to the continued monitoring of selected sustainable development goals. Tech. rep. LIS Working Paper Series.
[8] Shah O. and Tallam K. 2023. Novel machine learning approach for predicting poverty using temperature and remote sensing data in Ethiopia. arXiv preprint arXiv:2302.14835.
[9] F. Pedregosa et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[10] Mubaraq Dele Sulaimon. 2020. Multidimensional poverty and its determinants: empirical evidence from Nigeria.
[11] Aziza Usmanova, Ahmed Aziz, Dilshodjon Rakhmonov, and Walid Osamy. 2022. Utilities of artificial intelligence in poverty prediction: a review. Sustainability, 14, 21, 14238.
[12] Huang Zixi. 2021. Poverty prediction through machine learning. In 2021 2nd International Conference on E-Commerce and Internet Technology (ECIT). IEEE, 314–324.

Fact Manipulation in News: LLM-Driven Synthesis and Evaluation of Fake News Annotation

Luka Golob, Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, lukag26@gmail.com
Abdul Sittar, Jožef Stefan Institute and Jožef Stefan Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, abdul.sittar@ijs.si

Abstract
Advancements in artificial intelligence and increased internet accessibility have made it simpler to create and disseminate fake news with customized content. However, they have also improved the ability to analyze and identify such misinformation. To effectively train high-performance models, we require high-quality, up-to-date training datasets. This article delves into the potential for generating fake news through factual modifications of articles. This is facilitated by prompt-based content generated by large language models (LLMs), which can identify and manipulate facts. We outline our methodology, highlighting both the capabilities and limitations of this approach. Additionally, this effort has resulted in new quality synthetic data that can be incorporated into the standard FA-KES dataset.

Keywords
fake news, synthetic data, fact extraction, fact verification, large language models

1 Introduction
Synthetic data refers to artificially generated data that is not obtained by direct measurement or observation of real-world events. Instead, it is created using algorithms and simulations. The primary purpose of synthetic data is to provide a realistic alternative to real data for various use cases, such as training machine learning models, testing systems, ensuring data privacy, and more.

We will generate synthetic data from news articles. By making sure that the information in the news is changed, we can safely call the result fake news. In this article, fake news will denote articles that are intentionally and verifiably false [4]. Synthetic data enhances model training by providing additional examples to supplement scarce labeled datasets and allows for privacy-conscious testing without real content manipulation. It enables adaptability to evolving fake news tactics by simulating diverse scenarios from the newest data, thereby improving the robustness and resilience of detection algorithms [3].

Large language models (LLMs) have made a huge difference in the world of news. Fake news is now much easier and cheaper to construct, but we also have additional methods to help us tackle its spread. Numerous articles have appeared trying to partake in this effort. The following are the main scientific contributions of this paper:
(1) A methodology to create synthetic data for fake news using LLMs.
(2) The use of this methodology to adapt the FA-KES dataset with 100 additional synthetic fake news articles (https://github.com/golobluka/Fake-news-generation-from-FA-KES-dataset).

In Section 2, we discuss work that is closely related to our task. Section 3 then outlines the methodology for generating synthetic fake news, culminating in Section 4, where we present the results and introduce some modifications to the methodology. Finally, in Section 5, challenges, capabilities, and potential improvements are considered.

2 Related Work
A wide range of approaches to generating fake synthetic news with LLMs has been developed. In [8], the authors generated huge amounts of fake news and categorized them into multiple categories. LLMs can generate fake news by altering the style to mimic credible sources or using sensationalism to influence perception. They can subtly manipulate content to be perceived as true, blend real and fabricated information to exploit cognitive biases, or create convincing fictional narratives.

In general, when making a dataset, we want a diverse distribution of fake examples.
In our case, we will focus on one type of change, which comes under the umbrella of Content Manipulation. Similar news manipulations can be seen in [7], where the authors use two main techniques. The first extracts a summary from the original text, which preserves the main content and is then changed to produce a fake article. The second asks a question about the article and changes the content of its answer to construct a new article. Our approach is similar in nature to the Question-Answer framework.

Many articles provide fake news detection models made using synthetic data. Most popular are deep neural networks such as BERT [1], but there are also other fact-based approaches for fake news labeling, as in [3]. In [2], GPT4-turbo was used for prompt-driven fake news detection.

3 Methodology
The methodology is divided into four conceptual steps: data collection, characterization of facts, fact extraction, and fact manipulation, as presented in Figure 1.

3.1 Data Collection
The publicly available FA-KES dataset [5], focused on the Syrian war, addresses the deficiency of manually labeled datasets in this domain of news data. It comprises 804 articles sourced from various media outlets. We used 426 articles that were manually labeled as authentic news, but we could just as well use the other (fake) articles.
Figure 1: A methodology to generate synthetic data for fake news detection. The scheme shows the four steps: data collection (articles containing textual and statistical facts), characterization of facts (the seven factual categories listed in Section 3.2), fact extraction (e.g., Name of casualty: Civilians; Cause of death: shelling; Actor: rebels, forces; Place of death: Airbase; Date of death: April 7, 2017), and fact manipulation (each extracted value replaced by a manipulated fact).

3.2 Characterization of Facts
While making the FA-KES dataset, its authors created seven factual categories: (1) Name of casualty or group, (2) Gender or age group, (3) Cause of death, (4) Type, (5) Actor, (6) Place of death, and (7) Date of death.

It is crucial to note that all articles have a similar structure, describing war incidents. This allows us to establish a consistent framework of facts, such as actor and casualty details. We stick to those facts, but generate them differently, employing LLM capabilities for faster and cheaper execution, albeit with a slight reduction in reliability.

3.3 Fact Extraction
We extract facts by constructing prompts for LLMs. The first approach was a few-shot prompt, which gives some examples of the expected output. Later we constructed an additional approach: say we are extracting the fact Place of death with this second technique. We give a detailed description of what should be extracted, and the LLM then reads the article and performs the task solely on this basis. This description is usually longer and contains more context. The issues with fact extraction in general are:
• Some articles lack certain facts or merely imply them. LLMs can identify this, outputting responses such as "No information."
• Longer articles may contain multiple events, each with distinct data such as dates or casualties. This can be managed by creating separate tables for each event or consolidating all events into a single table with various facts.

3.4 Fact Manipulation and Synthetic News Generation
The objective is to modify relevant information without altering the writing style or topic of the article. For this transformation, we used a chain-of-thought prompt, which, for a given fact: 1) changes the fact to another one with a different meaning, and 2) generates a new article based on the altered facts. By changing one fact at a time, quality is improved compared to altering multiple facts simultaneously, as a single fact creates a clearer chain of instructions. LLMs such as Llama3.1:8B often struggle with precise changes in the article, such as modifying implicit references or incorporating new facts. Quality can be improved by carefully adjusting the prompt content.

LLMs are also exceptional at summarization and paraphrasing. Both are used simultaneously with changing the facts. The problem is that we aim to maintain the extracted facts when summarizing. This is not crucial, however, as summarization usually yields better results than article generation.
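A sketch of such a two-step chain-of-thought prompt, assuming the ollama Python client with a locally pulled llama3.1:8b model; the prompt wording and the function name are illustrative, not the exact prompts used in the experiments:

```python
import ollama  # assumes a running local Ollama server

MANIPULATION_PROMPT = """You are given a news article and one extracted fact.
Step 1: Replace the fact below with a plausible fact of the same type
but with a different meaning.
Step 2: Rewrite the article so that it is consistent with the new fact,
keeping the writing style, length and all other facts unchanged.

Fact ({fact_name}): {fact_value}

Article:
{article}
"""

def manipulate_fact(article: str, fact_name: str, fact_value: str) -> str:
    """Return an article rewritten around one manipulated fact."""
    response = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": MANIPULATION_PROMPT.format(
            fact_name=fact_name, fact_value=fact_value, article=article)}],
    )
    return response["message"]["content"]
```

Applying this function once per selected fact keeps each manipulation as its own clear chain of instructions, which is the rationale given above for changing one fact at a time.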
3.5 Fake News Annotation and Fact Verification
After we have generated the fake articles, we can label the data as "fake" or "non-fake" based on a comparison with the extracted facts. We performed this labeling with various models and compared the labeling performance to get the best model; in this experiment we decided on Llama3.1. To do the labeling, we perform fact verification [4]. The fact verification task in general is deciding whether a claim is correct, based on the explicitly available evidence, such as Wikipedia articles or research papers. We have the extracted fact, which is compared to the article content. The question thus becomes: do these facts appear in the given article? This approach emphasizes factual content rather than the overall sentiment of the article.

There are two primary types of prompts: 1) direct prompts that present the article and a table of facts, asking if the facts relate to the article; 2) structured prompts that inquire about the correspondence of one fact at a time with the article. The question is: does this fact correspond to the content of the article? This method combines individual results into an aggregated score. Say the Place of death is characterized as Idlib and Daraa provinces. Then the question posed to the LLM is of the form:

Read the article and understand its places of death. Do Idlib and Daraa provinces "really correspond" to places of death in the article?

We are not as interested in the labeling itself as in the quality of the produced synthetic fake news. For this purpose, we also use fact verification in a slightly different way: we ask the LLM whether the factual changes in the fake news were really made as they were supposed to be. A similar method is used in [7].
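The structured, one-fact-at-a-time verification prompt and its aggregation into a score might look as follows (again a sketch under the same ollama assumption; the YES/NO protocol and the scoring function are illustrative):

```python
import ollama  # assumes the same local Ollama setup as above

VERIFY_PROMPT = """Read the article, then answer with a single word, YES or NO.
Does the following fact really correspond to the content of the article?

Fact ({fact_name}): {fact_value}

Article:
{article}
"""

def verification_score(article: str, facts: dict[str, str]) -> float:
    """Fraction of facts the LLM judges to correspond to the article."""
    hits = 0
    for name, value in facts.items():
        answer = ollama.chat(
            model="llama3.1:8b",
            messages=[{"role": "user", "content": VERIFY_PROMPT.format(
                fact_name=name, fact_value=value, article=article)}],
        )["message"]["content"]
        hits += answer.strip().upper().startswith("YES")
    return hits / len(facts)
```

Running the score once with the original facts and once with the manipulated ones indicates whether the intended changes were actually carried into the generated article.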
4 Experimentation and Results

4.1 Experimental settings
We selected the 426 articles labeled as authentic news from the FA-KES dataset. Facts were then extracted and transformed, as described in the previous section. At first, two basic approaches were used to randomly choose 70 news articles and transform them. Afterward, we used the labeling procedure to compare performance, resulting in Table 1. Based on the results, we then composed the final algorithm, which was manually evaluated.

4.2 Evaluation
For every experiment, we first manually checked a minimum of 10 percent of random examples to get an overview of how well the LLM was able to do the job. It is quite useful to print text that represents the decision-making procedure the LLM undertakes when challenged with the task. It was even helpful to see the LLM's generated thinking procedure, as this gives valuable insight into what is going on "under the hood". We believe that manual fact-checking is the first and most crucial step in generating good prompts. Based on the observed fallacies, one can then adjust the prompt content. To shed some light on this procedure, we give the following overview.

Table 1: Comparison of fake synthetic data.
Type of data   | Number of facts manipulated | Precision | Recall | F1   | Accuracy
Summarization  | 2/7                         | 0.74      | 0.63   | 0.68 | 0.71
Detailed facts | 2/7                         | 0.70      | 0.80   | 0.75 | 0.73

4.3 Fact Extraction Results
LLMs are capable of recognizing different topics and extracting words that correspond to a given topic, and also of noting when a fact is not mentioned. At first, we extracted short phrases, as represented in Figure 2.

Name of casualty or group: Members of Nusra Front
Gender or age group: Adults (no specific age mentioned)
Cause of death: Explosion at a mosque
Type: Non-civilian (militants)
Actor: Unknown (no group claimed responsibility, but supporters blamed ISIS)
Place of death: Ariha, Idlib province, Syria
Date of death: Not specified in the article

Figure 2: Example of fact extraction.

The issue begins with nuances. For example, in many articles the Actor is only suspected but not known. In some cases, actor and causality are not precisely distinguished. This usually leaves the LLM to some kind of arbitrariness. For this purpose, we also added a longer description that better captures the nuanced subtleties related to the facts. This can also be seen in Table 1, which shows the results for short (normal) and detailed extracted facts. The recall is far worse in the case of short prompts. This likely means that there is an abundance of false negatives, which results from the fact that the labeling does not manage to match true articles to their corresponding short facts.

The shorter extracted facts are often not comprehensive. For example, under the label Type (which classifies civilian or non-civilian casualties) the model writes only "civilians", even though contextual understanding also includes some non-civilian casualties.
Overall, the most important insight remains: fact extraction has better quality than article generation.

4.4 Quality and Coherence of Synthetically Generated Fake News
The LLM can detect (for example) the Actor of some attack in the news, and it is then mostly able to change every occurrence of this Actor to another Actor. But if we would like to preserve the full coherence of the article, much more would need to be done. News usually contains background information that provides context for the incident. Our algorithms failed to properly adjust this context, leaving it unchanged in most cases. Our fake news thus fails to preserve enough coherence to be trusted by a skeptical reader who tries to connect the background material to the event in the article.

Generating false text while maintaining coherence is challenging for an LLM. In this task, we have changed one fact: for example, the Place of death may be changed to another city or neighborhood. This fact must then be changed throughout the article while maintaining the other factual information. The main issues are:
• In the beginning, some facts did not get changed, or the facts were altogether removed from the article. We managed to reduce this error by adjusting the prompt. It is difficult to adjust all occurrences of a fact, especially if it is only implied and not explicitly stated. We managed to minimize this problem with a method shown in Section 4.5.
• What remains is the problem of wider context. Suppose we change the town of the incident; then we must change the name of the neighborhood accordingly. The LLM usually fails at this, leaving the article inconsistent, which is a widespread problem.
• The LLM sometimes refuses to output the content because it judges it harmful, or does not want to produce articles that could be used with illegal intent. This was quite a common problem, which is also reasonable given the violent content of the articles and the possible abuse of LLM-generated content. The best way to prevent this error is to use an uncensored LLM; in other cases, one can adjust the prompts by removing suspicious words like "fake news".
• The generated article was shorter, skipping original text that was not linked to the extracted facts. This problem was reduced but still exists in long articles.
• If a fact is not present in the article, it is hard for the LLM to incorporate a new fictitious fact into the text. Mainly it just adds the information in separate sentences.
• When we change facts, traces of the old facts still persist. This is especially common in complicated articles with diverse structures.
• Sometimes the change does not bring about any additional meaning. For example, the LLM might change previously unknown casualties and designate them as civilians. They were implied to be civilians all along, so this makes only a minor change and is not really fake.

4.5 Fact Verification with LLMs
Recall that in this task, the prompt asks: does this fact "really correspond" to the content of the article? Performance largely depends on how the model interprets the phrase "really correspond". Words have many nuances: different words can have different meanings, which can complicate labeling. To simplify: we can be stricter, in the sense that words must be the same in the literal sense, or we can rely on similarity of meaning [6]. Based on our goal of creating fake news, it is best to focus on meaning and not on concrete words.
Some common problems are:
• Sometimes the fact is changed, but the LLM skeptically assumes that the two names refer to the same group.
• In longer articles with many events, the names get changed only in some events (usually at the beginning of the article). In this case, the LLM can make unwanted predictions, labeling the fact as true rather than false.

Manual checking shows that labeling is more accurate than the generation of fake news. This leads us to use labeling as a means to improve article generation.

Table 1 was used to compare different ways of generating fake news. It shows the two best datasets, which contain true articles and their false twins, generated in two ways:
(1) fake news generated by "standard" fact extraction, with additional summarization;
(2) fake news generated by "detailed" fact extraction, with additional paraphrasing of the article.

In this experiment, instead of merely categorizing the articles as true or false, the results shown in Table 1 reflect how well the generation process aligns with fact verification.

Low precision in the row with detailed facts led us to detect articles that were not changed. We implemented a strategy where labeling was applied after generating the fake articles to assess the quality of the generation. LLMs often provide incomplete responses and struggle to correct them directly. By introducing an additional verification step, we were able to enhance the overall accuracy of the results.

4.6 Final Dataset Description
In the end, we constructed 100 fake news articles based on the prior experiments; they can be found on GitHub (https://github.com/golobluka/Fake-news-generation-from-FA-KES-dataset). In every article we randomly chose three facts and changed them. Afterward, we carefully went through 10 examples, which are also present on GitHub, while here we present only the main points:
• Fact verification improved quality by making sure that the synthetic fake article really incorporated the new information. More than 90% of new facts really got incorporated into the articles. Sometimes new information is only added as additional text (and does not seriously change the main topic).
• A fact is not always incorporated in all places where it is referenced, which leads to inconsistencies. The new article is then a blend of old and new information.
• There are problems with "detailed" prompts. Containing more information results in contradictions, as we change only one fact at a time.

5 Conclusion
In this article, we focused on exploring the potential of LLMs in fact extraction and the generation of fake news. Our motivation was primarily to understand how accurate LLMs are in fact extraction and how reliably LLMs generate synthetic news by altering facts. As a result of our experiments, we have generated 100 synthetic news articles by randomly transforming three out of seven facts, and we have performed a manual evaluation to observe the quality of the generated news dataset.

5.1 Problems, Capabilities and Possible Improvements
• At this stage, LLMs like Llama3.1:8B are not able to coherently change certain facts of news articles. Changing facts can distort the article content, which appears to be extremely hard to manage. This normally does not happen for manageable data such as dates (changing the time of some event), but it does for much more involved facts such as the actors of the attack in the article. Even so, the synthetic fake news provides valuable information.
• We did not use a model that has additional information about the news content. Providing additional context would likely have a beneficial effect on all the processes.
• In our case, facts were largely dependent on each other. For example, Gender or age group is an extraction of Name of casualty or group. We think it is best if such dependencies are removed, because they lead to inconsistencies when changing facts. An additional solution would be to change Gender or age group whenever Name of casualty or group is changed.
• Fact extraction is close to human-like quality. The issue is that, besides manual checking, it is hard to find a good measure of the quality of extracted facts.
• Detection of changed facts is similar in quality to extraction of facts (this is not surprising, since they are based on the same skill). Because of the diversity of meanings in language, it is hard to specify the exact reasoning procedure of LLMs, and many mistakes come from this kind of miscommunication.

6 Acknowledgments
This work was supported by the European Union through the AI4Gov (101094905) and TWON (101095095) EU HE projects and the Slovenian national grant (CRP V2-2272).

References
[1] Nicola Capuano, Giuseppe Fenza, Vincenzo Loia, and Francesco David Nota. 2023. Content-based fake news detection with machine and deep learning: a systematic review. Neurocomputing, 530, 91–103. doi: 10.1016/j.neucom.2023.02.005.
[2] Fredrik Jurgell and Theodor Borgman. 2024. Fake news detection: using a large language model for accessible solutions.
[3] Ye Liu, Jiajun Zhu, Kai Zhang, Haoyu Tang, Yanghai Zhang, Xukai Liu, Qi Liu, and Enhong Chen. 2024. Detect, investigate, judge and determine: a novel LLM-based framework for few-shot fake news detection. arXiv: 2407.08952 [cs.CL]. https://arxiv.org/abs/2407.08952
[4] Taichi Murayama. 2021. Dataset of fake news detection and fact verification: a survey. arXiv: 2111.03299 [cs.LG]. https://arxiv.org/abs/2111.03299
[5] Fatima K. Abu Salem, Roaa Al Feel, Shady Elbassuoni, Mohamad Jaber, and May Farah. 2019. FA-KES: a fake news dataset around the Syrian war. In Proceedings of the International AAAI Conference on Web and Social Media. Vol. 13, 573–582.
[6] Abdul Sittar, Dunja Mladenic, and Tomaž Erjavec. 2020. A dataset for information spreading over the news. In Proceedings of the 23rd International Multiconference Information Society SiKDD. Vol. 100, 5–8.
[7] Yanshen Sun, Jianfeng He, Limeng Cui, Shuo Lei, and Chang-Tien Lu. 2024. Exploring the deceptive power of LLM-generated fake news: a study of real-world detection challenges. arXiv: 2403.18249 [cs.CL]. https://arxiv.org/abs/2403.18249
[8] Lionel Z. Wang, Yiming Ma, Renfei Gao, Beichen Guo, Zhuoran Li, Han Zhu, Wenqi Fan, Zexin Lu, and Ka Chung Ng. 2024. MegaFake: a theory-driven dataset of fake news generated by large language models. arXiv: 2408.11871 [cs.CL]. https://arxiv.org/abs/2408.11871

Borrowing Words: Transfer Learning for Reported Speech Detection in Slovenian News Texts

Zoran Fijavž, Jožef Stefan International Postgraduate School; Peace Institute, Ljubljana, Slovenia, zoran.fijavz@mirovni-institut.si

Abstract
This paper describes the development of a reported speech classifier for Slovenian news texts using transfer learning. Due to a lack of Slovenian training data, multilingual models were trained on English and German reported speech datasets, reaching an F-score of 66.8 on a small manually annotated Slovenian news dataset, and a manual error analysis was performed. While the developed model captures many aspects of reported speech, further refinement and annotated data would be needed to reliably predict less frequent instances, such as indirect speech and nominalizations.

Keywords
reported speech, natural language processing, transfer learning, news analysis

1 Introduction
Reported speech, ubiquitous in literary and news texts, has clear lexical and syntactic patterns which may be reliably modeled via natural language processing (NLP) and may be useful for downstream tasks by drawing a distinction between source and background information. This paper applies transfer learning to extend reported speech classification to Slovenian news texts and provides a provisional classification model. A manual error analysis reveals the model's strengths and weaknesses, highlighting possible steps for further improvements.

2 Related Work

2.1 Role of Reported Speech
Reported speech is common in news texts, generally expressed as direct or indirect speech, with the former repeating the original utterance verbatim and the latter embedding it in a that-clause [18] (e.g., Jimmy said: "Another systematic review would be great!" versus Jimmy said that another systematic review would be great.). More complex forms include mixed speech (City officials rebuffed the accusations as "groundless and blatantly false".) and reportative nominalizations with a function analogous to reported speech (The speaker particularly emphasized the pressures on the media) [7]. Around 50% of sentences in newspaper corpora may be attributed to a source in the text, predominantly through direct and indirect speech [17]. Verbs cue 96% of reported speech, followed by prepositional phrases (3%) [13]. Reported speech lends objectivity to statements [9], summarizes source statements [16], and is used in discourse analysis and communication studies to explore speaker representation by gender [1], institutional affiliations [8], and topic stances [15], or to distinguish between journalists' and sources' voices [11].

2.2 Existing Datasets and Modelling Approaches
Datasets with reported speech annotations mostly contain literary or news texts. Key corpora include RiQuA [12], SLäNDa 2.0 [19], Redewiedergabe [3], QUAC [14], PolNeAR [10], Quotebank [21], and STOP [22]. RiQuA and Redewiedergabe are the largest annotated corpora, covering English and German 19th-century texts. QUAC contains 212 annotated articles from the Portuguese newspaper Público, while Quotebank spans 162 million news articles with automatic annotations. PolNeAR, consisting of 1,028 news articles, includes attribution annotations, which include and exceed the definition of reported speech. A summary of the datasets is provided in Table 1.

The corpora differ in annotation complexity and size. They are mostly monolingual, which warrants the cross-lingual transfer learning used here for low-resource languages, employing multilingual models such as mBERT [6] and XLM-R [4]. Narrower multilingual models, such as CroSloEngual BERT, often outperform broader ones [20]. Reported speech modeling may be operationalized as a speaker or quotation detection task [23, 17]. Simplifying the task to sentence-level classification is warranted by the fact that news texts (unlike literary ones) rarely mix statements by sources and authors in the same sentence; it can improve classification reliability at the expense of detailed aspects of reported speech [17] and simplifies the annotation structure.
Missing fine-grained outputs, such as speakers and the boundaries of reported and reporting clauses, may thus be an acceptable trade-off for NLP-based content analysis of news texts. A systematic review of such approaches points to the limits resulting from a low number of features with no guarantee of reliable (joint) prediction, which preclude drawing the rich conclusions expected from the method's manual counterpart [2].

3 Experimental Setting

3.1 Task Overview
We treated reported speech detection as a sentence-level classification task. Sentence splitters were applied to the existing datasets, and binary labels were assigned by matching annotated spans with the split sentences. Reported speech sub-types were unified under a single label, joining the annotation schemes of the individual datasets. A Slovenian dataset of 10 news texts was manually annotated at the sentence level. The datasets were split into training, evaluation, and test sets to train multilingual pretrained models. For CroSloEngual BERT, preprocessing also involved machine translating the German training data into English. The model outputs were binary labels indicating reported speech, used to calculate F-scores on the test data. A manual error analysis was performed on the best model's outputs for Slovenian. The preprocessing, training, and evaluation steps are visualized in Figure 1.
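A minimal sketch of this label construction step, assuming character-offset span annotations and NLTK's sentence splitter (the splitter choice and function name are ours; any sentence splitter would do):

```python
import nltk  # nltk.download("punkt") may be required on first use

def sentence_labels(text: str, spans: list[tuple[int, int]]):
    """Assign 1 to every sentence overlapping an annotated reported-speech span."""
    labels = []
    offset = 0
    for sent in nltk.sent_tokenize(text):
        start = text.index(sent, offset)
        end = start + len(sent)
        offset = end
        # Positive if any annotated (start, end) span overlaps this sentence.
        overlaps = any(s < end and e > start for s, e in spans)
        labels.append((sent, int(overlaps)))
    return labels
```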
Table 1: Summary of Datasets' Characteristics.
Corpus | Type | Annotations | Language | Sentence No. | Role | Positive Class
RiQuA | fiction | direct and indirect speech, cues, speakers, addressees | English | 38,610 | 72% train, 18% development, 10% test | 48%
Redewiedergabe | fiction, news | direct, indirect, free indirect and reported speech, speaker, cues | German | 24,033 | 76% train, 16% development, 9% test | 33%
Quotebank (manual) | news | speaker, direct speech | English | 9,071 | test | 30%
QUAC | news | speaker, direct speech | Portuguese | 11,007 | test | 11%
PolNeAR | news | speaker, cues, attributions | English | 34,153 | test | 59%
Slovenian parliamentary news | news | sentence-level binary labels | Slovenian | 744 | test | 43%

Figure 1: Flowchart of Data Preprocessing, Model Training and Evaluation Processes for Sentence-Level Reported Speech Classification.

3.2 Training and Test Data
Our experiments were based on existing annotated reported speech datasets and a small Slovenian dataset. The training data included sections from RiQuA and Redewiedergabe, both large datasets with labels for direct and indirect speech. For CroSloEngual BERT training, the Redewiedergabe data was machine translated into English. Testing was conducted on the test sections of RiQuA and Redewiedergabe, the entire Portuguese corpus QUAC, and the manually annotated portion of the English Quotebank corpus. Additionally, we manually annotated 10 Slovenian news articles from RTV Slovenia. The datasets are summarized in Table 1.

The Slovenian dataset comprised 10 parliamentary news texts covering various reporting strategies. The retrieved articles were split into sentences and annotated. Sentences were considered reported speech if they included direct or indirect speech cued by a reporting clause or prepositional phrase. We excluded nominalizations and phrasal quotes (e.g., They emphasized the pressures on the media and the "illegal non-funding of the Press Agency.") as well as implied quotes (e.g., There will be more than 300,000 recipients, he emphasized. 169 million euros will have to be paid out.).

3.3 Evaluation Procedure
The models' performance on the test datasets was measured with the F-score. A baseline of assigning a positive label to all examples was calculated for all test datasets. The models' results on the test datasets were compared with a Friedman test, as suggested in the literature [5].

The best Slovenian model's predictions were reviewed through close reading. The error typology consisted of direct speech, indirect speech, speech fragments, annotation errors, and unrelated and other tags. Direct speech fragments were sentences that are part of multi-sentence direct speech quotations. Annotation errors were examples with annotations inconsistent with the definition described in Section 3.2. For unrelated examples, close reading revealed no clear misclassification cause. Other was used for examples that did not fit any of the mentioned categories.
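The comparison can be computed with scipy's friedmanchisquare. The sketch below compares three of the models across the six test datasets, using the scores later reported in Table 2; it illustrates the mechanics only and is not the exact grouping behind the statistic reported in Section 4.1:

```python
from scipy.stats import friedmanchisquare

# F-scores across the six test datasets (Redewiedergabe, RiQuA, PolNeAR,
# QUAC, Quotebank, Slovenian), one sequence per model, taken from Table 2.
mbert_both = [77.5, 77.4, 73.1, 40.5, 53.5, 63.2]
xlmr_both  = [80.5, 77.6, 70.0, 38.8, 57.7, 63.2]
cse_both   = [54.0, 76.6, 73.0, 24.0, 52.5, 66.8]

# Each argument is one model; entries are matched by test dataset.
stat, p = friedmanchisquare(mbert_both, xlmr_both, cse_both)
print(f"Friedman chi-squared = {stat:.2f}, p = {p:.3f}")
```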
3.4 Training Settings
XLM-R and mBERT were used as base models with the default training settings from the transformers library, with the exception of using 16 gradient accumulation steps and freezing the bottom 8 layers of all models. The latter reduces the training time without significant performance drops (Kovaleva et al., 2019; Merchant et al., 2020). Additionally, a Slovenian-Croatian-English BERT model (CroSloEngual BERT) was trained on the English machine-translated data from Redewiedergabe.
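A sketch of this training configuration for one of the base models; the model name follows the standard transformers naming, while train_ds and dev_ds stand for already tokenized datasets (our assumptions):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)  # binary label: reported speech or not

# Freeze the bottom 8 encoder layers, as described above.
for layer in model.roberta.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

args = TrainingArguments(
    output_dir="reported-speech-classifier",
    gradient_accumulation_steps=16,  # the other stated deviation from defaults
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=dev_ds)
trainer.train()
```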
4.2 Error Analysis Results
The results from CroSloEngual BERT on Slovenian data were analyzed further. False positives were more common than false negatives, representing 23.4% and 9.8% of all examples (n = 744), respectively. Close reading of a sample of 100 false positives did not show a definite pattern for most (72.9%) of them. These examples were clearly unrelated to reported speech, although some did include words lexically related to reporting verbs (e.g., The proposed law is still under discussion). The second category of false positives were nominalizations of reported statements (13.1%), not included in our annotation schema. The final source of false positives were annotation errors consisting of wrongly unmarked examples of direct or indirect speech (9.1%). The distribution of categories identified in the sample of false positives is illustrated in Figure 2.

The most common errors among the 73 false negative examples were instances of indirect speech (34.2% of false negatives) and prepositional cueing of reported speech (27.4%). The remainder were instances of direct speech, direct speech fragments, and annotation errors, representing 11%, 8.2%, and 9.6% of the false negatives, respectively. The annotation errors included nominalizations and statements reported as adjective complements (The speaker was happy that the provisions were accepted), not included in our annotation schema. Figure 3 summarizes the identified false negative categories.

Figure 2: False Positives from the CroSloEngual BERT Classifier.

Figure 3: False Negatives from the CroSloEngual BERT Classifier.

5 Discussion
This paper presents the development of a reported speech classifier, tested on a small annotated Slovenian dataset with a manual error analysis. Cross-lingual transfer learning from the annotated RiQuA and Redewiedergabe datasets achieved an F-score of 66.8 on a small manually annotated dataset of Slovenian news on parliamentary sessions, using the base CroSloEngual model with RiQuA and English machine-translated Redewiedergabe training data¹. These results corroborate the observation that language models trained on a limited number of languages may outperform less specialized ones such as mBERT and XLM-R [20].

The major source of errors were false positives (23.4% of all sentences), for which no systematic pattern was discernible in the majority (72.9%) of examples. Instances of indirect speech and prepositional cueing of statements were overrepresented among the false negatives, accounting for 61.6% of them. Although rare, nominalizations were present in both false positives and false negatives and should be considered in future annotation guidelines. These observations indicate that reported speech classifiers may benefit from approaches for addressing imbalanced classes.

¹ The fine-tuned CSE model is available on the Hugging Face Hub under the name zo-fi/rep-sp-CSE-rwg-riq.

6 Conclusion
This study developed a sentence-level reported speech classifier for Slovenian news texts using cross-lingual transfer learning. By leveraging existing multilingual models (mBERT, XLM-R, and CroSloEngual BERT) with the English and German datasets RiQuA and Redewiedergabe, we demonstrated that sentence-level classification can detect some aspects of reported speech in Slovenian. However, the performance estimates are limited due to the small size of the Slovenian testing set and the limited definition used for the annotations. Future research should focus on developing a Slovenian annotated dataset, refining the annotation schema for multiple use cases, and exploring additional modeling features such as encoding broader sentence contexts. This work contributes a provisional tool for computational discourse analysis of Slovenian media texts. Further development is necessary for its application in more nuanced tasks.

Acknowledgements
This work was supported by the Slovenian Research Agency grants via the core research programs Equality and Human Rights in the Times of Global Governance (P5-0413) and Hate Speech in Contemporary Conceptualizations of Nationalism, Racism, Gender and Migration (J5-3102).
References
[1] Fatemeh Torabi Asr, Mohammad Mazraeh, Alexandre Lopes, Vasundhara Gautam, Junette Gonzales, Prashanth Rao, and Maite Taboada. 2021. The Gender Gap Tracker: Using Natural Language Processing to measure gender bias in media. PLOS ONE, 16, 1, e0245533. doi: 10.1371/journal.pone.0245533.
[2] Christian Baden, Christian Pipal, Martijn Schoonvelde, and Mariken A. C. G. van der Velden. 2022. Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda. Communication Methods and Measures, 16, 1, 1-18. doi: 10.1080/19312458.2021.2015574.
[3] Annelen Brunner, Stefan Engelberg, Fotis Jannidis, Ngoc Duyen Tanja Tu, and Lukas Weimer. 2020. Corpus REDEWIEDERGABE. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC 2020). European Language Resources Association, 803-812. https://aclanthology.org/2020.lrec-1.100.
[4] Alexis Conneau et al. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). Association for Computational Linguistics, 8440-8451. doi: 10.18653/v1/2020.acl-main.747.
[5] Janez Demšar. 2006. Statistical Comparisons of Classifiers over Multiple Data Sets. The Journal of Machine Learning Research, 7, 1-30.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4171-4186. doi: 10.18653/v1/N19-1423.
[7] Gabriel Dvoskin. 2020. Reported speech and ideological positions: the social distribution of knowledge and power in media discourse. Bakhtiniana: Revista de Estudos do Discurso, 15, 193-213.
[8] Zoran Fijavž and Darja Fišer. 2021. Citatnost in reprezentacija v spletnem migracijskem diskurzu. In Sociolingvistično iskrenje. Maja Bitenc, Marko Stabej, and Andrejka Žejn, editors. Založba Univerze v Ljubljani. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/259/370/6011.
[9] Elizabeth Holt. 1996. Reporting on Talk: The Use of Direct Reported Speech in Conversation. Research on Language and Social Interaction, 29, 3, 219-245. doi: 10.1207/s15327973rlsi2903_2.
[10] Edward Newell, Drew Margolin, and Derek Ruths. 2018. An Attribution Relations Corpus for Political News. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association. https://aclanthology.org/L18-1524.
[11] Mojca Pajnik and Marko Ribać. 2021. Medijski populizem in afektivno novinarstvo: časopisni komentar o »begunski krizi«. Javnost - The Public. https://www.tandfonline.com/doi/abs/10.1080/13183222.2021.2012943.
[12] Sean Papay and Sebastian Padó. 2020. RiQuA: A Corpus of Rich Quotation Annotation for English Literary Text. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC 2020). European Language Resources Association, 835-841. https://aclanthology.org/2020.lrec-1.104.
[13] Silvia Pareti, Tim O'Keefe, Ioannis Konstas, James R. Curran, and Irena Koprinska. 2013. Automatically Detecting and Attributing Indirect Quotations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). Association for Computational Linguistics, 989-999. https://aclanthology.org/D13-1101.
[14] Marta Ercília Mota Pereira Quintão. 2014. Quotation Attribution for Portuguese News Corpora. https://www.semanticscholar.org/paper/Quotation-Attribution-for-Portuguese-News-Corpora-Quint%C3%A3o/69fea7d030d5e71b973ec67aa897a7c9aadadac2.
[15] Masaki Shibata. 2023. Dialogic Positioning on Pro-Whaling Stance: A Case Study of Reported Speech in Japanese Whaling News. Japanese Studies, 43, 1, 71-90. doi: 10.1080/10371397.2023.2191839.
[16] Michael Short. 1988. Speech presentation, the novel and the press. In The Taming of the Text. Willie Van Peer, editor. Routledge.
[17] Alexander Spangher, Nanyun Peng, Jonathan May, and Emilio Ferrara. 2023. Identifying Informational Sources in News Articles. doi: 10.48550/ARXIV.2305.14904.
[18] Stef Spronck and Daniela Casartelli. 2021. In a manner of speaking: how reported speech may have shaped grammar. Frontiers in Communication, 6, 624486.
[19] Sara Stymne and Carin Östman. 2022. SLäNDa version 2.0: Improved and Extended Annotation of Narrative and Dialogue in Swedish Literature. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022). European Language Resources Association, 5324-5333. https://aclanthology.org/2022.lrec-1.570.
[20] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT. In Text, Speech, and Dialogue (Lecture Notes in Computer Science). Springer International Publishing, Cham, 104-111. doi: 10.1007/978-3-030-58323-1_11.
[21] Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West. 2021. Quotebank: A Corpus of Quotations from a Decade of News. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM '21). ACM, 328-336. doi: 10.1145/3437963.3441760.
[22] M. Wynne. 1996. Speech, Thought and Writing Presentation Corpus. https://ora.ox.ac.uk/objects/uuid:6caa73c1-d283-4d51-a78f-55df69bae986.
[23] Dian Yu, Ben Zhou, and Dong Yu. 2022. End-to-End Chinese Speaker Identification. In Proceedings of NAACL-HLT 2022. Association for Computational Linguistics, 2274-2285. doi: 10.18653/v1/2022.naacl-main.165.

What kind of ESG is profitable? Connecting company performance to ESG terms in financial reports

Luka Andrenšek, Jožef Stefan Institute, Ljubljana, Slovenia (trovato@corporation.com)
Katarina Sitar Šuštar, University of Ljubljana, Ljubljana, Slovenia (katarina.sitar@ef.uni-lj.si)
Senja Pollak, Jožef Stefan Institute, Ljubljana, Slovenia (senja.pollak@ijs.si)
Matthew Purver, Jožef Stefan Institute, Ljubljana, Slovenia (matthew.purver@ijs.si)

ABSTRACT
In this paper, we examine the relationship between the discussion of Environmental, Social and Governance (ESG) in companies' annual financial reports and their financial performance. Specifically, we analyse the companies' use of specific ESG terms alongside the performance metric, sector-normalized Return on Assets (ROA). Our motivation is to determine whether companies frequently mentioning terms such as "gender", "equality", "talent", and "innovation" in their reports demonstrate a higher annual ROA compared to those that rarely use these terms. To explore this, we used existing datasets with reports and performance metrics from 348 companies, covering the years from 2009 to 2021. In order to better examine differences, we then selected companies whose ROA significantly differed from the average (either higher or lower), allowing for a more pronounced examination of the impact of ESG term usage on financial performance. The filtered dataset consisted of 107 companies, with a total of 427 reports, split into two sections representing higher and lower performing companies. We then used an existing list of ESG terms derived from a range of separate data sources, and applied a basic statistical n-gram language model to extract the probabilities of each ESG term's occurrence in each of the higher- and lower-performing dataset sections. Results show that while certain sets of ESG concepts correlate with higher financial performance, others do the opposite; we give some initial interpretation of the light this sheds on company reporting behaviour.

KEYWORDS
financial report analysis, language modelling, environmental, social and governance reporting
1 INTRODUCTION & RELATED WORK
There is increasing interest in the behaviour of companies in the area of Environmental, Social and Governance (ESG) criteria, including a company's environmental impact (Environmental), relationships with the community including employees, suppliers and customers (Social), and leadership structures including executive pay and shareholder rights (Governance). Although until recently ESG analyses were almost entirely performed manually by experts (see e.g. [10]), there has been a large amount of work in the last few years on applying computational machine learning and statistical methods to ESG analysis (see e.g. the recent review by Lim [9]).

However, much of this analysis examines numerical company performance data and categorical metadata; our interest is in developing and applying natural language processing (NLP) technologies to not only help automate analyses, but allow understanding of how human actors discuss and understand the importance and meaning of ESG aspects.

Application of NLP in finance is not new: for example, topic modelling has been used to predict company performance and investigate strategies [14, 7]. Recent work also includes application to ESG aspects: Nugent et al. [12] automatically extract news about ESG controversies, and Lee et al. [8] analyse sentiment on ESG issues. Closer to our interests, Purver et al. [13] investigated how the use of ESG terms by companies has changed over time. By analysing and annotating a set of existing resources, they defined a set of 93 ESG terms categorised into 5 core ESG areas. They then showed how these terms can be used to analyse changes in reporting, by analysing a collection of company annual reports collated over a period of 8 years, using language modelling and distributional methods to reveal changes in the frequency and usage of the ESG terms.

Here, we are interested not in changes in ESG discussion over time, but in whether and how the reporting of ESG aspects is connected to financial performance. We take Purver et al. [13]'s resources and methods as a starting point, but augment the financial report text data with available metadata on financial performance, allowing us to compare how ESG reporting varies between more and less well-performing companies.

2 DATA AND METHODS

2.1 Hypotheses
In general, we expect an increased probability of appearance of ESG terms in the annual reports of the more profitable firms, based on a number of factors. Overall, high ESG performing companies exhibit high financial performance [1, 5], although we note that the link between high ESG score performance and mention of ESG terms is not guaranteed to be straightforward. More specifically, during the period between 2010-2020 analysed here, there was a growing emphasis on corporate social responsibility (CSR) and sustainability. Investors, consumers, and other stakeholders increasingly prioritised companies that demonstrated a commitment to innovation, diversity, and environmental sustainability [11, 2]. Busru and Shanmugasundaram [3] find that firms closely engaging in fostering innovation, attracting top talent, and promoting gender and diversity initiatives could confer a competitive advantage over industry peers. Furthermore, some policy and regulatory changes (e.g. the 2018 UK Corporate Governance Code, the 2014 EU Directive on Non-Financial Reporting, and the Carbon Disclosure Project (CDP)) directly or indirectly encouraged companies to address issues related to diversity, gender equality, and environmental sustainability.
2.2 Data and pre-processing
To test this hypothesis, we build on the resources and methods of Purver et al. [13], who provide a dataset of annual reports from FTSE350 companies over the years 2012-2019, based on the FTSE350 list as of 25th April 2020 and obtained from the publicly accessible collection at www.annualreports.com. The reports are already converted to plain text, and we use their publicly available tools to tokenize the collection into words and build n-grams of length 1-4 padded with sentence start and end symbols; the dataset size is reported in Table 1 below (taken from [13]). We use their set of ESG terms, defined via a process of extracting candidate terms from a set of public ESG definitions and taxonomies, asking financial expert annotators to label them as to their representativeness as ESG terms and their ESG subcategory, and keeping the terms with high inter-annotator agreement (see [13] for details).

Table 1: Number of annual reports available by year.

Year | # Reports | # Words
2012 | 178 | 12.5M
2013 | 181 | 14.0M
2014 | 184 | 15.0M
2015 | 196 | 16.3M
2016 | 198 | 17.5M
2017 | 200 | 18.4M
2018 | 200 | 19.6M
2019 | 202 | 21.2M
total | 1539 | 134.6M

2.3 Financial performance analysis
The reports were then linked to financial indicators for the respective year and company. The data on company fundamentals was obtained from the Refinitiv EIKON Datastream¹. Each entry contained annual financial indicators, as well as the companies' industry and sector codes. The main variable of interest was normalized, averaged return on assets (ROA)², as defined below:

ROA = (NetIncome − BottomLine + (InterestExpenseOnDebt − InterestCapitalized) × (1 − TaxRate)) / AverageOfLastYear'sAndCurrentYear'sTotalAssets

After extracting financial reports with available ROA data, we categorized the financial reports into two groups, in order to examine differences in the associated reports' use of ESG terms. The distribution of ROA shows a heavy concentration around the mean, so in order to derive two distinctive groups we took the two extremes and excluded the central group around the mean. The 'negative' group comprised reports with a yearly ROA less than -0.2, indicating very poor performance. Conversely, the 'positive' group included reports with an ROA of at least 0.2, reflecting very good yearly performance.

Subsequently, we employed a statistical n-gram language model (using NLTK³) to analyze the occurrence of each ESG term. For each term, we calculated the probability of its occurrence in positive reports (p+) and in negative reports (p−), and the difference (p+ − p−). Terms with a large difference in these probabilities are more strongly associated with positive reports than with negative ones, and vice versa: terms with a large negative difference are common in negative reports but rare in positive ones. We conducted this analysis for both unigrams and bigrams.

¹ https://www.refinitiv.com
² We use this normalization and averaging to smooth and remove one-off effects.
³ https://www.nltk.org/
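To make the p+ − p− computation concrete, here is a minimal sketch using NLTK's language-model API, with toy token lists standing in for the tokenized positive and negative report groups (the paper's exact tokenization and padding setup may differ):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

def fit_lm(tokenized_reports, order=2):
    """Fit a plain MLE n-gram model on one group of tokenized reports."""
    train, vocab = padded_everygram_pipeline(order, tokenized_reports)
    lm = MLE(order)
    lm.fit(train, vocab)
    return lm

# Toy stand-ins for the positive- and negative-ROA report groups.
positive = [["innovation", "drives", "our", "growth"],
            ["we", "invest", "in", "talent", "and", "innovation"]]
negative = [["carbon", "emissions", "rose"],
            ["energy", "use", "remains", "high"]]

lm_pos, lm_neg = fit_lm(positive), fit_lm(negative)
# Unigram probability difference p+ - p-; for a bigram such as
# "renewable energy", use lm.score("energy", ["renewable"]) instead.
diff = lm_pos.score("innovation") - lm_neg.score("innovation")
print(f"p+ - p- for 'innovation': {diff:.4f}")
```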
3 RESULTS AND DISCUSSION
The results for 1- and 2-grams are shown in Figures 1 and 2 below⁴ (3- and 4-grams showed no clear interpretable associations). As hypothesized, many ESG terms show a strong association with positive performance, with many of these being core terms associated with human resources (innovation, talent), with social aspects (gender, diversity), environmental aspects (renewable, carbon footprint, environmental impact) and overall ESG descriptors (ethical). However, many terms are conversely (and contrary to our general hypothesis) associated with negative performance, including, again, terms across various ESG categories: environmental (carbon emissions, energy efficiency, greenhouse), human resources (mental health, wellbeing) and general ESG descriptors (governance).

Figure 1: Difference in probability between positive and negative reports, p+ − p−, for the most positive and negative unigram ESG terms.

Figure 2: Difference in probability between positive and negative reports, p+ − p−, for the most positive and negative bigram ESG terms.

However, by combining these terms with recent work in clustering and describing ESG terms [4], we can shed more light on which categories seem to be more positive and which more negative. Ferjancic et al. [4], using the same dataset and ESG term list [13], perform a further topic analysis using BERTopic [6], in which they derive 30 ESG-related topics and 6 higher-level clusters of ESG concepts; they then examine the correlations between these ESG topics and company ESG scores as obtained from external analysts. We align our ESG terms with Ferjancic et al. [4]'s 30 topics by matching against the words most associated with each topic (if a term appears in the top 10 words associated with a topic, we take the term and topic as aligned); we can then compare our positive/negative associations with Ferjancic et al. [4]'s correlations with company ESG scores. Table 2 shows this alignment for our most positive and negative bigram terms, with the topic labels and an indication of the strength and direction of correlation with overall company ESG scores, as given by [4].

Table 2: Selected ESG terms with their ROA correlation direction (+/-), topic according to [4], and topic/ESG score correlation strength (++ / + / = / - / --) as calculated by [4].

Term (2-gram), ROA corr. | Topic | Topic/ESG score correlation
Supply chain (+) | Human rights | ++
Business model (+) | Customer services, People and culture | +; -
Gender balance (+) | Diversity and inclusion | ++
Environmental impact (+) | General ESG | +
Carbon footprint (+) | Climate footprint and energy management | =
Gender pay (+) | Diversity and inclusion | ++
Climate change (+) | Climate risk and policy | ++
Human trafficking (+) | None directly related; in broader context in Human rights | ++
Working environment (+) | People and culture | -
Renewable energy (+) | Climate footprint and energy management | =
Waste management (-) | Waste management | --
Fossil fuels (-) | No explicit match; contextually appears in Climate footprint and energy management | =
Corporate responsibility (-) | Corporate governance | --
Carbon emissions (-) | Climate footprint and energy management | =
Mental health (-) | Health and safety | +
Energy use (-) | Climate footprint and energy management | -
Air quality (-) | No explicit match; contextually appears in Climate footprint and energy management | -
Energy efficiency (-) | Climate footprint and energy management | -
Product safety (-) | Health and safety | =

⁴ Note that these figures show differences in absolute probabilities: magnitudes are comparable within 1-grams, and within 2-grams, but not between 1- and 2-grams.
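The top-10-words matching rule used for Table 2 can be expressed in a few lines; the topic word lists below are hypothetical stand-ins for [4]'s BERTopic output:

```python
def align_terms(terms, topic_top_words, k=10):
    """Align each ESG term with every topic whose top-k words contain it."""
    return {term: [topic for topic, words in topic_top_words.items()
                   if term in words[:k]]
            for term in terms}

# Illustrative topic word lists (not the actual output of [4]).
topics = {"Diversity and inclusion": ["gender", "diversity", "inclusion", "pay"],
          "Waste management": ["waste", "recycling", "landfill"]}
print(align_terms(["gender", "waste"], topics))
# {'gender': ['Diversity and inclusion'], 'waste': ['Waste management']}
```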
Given this, we see some systematic groupings. Climate change, as part of the 'climate risk and policy' topic, as well as supply chain and human trafficking as part of the 'human rights' topic, represent the themes that appear to be, across different industries, related to high company ESG scores. A similar observation holds for gender balance, gender pay and environmental impact, which all fall into a group of topics that are strongly and significantly correlated with high ESG scores throughout different industries. Overall, high ESG performing companies exhibit high financial performance [1, 5]; therefore our results for terms such as climate change, supply chain and human trafficking are not surprising: as indicators of topics associated with high ESG, they are good terms for tracking the ESG aspects associated with high financial performance.

Looking at the terms with low values, which are associated with low RoA: waste management and corporate responsibility are associated with topics whose proportions correlate with ESG scores significantly positively in some industries and significantly negatively in others. Based on the overall correlation between ESG scores and topic proportions across different industries, these two topics are among the third of topics for which a negative correlation between topic proportion and ESG score prevails. Due to the aforementioned correlation between ESG and financial performance, it is therefore understandable that these terms are associated with mention in the annual reports of companies with low RoA. Overly extensive discussion of specific topics (such as 'waste management' and 'corporate responsibility') can negatively impact the ESG score (see [4]), which, by the analogy between ESG and financial performance [1, 5], can hold for companies with low RoA.

There is a surprising number of bigrams in both the high RoA and low RoA groups which seem to be associated with the same topic, namely 'climate footprint and energy management'. For companies with high RoA, these terms are carbon footprint and renewable energy; for companies with low RoA, the terms are fossil fuels, carbon emissions, energy use, air quality and energy efficiency. It seems that better performing companies use carbon footprint instead of carbon emissions, and discuss the use of renewable energy more than energy use, energy efficiency and/or fossil fuels. In future work, we plan to analyse the use of these terms in more depth, including analysis of the lexical and topical contexts in which they appear, and adding techniques such as sentiment and topic analysis to shed more light on these distinctions.

ACKNOWLEDGEMENTS
The authors thank the reviewers for helpful suggestions, and acknowledge financial support from the Slovenian Research Agency for research core funding (No. P2-0103), as well as for funding of the research project Quantitative and qualitative analysis of the unregulated corporate financial reporting (No. J5-2554).

REFERENCES
[1] Nisar Ahmad, Asma Mobarek, and Naheed Nawazesh Roni. 2021. Revisiting the impact of ESG on financial performance of FTSE350 UK firms: static and dynamic panel data analysis. Accounting, Corporate Governance & Business Ethics. doi: 10.1080/23311975.2021.1900500.
[2] A. C. Amason and H. J. Sapienza. 2012. The effects of top management team size and interaction norms on cognitive and affective conflict. Journal of Management, 23, 495-516.
[3] S. A. Busru and G. Shanmugasundaram. 2017. Effects of innovation investment on profitability and moderating role of corporate governance: empirical study of Indian listed firms. Indian Journal of Corporate Governance, 10, 2, 97-117. https://doi.org/10.1177/0974686217730938.
[4] Ursa Ferjancic et al. forthcoming. Textual analysis of corporate sustainability reporting and corporate ESG scores. Under review.
[5] Gunnar Friede, Timo Busch, and Alexander Bassen. 2015. ESG and financial performance: aggregated evidence from more than 2000 empirical studies. Journal of Sustainable Finance & Investment, 5, 4, 210-233. doi: 10.1080/20430795.2015.1118917.
[6] Maarten Grootendorst. 2022. BERTopic: neural topic modeling with a class-based TF-IDF procedure. arXiv: 2203.05794 [cs.CL]. https://arxiv.org/abs/2203.05794.
[7] M. Jagannathan, D. Roy, and V. S. K. Delhi. 2022. Application of NLP-based topic modeling to analyse unstructured text data in annual reports of construction contracting companies. CSI Transactions on ICT, 10, 2, 97-106.
[8] H. Lee, S. H. Lee, K. R. Lee, and J. H. Kim. 2023. ESG discourse analysis through BERTopic: comparing news articles and academic papers. Computers, Materials & Continua, 75, 3, 6023-6037.
[9] Tristan Lim. 2024. Environmental, social, and governance (ESG) and artificial intelligence in finance: state-of-the-art and research takeaways. Artificial Intelligence Review, 57, 76. doi: 10.1007/s10462-024-10708-3.
[10] Steve Lydenberg, Jean Rogers, and David Wood. 2010. From Transparency to Performance: Industry-Based Sustainability Reporting on Key Issues. Tech. rep. Hauser Center for Nonprofit Organizations at Harvard University. Available from https://iri.hks.harvard.edu/links/transparency-performance-industry-based-sustainability-reporting-key-issues.
[11] M. Marzook and B. Al Ahmady. 2022. Linking organisational performance and corporate social responsibility. European Journal of Business and Management Research, 7, 3, 335-343. https://doi.org/10.24018/ejbmr.2022.7.3.1466.
[12] T. Nugent, N. Stelea, and J. L. Leidner. 2020. Detecting ESG topics using domain-specific language models and data augmentation approaches. http://arxiv.org/abs/2010.08319.
[13] Matthew Purver, Matej Martinc, Riste Ichev, Igor Lončarski, Katarina Sitar Šuštar, Aljoša Valentinčič, and Senja Pollak. 2022. Tracking changes in ESG representation: initial investigations in UK annual reports. In Proceedings of the First Computing Social Responsibility Workshop within the 13th Language Resources and Evaluation Conference. Mingyu Wan and Chu-Ren Huang, editors. Marseille, France, 9-14. https://aclanthology.org/2022.csrnlp-1.2.
[14] W. Xu and K. Eguchi. 2021. Topic embedding regression model and its application to financial texts. In Proceedings of the Third Workshop on Financial Technology and Natural Language Processing, 15-21.
Classification of Patents Into Knowledge Fields: Using a Proposed Knowledge Mapping Taxonomy (KnowMap)

Elham Motamedi, University of Primorska, Koper, Slovenia (elham.motamedi@upr.si)
Inna Novalija, Jožef Stefan Institute, Ljubljana, Slovenia (inna.koval@ijs.si)
Luis Rei, Jožef Stefan Institute, Ljubljana, Slovenia (luis.rei@ijs.si)

Abstract
Various platforms, including patent systems and repositories like GitHub and arXiv, support knowledge dissemination across domains. As knowledge increasingly spans multiple disciplines, there is a need to track innovations that intersect various fields. Despite available data, a comprehensive knowledge taxonomy for effectively tracking innovations across domains is lacking. Developing such a taxonomy and employing automated classification methods will enhance the ability to track shared knowledge. In this work, we first developed a knowledge taxonomy based on the CPC schema. We formulated the classification of textual data into the defined knowledge fields as a multi-label problem. Then, we evaluated the effectiveness of the classification models by fine-tuning pre-trained transformer language models. The multi-label framework enables the tracking of knowledge trends at the intersection of various disciplines.

Keywords
Knowledge Taxonomy, Knowledge Tracking, Patent Classification, Hierarchical Classification, Multi-label Classification

Table 1: Example of a sequence of codes across different levels of the CPC hierarchy.

Level | CPC Code | Title
Section | H | Electricity
Class | H03 | Electronic circuitry
Subclass | H03C | Modulation
Group | H03C3/00 | Angle modulation
Subgroup | H03C3/005 | Circuits for asymmetric modulation
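As a simplified illustration of the code structure in Table 1 (not part of the paper's method), a CPC code string can be decomposed into its hierarchy levels as follows:

```python
def cpc_levels(code):
    """Decompose a CPC code such as 'H03C3/005' into its hierarchy levels."""
    group = code.split("/")[0]
    return {
        "section": code[0],      # 'H'    (Electricity)
        "class": code[:3],       # 'H03'  (Electronic circuitry)
        "subclass": code[:4],    # 'H03C' (Modulation)
        "group": f"{group}/00",  # 'H03C3/00' (Angle modulation)
        "subgroup": code,        # 'H03C3/005' (Circuits for asymmetric modulation)
    }

print(cpc_levels("H03C3/005"))
```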
1 Introduction
According to the World Intellectual Property Organisation (WIPO), a patent is an exclusive right granted for an invention, providing legal protection to the inventor while simultaneously benefiting society by making the invention publicly accessible¹. Each year, patent offices receive numerous patent applications that need to be processed [13]. To ensure the novelty of patent applications, inventors should also be able to search existing patents. Organising patents with unique codes in a hierarchical structure aids efficient retrieval and aligns with natural human navigation, starting from broad categories and narrowing down to specifics [21]. Among these hierarchical structures, the CPC system is widely recognised [6]. The CPC codes are organised as a taxonomy, meaning that each entity at a lower level is a detail group of its parent. A patent can be assigned one or more labels by the experts in patent offices [8, 18]. At the first level of the CPC hierarchy there are nine sections, which are divided into classes, subclasses, groups, and subgroups. Each level of this hierarchy can have several codes, ending in approximately 250,000 classification labels [11]. An example of the hierarchical structure of a CPC code is provided in Tab. 1. In this work, we focus on the CPC schema.

The CPC schema's top level has only nine sections, but the number of groups increases substantially at lower levels. In this study, we created a knowledge field taxonomy by merging CPC's detailed classes into a more abstract representation. This taxonomy not only serves as a framework for knowledge representation but also offers a benchmark for patent classification systems. While some studies address the issue of numerous class labels by excluding less-represented classes or truncating hierarchies [24], a consistent benchmark taxonomy has been lacking. Since our proposed knowledge taxonomy aligns with the CPC schema, it is able to provide a benchmark for future studies, facilitating the comparison of different models.

In summary, our paper's contribution is the proposal of a knowledge field taxonomy, KnowMap, which aligns with the widely used CPC schema. KnowMap merges several class labels within the CPC schema based on the scope of the knowledge field and the number of patents associated with each class. The KnowMap taxonomy is available online². In this study, we also performed a classification task to categorise patents into the fine-grained classes defined by our proposed taxonomy.

2 Related Work
Patent documents contain various types of information, including text, diagrams, plots, and references to other patents or scientific publications [20]. The textual content of a patent is divided into several sections, such as the title, abstract, claims, and description [11]. The title and abstract are shorter than the description but still provide relevant information for classification. Li et al. [15] evaluated various lengths of the abstract and title, finding that using the first 100 words of the title and abstract resulted in the best classification performance in their study.

Various classification systems exist for organising patents [6]. Hierarchical representations help organise patents and facilitate efficient searching. Kamateri et al. [11] discussed several potential challenges that artificial intelligence technologies face in patent classification. One such challenge is the extensive number of class labels. As an example, the IPC contains approximately 86,000 classes, while the CPC has around 250,000.

¹ https://www.wipo.int/portal/en/
² https://github.com/elmotamedi/KnowMap-Taxonomy
since every patent can belong to several knowledge fields [18, Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia © 2024 Copyright held by the owner/author(s). 2 https://doi.org/10.70314/is.2024.sikdd.19 https://github.com/elmotamedi/KnowMap- Taxonomy 59 Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia Motamedi et al. 10]. Given the large number of classes at the lowest level of the or higher were considered duplicates. To generate the hash sig-taxonomy tree, the performance of automatic models in predict- natures in MinHash, we used 128 permutations. For the n-gram ing such granular categories is limited. Various models have been representation, we used a range of 1 to 3, incorporating 1-grams, used to classify patents in a multi-label setting, ranging from clas- 2-grams, and 3-grams. sical machine learning models to deep learning models [15, 5, 8]. Several previous studies have focused on higher levels of the 3.2 Refining Hierarchical Structure Through hierarchy, limiting classification to broader categories such as Group Merging sections, classes, or subclasses within the taxonomy [3]. Bekamiri The hierarchical structure of the CPC groups was refined at each et al. [3] fine-tuned the SBERT model to predict labels at the sub-level of the tree. We started with nine sections at the top level (i.e., class level (i.e., 663 class labels) using a multi-label formulation. level 1), which were preserved. At subsequent levels (i.e., level 2 to They achieved F1-score of 66%, outperforming previous studies level 4), groups were merged by manual analysis based on shared that used the same datasets. Aroyehun et al. [1] similarly trun-knowledge and the number of documents. Groups with relatively cated the IPC hierarchy at the subclass level and predicted these few documents (i.e., groups with fewer than 40,000 for level 2, labels by transferring knowledge from two higher levels (section 20,000 for level 3, and 9,000 for level 4) were combined with other and class) to the lower level (subclass), achieving a precision groups at the same level that shared similar knowledge. As an ex- score of 0.53. While it remains valuable for patent office experts ample, at the subclass level of the CPC hierarchy, "A01B" (i.e., Soil to use an automatic model that can narrow down applications to working) and "A01C" (i.e., Planting, Sowing, Fertilising) represent higher levels of the taxonomy tree, this approach has limitations related steps in agricultural practices, as both are foundational and challenges. One such challenge is that the choice of target processes in land preparation and management. We merged them class labels does not depend on the scope of the knowledge area. into a single group labelled "Soil working and planting," resulting More established and expansive areas may benefit from directing in 162,567 patents in this category. The refinement continued experts to detailed groups, while less developed areas may be until the fine-grained classes contained at least 9,000 documents. adequately served by broader classifications. 3.3 Text Classification 3 Methods and Materials We formulated the classification problem as a multi-label problem, In this work, we developed a knowledge taxonomy and classi- in which each document can be assigned to multiple knowledge fied patents into fine-grained classes by fine-tuning pre-trained fields. In this study, we aimed to classify the patents into the fine- models. Below, we outline the methods and materials used. 
3.2 Refining the Hierarchical Structure Through Group Merging
The hierarchical structure of the CPC groups was refined at each level of the tree. We started with nine sections at the top level (i.e., level 1), which were preserved. At subsequent levels (i.e., levels 2 to 4), groups were merged by manual analysis based on shared knowledge and the number of documents. Groups with relatively few documents (i.e., groups with fewer than 40,000 for level 2, 20,000 for level 3, and 9,000 for level 4) were combined with other groups at the same level that shared similar knowledge. As an example, at the subclass level of the CPC hierarchy, "A01B" (i.e., Soil working) and "A01C" (i.e., Planting, Sowing, Fertilising) represent related steps in agricultural practices, as both are foundational processes in land preparation and management. We merged them into a single group labelled "Soil working and planting", resulting in 162,567 patents in this category. The refinement continued until the fine-grained classes contained at least 9,000 documents.

3.3 Text Classification
We formulated the classification problem as a multi-label problem, in which each document can be assigned to multiple knowledge fields. In this study, we aimed to classify the patents into the fine-grained classes at the lowest level of the proposed taxonomy (i.e., 83 classes). To balance performance and computational cost given the large size of the dataset, we used the pre-trained language models distilroberta-base, a distilled version of RoBERTa [16, 19], and all-MiniLM-L6-v2, a version of MiniLM fine-tuned for semantic similarity [22, 17]. The pre-trained models were fine-tuned for the downstream task by adding a classification head. The classification head takes the hidden state of the first token from the model and processes it through a fully connected dense linear layer, followed by a dropout layer for regularisation and a tanh activation function for non-linearity. Since our task is multi-label classification, the output logits for each class are converted into probabilities using a sigmoid function.

For model training, we used a learning rate of 4e-5 with a linear scheduler and a weight decay of 0.1. To prevent overfitting, the best checkpoint was selected based on evaluation metrics on the validation set. We trained the model for up to 5 epochs with early stopping criteria based on validation accuracy. The dataset, consisting of 1,092,991 samples randomly selected after deduplication, was split into training, validation, and test sets with ratios of 0.8, 0.1, and 0.1, respectively. To preserve the ratio of samples per class in the training, validation, and test sets, we used stratified splitting⁵.

3.4 Classification Evaluation
The F1-score is a common metric for classification tasks. We report both Micro-F1, averaged across all instances, and Macro-F1, averaged across all classes.

⁵ https://github.com/trent-b/iterative-stratification
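A minimal sketch of the multi-label setup described in Section 3.3, using the Hugging Face transformers library (whose standard RoBERTa classification head closely matches the dense/dropout/tanh head described above); the input text and the 0.5 decision threshold are assumptions for illustration, and the training loop is omitted:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base",
    num_labels=83,                               # KnowMap leaf classes
    problem_type="multi_label_classification",   # BCE-with-logits loss
)

text = "Title. Abstract. Description ..."        # concatenated patent text
batch = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
probs = torch.sigmoid(logits)                    # one probability per class
predicted = (probs > 0.5).nonzero(as_tuple=True)[1]  # several labels possible
```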
Then, we report the 3 5 https://github.com/google/patents- public- data https://github.com/trent- b/iterative- stratification?tab=readme- ov- file#multilab 4 https://worldwide.espacenet.com/ elstratifiedkfold 60 Classification of Patents Into Knowledge Fields: Using KnowMap Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia Table 3: Classification Results performance of classifiers in categorising patents into the fine- grained classes of this taxonomy. Metric RoBERTa SBERT 4.1 The Proposed Knowledge Mapping Micro-F1 (Val) 0.76 0.76 Taxonomy (KnowMap) Macro-F1 (Val) 0.86 0.86 The taxonomy, along with the associated CPC sections, classes, Micro-F1 (Test) 0.77 0.76 subclasses, groups, and subgroups are provided in the shared Macro-F1 (Test) 0.90 0.90 online source. An example of detailing the knowledge field of soil working and planting within the broader knowledge field of human necessities is illustrated in Fig. 1. 1.0 CPC All groups in A A01 0.8 A01B, A01C A01B, A01C 162,567 docs 162,567 docs KnowMap SOIL WORKING AND SOIL WORKING AND PLANTING PLANTING 0.6 alue HARVESTING AND 1,543,195 docs PRODUCE PROCESSING AGRICULTURE ANIMAL HUSBANDRY malized V AND CONTROL 0.4 30,813,838 docs FOODSTUFFS TOBACCO Nor HUMAN NECESSITIES DAIRY PRODUCTS PERSONAL OR DOMESTIC ARTICLES OPERATIONS AND TRANSPORTING 0.2 HEALTH AMUSEMENT ocs CHEMISTRY AND d METALLURGY 2 ,02 49 F1 Macro TEXTILES AND PAPER 0.0 Test Size 7,7 t 18 oo 0 12 20 41 62 82 FIXED CONSTRUCTIONS R Class Index MECHANICAL ENGINEERING Figure 2: Normalised test size along with F1 Macro scores PHYSICS for each class. The x-axis represents class indices. The y- ELECTRICITY axis shows normalised values for test size and F1 Macro scores (blue dots). NEW TECHNOLOGIES Level 1 Level 2 Level 3 Level4 We demonstrated the experimental results on the two classifi- Figure 1: An example of a branch extension in KnowMap cation models RoBERTa and SBERT in Tab. 3. from the root to the lowest level, showing the association As observed from the results, the Macro-F1 score is higher than of KnowMap classes with corresponding CPC classes at the Micro-F1 score, which may indicate that the model performs each level. better for minority classes compared to majority classes. To gain more insights into these results, we generated a plot (see Fig.2), showing the F1 scores along with the normalised number of documents for each class in the test set. We used normalised 4.2 Classification Results values to allow both F1 scores and class sizes to be displayed in a single figure, facilitating better comparison. The classification task in this study was to classify patents into The plot shows that the Macro-F1 score is higher for minority 83 fine-grained classes within our proposed KnowMap taxonomy. classes than for majority classes, also indicating that random The dataset comprised 1,092,991 documents, which were split sampling led to an unbalanced dataset. The imbalanced sample into the train, validation, and test sets with a ratio of 0.8, 0.1, likely caused the higher Macro-F1 score relative to Micro-F1, and 0.1 respectively. We preserved the ratio of samples per class reflecting poorer performance in the majority classes. Future in all three sets with stratified splitting. The average number work will focus on using balancing techniques when sampling of documents in the train set, validation set, and test sets are to address this issue and enhance model performance. presented in Tab. 2. 
When looking more closely at the lowest F1-Macro scores, we found that the bottom 10 classes were all leaves under the chemistry and metallurgy section. Moreover, the highest F1-Macro scores (0.996) were achieved by the two classes in the textiles and paper section, followed by all 17 leaves from the physics section. We suspect this performance difference may be due to greater variation in the textual data of chemistry and metallurgy compared to physics and textiles and paper, leading to more variation between the training and test sets. Analysing this variation in detail remains a task for future work. Additionally, we believe future work could benefit from adapting the classifier to a hierarchical structure, prioritising correct predictions at higher levels before refining predictions at the leaf level. In our current approach, the classifier does not account for the hierarchy and predicts all leaves directly.

5 Discussion and Conclusions
In this work, we proposed a knowledge field taxonomy, KnowMap, which aligns with the widely used CPC schema. The taxonomy consists of 83 groups at the lowest level, with fine-grained classes containing a minimum of 9,000 samples from the original Google Patents Public Dataset after preprocessing. KnowMap serves as a benchmark taxonomy, addressing a gap in the existing literature. From the preprocessed original dataset, we randomly selected 1,093,151 samples to fine-tune pre-trained RoBERTa and SBERT models for downstream tasks. However, the random sampling resulted in an unbalanced dataset, which contributed to higher Macro-F1 scores compared to Micro-F1 scores. To enhance classification results, we plan to create a balanced dataset from the original data. Additionally, we aim to use larger models than those used in this study to further improve the fine-tuning process.

6 Future Work
Several knowledge platforms, such as news sites and GitHub, host various types of information shared online. In future work, we aim to incorporate these sources to extend and enhance the knowledge taxonomy's coverage. For example, the All Science Journal Classification (ASJC), which organises research publications by subject area, can be used to identify alignments with the existing taxonomy. This taxonomy alignment can then be further analysed to determine whether to merge or split classes at various levels. Beyond patents, we plan to evaluate the classifier on other data, using domain adaptation methods to transfer knowledge from the labelled patent domain to domains with limited or no labels. Large language models (LLMs) could further aid in evaluating the classifier's performance across different domains. Recent research has shown the potential of LLMs to augment or even replace human-labeled training data with labels generated by these models [23].

Moreover, we plan to enhance the classification task by balancing the dataset using balancing techniques for multi-label problems and leveraging larger pre-trained models. We will also closely examine the different knowledge fields to better understand the variations in classifier performance across them.

Acknowledgements
This work was supported by the Slovenian Research and Innovation Agency under grant agreements CRP V2-2272, V5-2264, and CRP V2-2146, and by the European Union through the enrichMyData EU HORIZON-IA project under grant agreement No 101070284.

References
[1] Segun Taofeek Aroyehun, Jason Angel, Navonil Majumder, Alexander Gelbukh, and Amir Hussain. 2021. Leveraging label hierarchy using transfer and multi-task learning: A case study on patent classification. Neurocomputing, 464, 421-431. doi: 10.1016/j.neucom.2021.07.057.
[2] Mehmet Aydar and Serkan Ayvaz. 2019. An improved method of locality-sensitive hashing for scalable instance matching. Knowledge and Information Systems, 58, 2, 275-294. doi: 10.1007/s10115-018-1199-5.
[3] Hamid Bekamiri, Daniel S. Hain, and Roman Jurowetzki. 2024. PatentSBERTa: A deep NLP based hybrid model for patent distance and classification using augmented SBERT. Technological Forecasting and Social Change, 206, 123536. doi: 10.1016/j.techfore.2024.123536.
[4] Gianni Costa, Alfredo Cuzzocrea, Giuseppe Manco, and Riccardo Ortale. 2011. Data De-duplication: A Review. In Learning Structure and Schemas from Documents. doi: 10.1007/978-3-642-22913-8.
[5] C. J. Fall, A. Törcsvári, K. Benzineb, and G. Karetka. 2003. Automated categorization in the international patent classification. ACM SIGIR Forum, 37, 1, 10-25. doi: 10.1145/945546.945547.
[6] Juan Carlos Gomez and Marie Francine Moens. 2014. A survey of automated hierarchical classification of patents. Lecture Notes in Computer Science, 8830, 215-249. doi: 10.1007/978-3-319-12511-4_11.
[7] Bikash Gyawali, Lucas Anastasiou, and Petr Knoth. 2020. Deduplication of scholarly documents using locality sensitive hashing and word embeddings. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association, 894-903.
[8] Arousha Haghighian Roudsari, Jafar Afshar, Wookey Lee, and Suan Lee. 2022. PatentNet: multi-label classification of patent documents using deep learning based language understanding. Scientometrics, 127, 1, 207-231. doi: 10.1007/s11192-021-04179-4.
[9] Omid Jafari, Preeti Maurya, Parth Nagarkar, Khandker Mushfiqul Islam, and Chidambaram Crushev. 2021. A Survey on Locality Sensitive Hashing Algorithms and their Applications. ACM Computing Surveys. arXiv: 2102.08942.
[10] Guik Jung, Junghoon Shin, and Sangjun Lee. 2023. Impact of preprocessing and word embedding on extreme multi-label patent classification tasks. Applied Intelligence, 53, 4, 4047-4062. doi: 10.1007/s10489-022-03655-5.
[11] Eleni Kamateri, Michail Salampasis, and Eduardo Perez-Molina. 2024. Will AI solve the patent classification problem? World Patent Information, 78, 102294. doi: 10.1016/j.wpi.2024.102294.
[12] Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models. In International Conference on Machine Learning, vol. 162, 10697-10707.
[13] Jong Wook Lee, Won Kyung Lee, and So Young Sohn. 2021. Patenting trends in biometric technology of the Big Five patent offices. World Patent Information, 65, 102040. doi: 10.1016/j.wpi.2021.102040.
[14] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 8424-8445. doi: 10.18653/v1/2022.acl-long.577.
[15] Shaobo Li, Jie Hu, Yuxin Cui, and Jianjun Hu. 2018. DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics, 117, 2, 721-744. doi: 10.1007/s11192-018-2905-5.
[16] Yinhan Liu et al. 2019. RoBERTa: a robustly optimized BERT pretraining approach. arXiv: 1907.11692.
[17] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using siamese BERT-networks. In Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
[18] Arousha Haghighian Roudsari, Jafar Afshar, Charles Cheolgi Lee, and Wookey Lee. 2020. Multi-label patent classification using attention-aware deep learning model. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp 2020), 558-559. doi: 10.1109/BigComp48618.2020.000-2.
[19] Victor Sanh, L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
[20] Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, and Stuart M. Shieber. 2023. The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks, 1-39. arXiv: 2207.04043.
[21] Christoph Trattner, Philipp Singer, Denis Helic, and Markus Strohmaier. 2012. Exploring the differences and similarities between hierarchical decentralized search and human navigation in information networks. In ACM International Conference Proceeding Series. doi: 10.1145/2362456.2362474.
[22] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33, 5776-5788.
[23] Xinru Wang, Hannah Kim, Sajjadur Rahman, Kushan Mitra, and Zhengjie Miao. 2024. Human-LLM collaborative annotation through effective verification of LLM labels. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24), Article 303. ACM, Honolulu, HI, USA. doi: 10.1145/3613904.3641960.
[24] Junghwan Yun and Youngjung Geum. 2020. Automated classification of patents: A topic modeling approach. Computers and Industrial Engineering, 147, 106636. doi: 10.1016/j.cie.2020.106636.
Enhancing causal graphs with domain knowledge: matching ontology concepts between ontologies and raw text data

Jernej Stegnar, Jožef Stefan Institute, Ljubljana, Slovenia (jernej.stegnar@gmail.com)
Jože M. Rožanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia (joze.rozanec@ijs.si)
Gregor Leban, Event Registry d.o.o., Ljubljana, Slovenia (gregor@eventregistry.org)
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia (dunja.mladenic@ijs.si)

ABSTRACT
When building a causal graph from textual sources, such as media reports, a key task is to provide an accurate semantic understanding of the causal variables encoded as nodes and to link them to existing ontologies with at least two purposes: (i) expand the knowledge with the domain knowledge captured in such ontologies, and (ii) provide accurate and different levels of abstraction of the extracted causal variables. This article describes how we used OntoGPT, a tool for matching raw text to ontology concepts initially designed for the medical domain, to match concepts from media events to relevant ontologies. We build upon our previous work on extracting causal variables and enrich the extraction pipeline by matching causal variables to concepts from specific domain ontologies. In particular, we describe our work regarding the GEO ontology. Future work will focus on expanding OntoGPT's capabilities by utilizing a wider selection of ontologies. Addressing its limitations, such as dealing with multiple instances of the same class, will also be crucial for improving its utility. These improvements will allow the tool to better support strategic foresight applications by providing more detailed insights across a multitude of different sectors, further enriching causal graphs and facilitating more accurate predictive modeling.

KEYWORDS
strategic foresight, ontology matching, artificial intelligence
reports, a key task is to provide an accurate semantic understand- AI enhances strategic foresight by automating the analysis of ing of the causal variables encoded as nodes and to link them data and detecting patterns that may go unnoticed by human to existing ontologies with at least two purposes: (i) expand the experts [1]. Machine learning algorithms can continuously mon- knowledge with the domain knowledge captured in such ontolo- itor emerging trends, geopolitical shifts, and market fluctuations gies and (ii) provide accurate and different levels of abstraction in near-real time, offering dynamic insights into potential future of the extracted causal variables. This article describes how we scenarios. Natural language processing (NLP) enables AI to sift used OntoGPT, a tool for matching raw text to ontology concepts through massive amounts of text, extracting relevant informa- initially designed for the medical domain, to match concepts from tion from reports, news, and social media, thus accelerating the media events to relevant ontologies. We build upon our previous forecasting process. By integrating AI into strategic foresight, work on extracting causal variables and enrich the extraction organizations can adapt more swiftly and make more informed, pipeline by matching causal variables to concepts from specific data-driven decisions in the face of uncertainty. domain ontologies. In particular, we describe our work regard- Ontologies provide structured knowledge informing the rela- ing the GEO ontology. Future work will focus on expanding tionships between concepts within a specific domain. Further- OntoGPT’s capabilities by utilizing a wider selection of ontolo- more, they describe those concepts through properties and can gies. Addressing its limitations, such as dealing with multiple link such classes to specific instances observed in the real world. instances of the same class, will also be crucial for improving its As such, they are of key importance when building a causality utility. These improvements will allow the tool to better support graph, given they can augment our understanding of the causal strategic foresight applications by providing more detailed in- relationships between variables with a better understanding of sights across a multitude of different sectors, further enriching the context and the variable implications [3]. For example, if causal graphs and facilitating more accurate predictive modeling. the causal relationship reports about the ceasing of an armed conflict, knowing whether a causal variable relates to a coun- KEYWORDS try, the location of that country, the neighboring countries, and international organizations it is involved in would help to un- strategic foresight, ontology matching, artificial intelligence derstand the magnitude of that event and contextualize other likely outcomes (refugee repatriation, impacts on investments, 1 INTRODUCTION and others). Strategic foresight is a discipline concerned with anticipating In the scope of the graph massive project, ontology matching future trends, uncertainties, and disruptions to inform decision- is being used to link the extracted causal relationships from text making and enable the creation of resilient, long-term strategies. 
2 ENRICHING CAUSAL GRAPHS WITH DOMAIN KNOWLEDGE
We consider ontologies a framework (an organized and structured system for representing knowledge) used to represent knowledge within a specific domain by defining the relationships between concepts. They consist of classes (concepts), properties (attributes), and relationships that connect different concepts. This structure provides a standardized way to organize and interpret data, ensuring consistent understanding across systems. For example, in a medical ontology, concepts like "disease" might be linked to "symptoms," "treatments," and "causes," each with its own defined properties. By formalizing these relationships, ontologies allow AI systems to better interpret and reason about complex information, leading to more accurate data processing and decision-making.
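To make the medical example concrete, such a fragment could be written in OWL/Turtle roughly as follows. This is an illustrative sketch with made-up IRIs, not an excerpt from any actual medical ontology:

    @prefix :     <http://example.org/med#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Classes (concepts)
    :Disease   a owl:Class .
    :Symptom   a owl:Class .
    :Treatment a owl:Class .

    # Properties (relationships between concepts)
    :hasSymptom a owl:ObjectProperty ;
        rdfs:domain :Disease ;
        rdfs:range  :Symptom .
    :treatedBy a owl:ObjectProperty ;
        rdfs:domain :Disease ;
        rdfs:range  :Treatment .

    # Instances observed in the real world
    :Influenza a :Disease ;
        :hasSymptom :Fever .
    :Fever a :Symptom .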
Ontologies enhance causality graphs by providing domain-specific knowledge that improves the accuracy and depth of the relationships represented. When extracting causal relationships from large datasets, such as media reports, the data can often be ambiguous or incomplete. Ontologies address this by offering structured knowledge that defines concepts and their relationships within a specific domain, linking extracted causal relationships to well-defined entities in the ontology. This enriches the causality graph, uncovering implicit connections and non-obvious relationships that may otherwise be missed. In strategic foresight, for example, ontology-based enrichment helps capture a broader range of potential future scenarios by incorporating knowledge beyond the immediate dataset. This leads to more reliable predictions, especially when the training data is limited or domain-specific. Ultimately, ontologies are expected to enable the system to generalize better, predict outcomes with higher accuracy, and improve the overall reliability of causality graphs.

The causality graph pipeline in the Graph Massivizer strategic foresight project is designed to automate the extraction, organization, and analysis of causal relationships from large datasets, particularly news articles. Figure 1 showcases the structure of our causality graph's data pipeline. The process begins with extracting these relationships from news articles, which are then organized into a causality graph that maps the interactions between various factors and events. The goal is to develop link prediction models that estimate the likelihood of future events based on observed patterns. For instance, one use case involves predicting oil price trends by analyzing factors that influence pricing.

Figure 1: The figure showcases our pipeline for building a causality graph. Sub-figure B showcases how the process of ontology linking was executed as part of our pipeline.

Ontology matching is then integrated into the pipeline to link extracted causal relationships with concepts from structured ontologies. This enrichment adds layers of context and enables the discovery of connections that may not be evident from raw data alone. By incorporating ontologies, the pipeline transcends the limitations of its training data, identifying causal relationships that may be implied by broader knowledge contained in the ontologies. This not only enhances the accuracy of the graph but also allows it to capture more complex and non-direct relationships, improving its predictive capabilities.

As shown in Fig. 1B, the process of ontology linking in our pipeline consisted of creating ontology matching templates, then linking the concepts in the text to ontologies, and using that information to add additional data to existing causalities, all with the purpose of finding extra implicit connections based on the information provided by the ontologies. The main problem that needed solving for that purpose was how to link ontologies to raw text data.
In our case, that was done using OntoGPT [2], a tool for ontology linking. Another key challenge is inter-ontology matching, which involves linking multiple ontologies through shared concepts. This process expands the knowledge framework, making it even more valuable for our purposes. The challenge of inter-ontology matching has not been addressed yet and remains a matter of future work.

3 ONTOGPT: A BRIEF OVERVIEW
OntoGPT is an advanced tool that integrates large language models (LLMs) with ontologies to improve knowledge extraction and organization across various domains. Ontologies provide a consistent and accurate representation of complex information by defining structured relationships between concepts.

The primary purpose of OntoGPT is to enhance AI systems' understanding, processing, and categorization of data by linking extracted information to predefined concepts and relationships within an ontology. This structured approach ensures greater accuracy and reliability compared to traditional AI systems that rely on unstructured data.

OntoGPT works by connecting data from sources such as text or reports to specific concepts in an ontology, allowing for more informed and contextually accurate connections. For example, in healthcare, OntoGPT can link symptoms from patient records to diseases and treatments outlined in medical ontologies, helping to suggest possible diagnoses or treatment plans.

By combining the language-processing capabilities of LLMs with the structured knowledge available in ontologies, OntoGPT enables AI systems to go beyond keyword matching and consider the relationships between terms. This leads to more intelligent data interpretation and improved decision-making.

OntoGPT is widely used in fields where structured knowledge is critical for high accuracy, such as healthcare, biology, and pharmaceutical research. In medical research, for instance, OntoGPT links clinical trial data, medical records, and scientific literature to medical ontologies, supporting better analysis and decision-making.

The key advantage of OntoGPT lies in its ability to ground AI outputs in domain-specific, structured knowledge, reducing the likelihood of errors and improving the relevance of insights. This grounding ensures that AI responses are not based on patterns alone but also on well-defined concepts and their relationships.

In summary, OntoGPT bridges the gap between the raw data-processing power of LLMs and the structured knowledge in ontologies. By leveraging both, it provides a more accurate and reliable approach to extracting and linking data across various domains, particularly when working with large, complex datasets.

3.1 OntoGPT's role
At a lower level, OntoGPT operates using YAML templates that define how data should be extracted from text and linked to ontological concepts. These templates serve as blueprints, specifying which types of entities, relationships, and properties to look for in the input text. The templates guide the large language model by mapping textual data to predefined concepts and relationships from the ontology, ensuring that the extracted information is both relevant and structured. Figure 2 shows the process of ontology linking for an example of a simple sentence. Each YAML template contains detailed instructions on how to identify key terms, their corresponding ontology classes, and the relationships between them. This allows OntoGPT to recognize when a piece of text, such as a sentence from a media article, contains a concept that aligns with an entity or event in the ontology. Once identified, the tool links the extracted data to these ontology entries, enabling richer and more meaningful connections in the data, as it is now grounded in an established knowledge framework.

Figure 2: A showcase of the function of OntoGPT.

The approach described in this article uses an ontology file as input to create such templates for data extraction and linking. This enables a broader range of ontology linking, as the templates can be created on demand.

4 TEMPLATES AND PYTHON CODE GENERATION
The approach works by using the information defined inside the ontology to generate the YAML templates. Figure 3 showcases how this is done.

Figure 3: The process of template generation.

First, the class information for each class inside the ontology is extracted. This is done by using the "owlready2" Python library to parse the ontology into an object and then extracting the relevant information from that object. Every class inside the ontology is used to create a corresponding template class, which is optimal, as it covers all parts of the ontology that could potentially be linked. A small portion of the data extraction process is ontology-specific and was custom-tailored to the individual ontology, as some information (like class descriptions) is saved in different parts.

Secondly, the data extracted from the ontology is processed and used to create custom YAML templates. This is done by using the extracted information to fill in a "general template" we used for generation; specifically, the class names and descriptions are used. This gives OntoGPT the names of the classes inside the ontology that we are trying to link the text data to, together with their descriptions, which assists OntoGPT in more accurately identifying these classes inside the text. The YAML file also contains the "annotators" information, which tells OntoGPT which ontology to ground the responses to. The generated YAML templates are saved into a separate file after generation, which makes them ready for use. The Python code used by OntoGPT in the process of ontology linking is similarly generated by filling in the "general template" with the extracted information, and is then saved to a separate file.
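As an illustration of this generation step, the following minimal Python sketch parses an ontology with owlready2 and fills a simplified "general template" with class names and descriptions. It is a sketch under stated assumptions, not the project's actual code: the file names, the template fields, and the annotator value are illustrative, loosely modeled on OntoGPT's LinkML-style templates.

    # A minimal sketch of template generation (illustrative, not the actual pipeline code).
    from owlready2 import get_ontology

    onto = get_ontology("file://geo.owl").load()  # hypothetical ontology file

    # Simplified per-class block of a "general template"; the annotator value
    # ("sqlite:obo:geo") is an assumption about how grounding is configured.
    GENERAL_TEMPLATE = """\
      {name}:
        is_a: NamedEntity
        description: {description}
        annotations:
          annotators: sqlite:obo:geo
    """

    blocks = []
    for cls in onto.classes():
        # Class descriptions are often stored in rdfs:comment; where an ontology
        # keeps them elsewhere, this lookup must be custom-tailored (see text).
        description = cls.comment[0] if cls.comment else "No description available."
        # Real code would also need to escape/clean the description for YAML.
        blocks.append(GENERAL_TEMPLATE.format(name=cls.name, description=description))

    with open("geo_template.yaml", "w") as f:
        f.write("classes:\n" + "".join(blocks))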
5 LIMITATIONS
5.1 Multiple Same-Class Concepts
OntoGPT has problems linking two or more concepts to a place in the ontology if the concepts are of the same class. This happens because both concepts suit the description and similar criteria on which OntoGPT bases its extraction. This causes OntoGPT to merge both concepts into a single string and then try to locate that string inside the ontology, which fails because there is no individual inside the ontology class with such a name. An example of such a response is shown in Listing 1:

Listing 1: Example of a bad response
    extracted_object:
      continent: AUTO:Europe%2C%20Africa
    named_entities:
      - id: AUTO:Europe%2C%20Africa
        label: Europe, Africa

If OntoGPT manages to locate a concept from the text inside the ontology, it returns its id (an example of this is "sea: GEO:000055471" and "id: GEO:000055471 : White Sea"). If the concept suits the class criteria but cannot be located inside the ontology, it is returned as an "AUTO" detection. For the purpose of ontology linking this is not optimal, as it does not give us access to the additional information stored in the ontology's individual information. The ontology's individual information is a set of predefined relationships and properties that an individual concept has. For example, if the individual "Africa" is defined inside the ontology, the individual's data would include its size, the countries on the continent, its population, and its climates, among others. This gives us reliable information about a certain concept, allowing for more contextual understanding.

To solve this problem, we took the approach of creating "buffer" classes, where a given ontology class is used to generate three classes describing the different occurrences of that class, each with a description that provides sufficient context for OntoGPT to separate same-class concepts into different entities. The corrected response is showcased in Listing 2:

Listing 2: Example of a corrected response
    extracted_object:
      continent: GEO:000000340
      continent_2: GEO:000000342
    named_entities:
      - id: GEO:000000340
        label: Africa
      - id: GEO:000000342
        label: Europe
While this approach deals with a high percentage of problems of this type, it does not cover the cases where more than three same-class concepts appear in the piece of text being analyzed.

6 CONCLUSIONS
Using OntoGPT in the Graph Massivizer strategic foresight project will prove valuable for enriching causal graphs with linked ontology data, aiming to improve accuracy in predicting future events. Despite OntoGPT's initial focus on medical data, some custom adaptations were successfully implemented to suit a portion of different domains. However, limitations persist in distinguishing between multiple instances of the same concept class. These challenges highlight the need for further development to enhance the tool's versatility across a broader array of applications and ontologies.

ACKNOWLEDGMENTS
The Slovenian Research Agency supported this work. This research was developed as part of the Graph-Massivizer project, funded under the Horizon Europe research and innovation program of the European Union under grant agreement 101093202.

REFERENCES
[1] Patrick Brandtner and Marius Mates. 2021. Artificial intelligence in strategic foresight: current practices and future application potentials. In Proceedings of the 2021 12th International Conference on E-business, Management and Economics, 75–81.
[2] J. Harry Caufield, Harshad Hegde, Vincent Emonet, Nomi L. Harris, Marcin P. Joachimiak, Nicolas Matentzoglu, HyeongSik Kim, Sierra Moxon, Justin T. Reese, Melissa A. Haendel, et al. 2024. Structured prompt interrogation and recursive extraction of semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 40, 3 (2024), btae104.
[3] Fatma Özcan, Chuan Lei, Abdul Quamar, and Vasilis Efthymiou. 2021. Semantic enrichment of data for AI applications. In Proceedings of the Fifth Workshop on Data Management for End-To-End Machine Learning, 1–7.
[4] David Sarpong and Nicholas O'Regan. 2014. The organizing dimensions of strategic foresight in high-velocity environments. Strategic Change 23, 3-4 (2014), 125–132.

Measuring and Modeling CO2 Emissions in Machine Learning Processes

Ivo Hrib, Jožef Stefan Institute, Ljubljana, Slovenia, ivo.hrib@gmail.com
Oleksandra Topal, Jožef Stefan Institute, Ljubljana, Slovenia, Oleksandra.Topal@ijs.si
Jan Šturm, Jožef Stefan Institute, Ljubljana, Slovenia, jan.sturm@ijs.si
Maja Škrjanc, Jožef Stefan Institute, Ljubljana, Slovenia, maja.skrjanc@ijs.si

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.23

Abstract
With the rapid expansion of the computing industry, efficient energy utilization and reduction of CO2 emissions are critically important. This research develops analytical tools to predict CO2 emissions from various machine learning processes. We present a novel methodology for data acquisition and analysis of CO2 emissions during model training and testing. Our results demonstrate the environmental impact of different algorithms and provide insights into optimizing energy consumption in artificial intelligence applications.

Keywords
CO2 Emissions, Machine Learning, Energy Consumption, Environmental Impact, AI Model Optimization, Green AI, Sustainable Computing, Carbon Footprint
1 Introduction
The global computing industry significantly contributes to CO2 emissions, with data centers accounting for 2.5 to 3.7 percent of global greenhouse gas emissions [1]. These emissions exceed those of the aviation industry due to continuous operations and heavy reliance on fossil fuels [11]. Given the growing demand for artificial intelligence (AI) applications, there is an urgent need for CO2-conscious solutions.

This research aims to develop tools for predicting the CO2 emissions associated with machine learning processes, thus enabling the reduction of the environmental impact of AI models. In collaboration with Eviden (Spain) and under the FAME EU project, we have developed a CO2 emissions analysis system using tools like CodeCarbon [2] and eco2AI [3].

1.1 Research Goals
The primary goal of this research is to develop a service that predicts the CO2 emissions and power consumption of different machine learning models during both training and evaluation phases, with emphasis on hyperparameter dependency. The CO2 emissions are measured in kilograms per second (kg/s), while the power consumption is measured in kilowatt-hours (kWh).

While existing services, such as CodeCarbon [2] or eco2AI [3], provide real-time measurement of emissions, they do not offer insights into a model's emissions before its construction or use. The service we aim to provide addresses this gap by offering an estimation of emissions and power consumption for different models before they are selected for specific use cases. This forward-looking approach allows for more informed decisions when choosing models, potentially reducing their environmental footprint.

2 Related Work
The environmental impact of machine learning models has been a growing concern in recent years. Several studies have focused on quantifying and reducing the carbon footprint of artificial intelligence (AI) processes. For instance, [12] highlighted the energy consumption of training large neural models and suggested methods for minimizing emissions. Similarly, tools like CodeCarbon [2] and eco2AI [3] have emerged to measure real-time CO2 emissions from computational tasks. However, these tools often lack predictive capabilities for assessing emissions before model selection. Our work builds on these existing methodologies, specifically on the work of eco2AI [3], by providing a forward-looking approach that estimates emissions during the model selection phase, thus complementing real-time monitoring tools. This is achieved through heavy dependency on eco2AI's [3] measuring systems for data collection, later used for modeling based on the collected data and the registered hyperparameters.

2.1 Research Gap and Contribution
Despite the growing availability of tools like CodeCarbon [2] and eco2AI [3], a significant gap remains in the preemptive evaluation of environmental impact during the machine learning (ML) model selection phase. The mentioned tools are valuable for post hoc analyses but do not assist ML practitioners in making informed decisions upfront, before model development, on the environmental footprint of different model architectures or hyperparameters.

This gap is crucial, as the model selection phase often involves trial-and-error across multiple models and configurations, potentially leading to unnecessary resource consumption. Without predictive capabilities, practitioners have limited insight into which models will have the lowest environmental impact before engaging in resource-intensive training.

Our research aims to fill this gap by introducing a predictive service that estimates the environmental footprint of different ML models before they are trained or used. This service leverages the data collected from existing tools like eco2AI [3], incorporating key features such as hyperparameters and model architecture into predictive models. By doing so, we enable developers to make more sustainable choices at the model selection stage, reducing carbon emissions from the start of the ML lifecycle.

Table 1 below presents a feature matrix comparing our proposed service with current tools, showing how our approach addresses unmet needs.

Table 1: Feature comparison of existing tools and the proposed service

Tool | Platform | Model coverage | Metric granularity | Carbon metrics | Energy metrics | Additional features | Real-time measurement | Forward-looking prediction
CodeCarbon | Cloud, on-premise | All ML models | Per training session | CO2 emissions (kg) | Energy consumption (kWh) | Dashboard visualization | Yes | No
eco2AI | Cloud, on-premise | All ML models | Per training session | CO2 emissions (kg) | Energy consumption (kWh) | Not RAPL-based | Yes | No
Proposed service | On-premise | Specific models (listed below) | Per model, per selection phase | CO2 emissions (kg/s) | Energy consumption (kWh) | Predictive modeling | No | Yes
3 Methodology
Due to the lack of suitable data on the CO2 emissions of machine learning models, we began by developing an infrastructure for data collection. This infrastructure is composed of the following steps:

• Dataset Generation: Creating synthetic datasets using random data generation methods.
• Data Preprocessing: Cleaning and preparing the data for analysis.
• CO2 Emission Measurement: Recording CO2 emissions during both training and testing phases using different machine learning algorithms.
• Feature Extraction: Extracting relevant features such as project ID, experiment details, epoch duration, power consumption, and hardware configurations.
• Adding Hyperparameters to Final Dataset: Documenting the hyperparameters used in each experiment to assess their impact on emissions.
• Containerization: Utilizing Docker for containerization to ensure reproducibility and scalability of the experiments.
• Data Storage: Storing all datasets, features, and emission records systematically in a database for further analysis.
• Modeling: Developing and training machine learning models to predict CO2 emissions and power consumption.

The software implementation uses Python, with dependencies including pandas [7], scikit-learn [10], matplotlib [5], eco2AI [3], TensorFlow, Keras, and Docker for containerization.

3.1 Dataset Generation
In this step, we created a synthetic dataset by generating random data points using tools like sklearn.datasets.make_regression or make_classification. The primary objective here is not to reflect real-world data scenarios but to produce a controlled environment where the focus is on measuring CO2 emissions and power consumption during model training and evaluation. The generated datasets vary in size from 250 to 15000 samples and from 5 to 2000 features. In classification cases, the number of classes additionally ranges from 2 to 50. These parameter ranges were selected to mitigate the risk of computational overload, ensuring that the experiments remain feasible within the available computational resources while maintaining the integrity of the analysis.

3.2 Data Preprocessing
Before analysis, the dataset must be cleaned and prepared. This includes handling missing values, normalizing or standardizing data, encoding categorical variables, and splitting the data into training and testing sets. Proper preprocessing ensures that the data is in the optimal format for the models to learn from and minimizes biases that may affect model performance and emission measurements.

3.3 CO2 Emission Measurement
We measure the CO2 emissions produced during both the training and testing phases of the machine learning models. This involves using tools like eco2AI [3] to track energy consumption and convert it into equivalent CO2 emissions. The measurements are taken for various models, such as Decision Trees, Random Forests, Logistic Regression, and Neural Networks, to assess their environmental impact under different computational loads.
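To make the dataset generation and measurement steps concrete, the following minimal sketch runs one synthetic experiment under eco2AI's Tracker (eco2ai.Tracker with start()/stop()). The dataset sizes follow the ranges reported above, while the project name, experiment description, and output file name are illustrative assumptions rather than the project's actual configuration:

    # A minimal sketch of a single synthetic measurement run (illustrative).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    import eco2ai

    # Synthetic data within the reported ranges (250-15000 samples, 5-2000 features, 2-50 classes).
    X, y = make_classification(n_samples=5000, n_features=200, n_classes=10,
                               n_informative=50, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    tracker = eco2ai.Tracker(project_name="co2_experiments",
                             experiment_description="RFC, 5000x200, 10 classes",
                             file_name="emission.csv")
    tracker.start()                    # measure the training phase
    RandomForestClassifier().fit(X_train, y_train)
    tracker.stop()                     # appends CO2 (kg) and energy (kWh) records to emission.csv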
3.4 Feature Extraction
To gain deeper insights, we extract various features that could impact CO2 emissions and energy consumption. These features include project identifiers, detailed descriptions of each experiment, the duration of each training epoch, power consumption metrics, hardware configurations (such as the type of CPU/GPU used), and hyperparameters. The project identifiers are unique alphanumeric codes assigned to each machine learning experiment upon execution. These identifiers help differentiate between various model configurations and experimental setups. They are generated and stored automatically by our system during the dataset generation process to ensure traceability and reproducibility of the experiments.

3.5 Adding Hyperparameters to Final Dataset
We document the hyperparameters used in each machine learning experiment, such as learning rates, batch sizes, and the number of layers in neural networks. This allows us to evaluate how these hyperparameters influence CO2 emissions and energy consumption.

3.6 Containerization
To ensure reproducibility and scalability of our experiments, we employ Docker for containerization. This approach encapsulates the code, dependencies, and environment settings, allowing the experiments to be easily replicated and deployed across different platforms.

3.7 Data Storage
All datasets, extracted features, hyperparameter configurations, and CO2 emission records are systematically stored in a database. This central repository facilitates efficient querying, retrieval, and analysis of data to support ongoing and future research.

3.8 Modeling
In this step, we develop and train machine learning models to predict CO2 emissions and power consumption based on various features, such as the type of algorithm used, the hardware configuration, and the model parameters. This modeling allows us to estimate emissions for different machine learning workflows before their actual deployment. The models help identify the most efficient algorithms and configurations, thus guiding the selection of environmentally friendly AI solutions.

The general pipeline for the previously mentioned steps is shown in Figure 1, and a more detailed view of a single measurement run is shown in Figure 2.

Figure 1: General Measurement Pipeline
Figure 2: Single Model Measurement Pipeline
4 Model Architecture
In this section, we explain the architecture of the model used for predicting CO2 emissions and power consumption based on various features such as CPU type, GPU type, region, and other experiment-specific details. The model implementation is encapsulated within a Python class named MultiModel, which is responsible for managing the entire process from data preprocessing to training and prediction.

The model employs two separate neural networks, one for predicting CO2 emissions and one for power consumption. The architecture of each neural network is as follows:

• Input Layer: Receives the scaled and encoded features.
• Hidden Layers: Consist of multiple Dense layers with ReLU activation functions. The CO2 emissions model includes three hidden layers with 128, 64, and 128 neurons, respectively, while the power consumption model has three hidden layers with 64, 64, and 128 neurons.
• Output Layer: A single neuron that outputs the predicted value for either CO2 emissions or power consumption.

4.1 Model Training
The model is compiled using the Adam optimizer [6] and the Mean Squared Error (MSE) loss function. Since we were unable to gather adequate real-time data on environmental factors that may influence our predictions (e.g., the distribution of energy sources or the real-time CO2 per kWh), our model relies on static yearly averages of these values [8][9]. The model uses the aforementioned features for regression, with the goal of predicting the power consumption and CO2 emissions gathered by the previously mentioned random tests. Each model is trained for 25 epochs using the preprocessed data. After training, the models, along with their respective scalers and encoders, are saved to disk for later use.

4.2 Prediction
Once trained, the model can predict CO2 emissions and power consumption for new data points by loading the appropriate model, scaler, and one-hot encoder. The input data is preprocessed in the same manner as during training, and the predictions are obtained by applying the trained models. This modular approach allows for easy extension to additional models or data sources and provides a scalable solution for analyzing the environmental impact of machine learning processes.
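A minimal sketch of the CO2 emissions network described above is shown below, assuming standard Keras APIs. The placeholder data, the preprocessing, and the persistence details are simplifications, not the actual MultiModel implementation:

    # A minimal sketch of the CO2 emissions regressor (illustrative placeholder data).
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from tensorflow import keras

    def build_co2_model(n_features):
        model = keras.Sequential([
            keras.layers.Input(shape=(n_features,)),     # scaled and encoded features
            keras.layers.Dense(128, activation="relu"),  # hidden layers: 128, 64, 128
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(1),                       # predicted CO2 emissions (kg/s)
        ])
        model.compile(optimizer="adam", loss="mse")      # Adam + MSE, as in Section 4.1
        return model

    # Illustrative training run on random placeholder data.
    X = np.random.rand(1000, 20).astype("float32")
    y = np.random.rand(1000).astype("float32")
    scaler = StandardScaler()
    model = build_co2_model(X.shape[1])
    model.fit(scaler.fit_transform(X), y, epochs=25, verbose=0)
    model.save("co2_model.keras")  # the scaler/encoder would be saved alongside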
5 Web Application Interface for CO2 Emissions and Power Consumption Prediction
In addition to the backend model developed for predicting the CO2 emissions and power consumption of various AI models, a web application was created to provide a user-friendly interface for real-time predictions. The web app, as shown in Figure 3, allows users to select different machine learning models and configure parameters to estimate the associated environmental impacts.

Figure 3: Web App Interface

5.1 Key Features of the Web Application
The web application interface is designed with simplicity and functionality in mind. It includes several key components:

• Model Selection: Users can choose the type of machine learning model they are interested in evaluating: Logistic Regression (LogR), Decision Tree Classifier (DTC), Decision Tree Regression (DTR), Neural Network Classifier (NNC), Neural Network Regression (NNR), Linear Regression (LinR), Random Forest Classifier (RFC), or Random Forest Regression (RFR). The dropdown menu in the upper-left corner of the interface provides a list of available models.
• Model Parameters Configuration: A section labeled "Model Parameters" allows users to specify various inputs:
  – Train or Evaluate: Users can choose whether to estimate emissions for the training or evaluation phase of the model.
  – Dataset Samples and Features: Input fields are provided for users to define the size of the dataset in terms of the number of samples and features.
  – CPU and GPU Specifications: The app allows the selection of the CPU and GPU type, reflecting different hardware configurations, such as "Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz/1 device(s), TDP:205.0" or "AMD Ryzen 7 4800H with Radeon Graphics/1 device(s), TDP:45.0".
  – Region/Country Selection: A dropdown to select the geographic location where the model is being executed, which influences the CO2 emissions based on local energy sources.
• Real-Time Predictions: Once all parameters are configured, the application dynamically calculates and displays:
  – CO2 Emissions: The predicted emissions are shown in kilograms per second (kg/s).
  – Power Consumption: The power consumption is provided in kilowatt-hours (kWh).
• Electricity Source Distribution: A graphical representation is provided for the distribution of electricity sources, such as coal, gas, and oil, in the selected region. This information is crucial for understanding the environmental impact of power consumption based on the local energy mix.

5.2 User Experience and Accessibility
The web application is developed with accessibility in mind, ensuring that users, regardless of technical background, can interact with the model's predictive capabilities. By offering a clear and intuitive interface, it aims to make the process of estimating CO2 emissions and power consumption transparent and straightforward. Figure 3 illustrates the application's main screen, where the model type, parameters, and results are all visible at a glance. This real-time feedback loop allows users to make informed decisions based on the predicted environmental impact.
6 Results
6.1 Model Error
To evaluate the performance and accuracy of the models, we conducted a 10-fold cross-validation to estimate the errors in predicting CO2 emissions and power consumption. The results are presented in Table 2. The errors for both CO2 emissions and power consumption were computed for both the training and evaluation phases of each model type.

Note: In this context, "Train." refers not to the error on the training set, but to the error made by our model in predicting the CO2 emissions / power consumption during the training phase of the listed model. Similarly, "Eval." refers not to the error on the evaluation set, but to the error made by our model in predicting the CO2 emissions / power consumption when the listed model makes predictions. This distinction is crucial to understanding the results accurately.

Table 2: Model Scaled Error Estimates from 10-Fold Cross-Validation

Model | Phase | CO2 Error | Power Error
DTC | Eval. | 0.0036 | 0.0043
DTC | Train. | 0.0631 | 0.0649
DTR | Eval. | 0.0032 | 0.0034
DTR | Train. | 0.0133 | 0.0517
RFC | Eval. | 0.0094 | 0.0098
RFC | Train. | 0.3242 | 0.3582
RFR | Eval. | 0.0087 | 0.0081
RFR | Train. | 0.2565 | 0.2779
LogR | Eval. | 0.0063 | 0.0057
LogR | Train. | 0.0055 | 0.0043
LinR | Eval. | 0.0099 | 0.0105
LinR | Train. | 0.0104 | 0.0095
NNC | Eval. | 0.0018 | 0.0030
NNC | Train. | 0.1083 | 0.1216
NNR | Eval. | 0.0045 | 0.0112
NNR | Train. | 0.1051 | 0.1008

Based on the results obtained through the 10-fold cross-validation, it is evident that model performance varies significantly across different algorithms and phases. One notable observation is that the errors in predicting CO2 emissions and power consumption are relatively higher during the training phases, particularly for more complex models like Neural Networks and Random Forests [4].

This discrepancy in model performance can be attributed to the sparsity of the data collected during the measurement phase. The limited data points lead to substantial gaps in the attribute space covered by the models, resulting in erratic behavior when predicting outside these ranges. Consequently, the models show diminished accuracy and reliability when confronted with input configurations that fall beyond the scope of the original data.

Future research should focus on enhancing the robustness of these models by expanding the dataset to include a broader range of scenarios and conditions. This would help mitigate the effects of sparsity and improve the models' generalizability, ensuring more reliable predictions across diverse settings.
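For illustration, the following sketch shows one way such 10-fold cross-validated error estimates can be computed with scikit-learn. The data and regressor here are placeholders, not the study's actual measurement dataset or MultiModel networks:

    # A minimal sketch of a 10-fold CV error estimate (illustrative placeholders).
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    X = np.random.rand(500, 10)   # placeholder measurement features
    y = np.random.rand(500)      # placeholder CO2 emission targets
    scores = cross_val_score(MLPRegressor(max_iter=500), X, y,
                             scoring="neg_mean_squared_error", cv=10)
    print("CO2 error estimate:", -scores.mean())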
6.2 CO2 Emission Analysis Across Different Models
Figure 4 provides a comparative analysis of the mean CO2 emissions generated by different machine learning models during their operation, represented on a logarithmic scale to accommodate the wide range of emission values.

Figure 4: Logarithmically scaled mean emissions across different models

The chart highlights significant variations in CO2 emissions among models, with the Neural Network Classifier and Neural Network Regressor exhibiting the highest emissions by a considerable margin. This is expected due to the intensive computational requirements and numerous parameters these models necessitate, resulting in elevated power consumption and consequently higher CO2 output.

In contrast, simpler models like Logistic Regression, Linear Regression, and Decision Tree models show substantially lower CO2 emissions, reflecting their reduced computational complexity and lower resource demand.

Interestingly, the Random Forest models, particularly the Regressor, present moderate emissions, illustrating that even ensemble methods, which typically involve training multiple decision trees, can maintain reasonable emission levels depending on their configuration.

This analysis underscores the importance of model selection not only for performance but also for minimizing environmental impact, particularly when scaling up operations or deploying in resource-constrained settings.

7 Discussion
The results highlight the significant environmental impact of training complex AI models, particularly neural networks. The variability in emissions suggests that optimizing model hyperparameters and selecting appropriate hardware configurations can reduce CO2 output. Future research should focus on model improvement for better and more accurate prediction, on expanding the range of algorithms studied, and on intensive data collection to fill the gaps in the training data.

8 Limitations
This study presents several limitations, particularly regarding the data, the model evaluation, and the hardware configurations, which must be considered when interpreting the results.

8.1 Training Duration and Model Learning
The models were trained for a fixed number of epochs (e.g., 10 or 20), prioritizing computational cost over learning performance. The focus was on estimating CO2 emissions rather than model accuracy or convergence, meaning the models may not have fully captured patterns in the data. As such, the reported emissions reflect standardized training durations (with an upper limit for computational efficiency), not optimized learning outcomes.

8.2 Lack of Meaningful Learning Objective
The use of randomly generated data limits the evaluation of model learning. Since the data lacked inherent structure, the models' ability to learn was not assessed. Instead, the models were primarily evaluated on their resource consumption during training, reducing the focus on generalization or predictive power.

8.3 Hardware and Software Considerations
The experiments were conducted on specific hardware (e.g., GPU/CPU configurations), and variations in hardware were not examined. Different hardware setups, especially energy-efficient systems, could significantly impact CO2 emissions and energy consumption. Therefore, the findings may not generalize across all hardware environments. However, we would like to point out that this was due to a lack of infrastructure for broader experimentation.

9 Future Work
Future research should incorporate real-world datasets, optimize hyperparameters, and evaluate diverse hardware configurations to extend these findings to broader machine learning scenarios. The exploration of more complex architectures and learning objectives will provide a deeper understanding of the trade-offs between performance and environmental impact.
10 Conclusion
Our study presents a methodology for monitoring and analyzing CO2 emissions during machine learning processes. The findings demonstrate that different machine learning models exhibit significant variability in their energy consumption and CO2 emissions, with complex models like neural networks having a higher environmental impact. By providing predictive insights into these emissions, our approach enables more informed decision-making during model selection, thus contributing to the broader goal of reducing the carbon footprint of AI applications.

Future work will focus on expanding the dataset to include more diverse models and configurations. Additionally, we plan to integrate real-time monitoring tools to compare predictions with actual emissions, further refining our predictive capabilities. Moreover, optimizing model hyperparameters and exploring alternative, more sustainable hardware configurations will be key areas of investigation for minimizing the environmental impact of machine learning workflows.

Acknowledgements
This work was supported by the FAME project, funded by the European Union's Horizon 2023 Research and Innovation Programme under grant agreement No 101092639.

References
[1] Climatiq. 2023. Climatiq: Emissions intelligence platform. Provides data on the carbon emissions of various activities, including computing. https://www.climatiq.io/.
[2] CodeCarbon Development Team. 2023. CodeCarbon: An open source tool for tracking the carbon emissions of machine learning experiments. https://github.com/mlco2/codecarbon.
[3] Eco2AI Development Team. 2023. Eco2AI: Real-time CO2 emission tracking for machine learning. https://github.com/sb-ai-lab/Eco2AI.
[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[5] John D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9, 3, 90–95.
[6] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[7] Wes McKinney. 2010. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference. Vol. 445, 51–56.
[8] Our World in Data. [n. d.]. https://ourworldindata.org/grapher/carbon-intensity-electricity?tab=table.
[9] Our World in Data. [n. d.]. https://ourworldindata.org/electricity-mix.
[10] Fabian Pedregosa et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[11] Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. arXiv preprint arXiv:1907.10597. doi: 10.48550/arXiv.1907.10597.
[12] Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 3645–3650.

Enhancing Ontology Engineering with LLMs: From Search to Active Learning Extensions

Ganna Kholmska, Jožef Stefan Institute, Ljubljana, Slovenia, anna.kholmska@gmail.com
Klemen Kenda, Jožef Stefan Institute, Ljubljana, Slovenia, klemen.kenda@ijs.si
Joze Rozanec, Jožef Stefan Institute, Ljubljana, Slovenia, joze.rozanec@ijs.si

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.28

Abstract
This paper explores the use of LLMs in ontology engineering within the HumAIne project, focusing on the discovery, analysis, and extension of ontologies in Data Mining, Machine Learning, and manufacturing. The methodology leverages fine-tuned prompts and combines LLMs with traditional tools like Protege for validation. A multi-LLM approach improved domain-specific concept coverage and reduced errors, though challenges remain in addressing deep domain-specific gaps and ensuring logical consistency.

Keywords
LLMs, Ontology Engineering, Active Learning, Data Mining, Machine Learning, Ontology Selection, Ontology Extension

1 Introduction
The HumAIne project, funded by the European Commission under the Horizon Europe program, aims to develop a platform integrating advanced AI paradigms such as Active Learning (AL), Neuro-Symbolic AI, Swarm Learning, and Explainable AI. This platform is designed to enhance human-AI collaboration in dynamic, unstructured environments, with applications spanning healthcare, manufacturing, finance, energy grids, and smart cities. Its primary goal is to support decision-making by combining human expertise with AI capabilities.

One of the project's key challenges is developing multiple ontologies that provide a structured framework for integrating domain-specific knowledge. This framework is essential for enhancing the clarity and reliability of AI-driven decisions, while ensuring adaptability across diverse applications. To construct these ontologies, we first explored publicly available ontologies relevant to the project's scope, then extended selected ones with concepts from HumAIne's AI paradigms, starting with Active Learning.

However, manual ontology construction is a complex, resource-intensive process that requires expertise across multiple domains and collaboration among stakeholders. Ensuring modularity, reusability, and scalability adds to this complexity.
Recent studies show that leveraging Large Language Models (LLMs) can streamline ontology construction by reducing manual effort and improving consistency and quality. For instance, [1] demonstrates semi-automatic knowledge graph construction using open-source LLMs, while [2] proposes methods for automatic concept hierarchy generation through LLM queries. Building on this research, this paper contributes a methodology that integrates LLMs with traditional tools like Protege to streamline the discovery, analysis, and extension of ontologies. By employing a multi-LLM approach, we address challenges in domain-specific concept identification and ensure more consistent, accurate results in ontology development for fields like Data Mining, Machine Learning, and manufacturing.

2 LLM-Assisted Search and Analysis of Domain Ontologies
Our experimentation with methodologies and tools for efficient web search and ontology analysis in the Data Mining (DM), Machine Learning (ML), and manufacturing domains led to the development of the LLM-leveraging algorithm shown in Fig. 1. This algorithm uses carefully crafted prompts to guide LLMs in generating accurate, targeted queries. Before each step, the initial prompt is optimized through several iterations in a dialogue with the LLM to improve accuracy and relevance. Further details on the iterative query refinement process are provided in the Discussion section.

Step 1: Define the Search Objective. At this stage, LLMs like Bing Chat, Google's Bard, or ChatGPT with Web Browsing are employed to iteratively refine the search objectives initially formulated by the researcher, along with relevant keywords, phrases, and terms describing the ontologies or concepts of interest. For instance, our initial search objective for DM and ML ontologies was to "Find ontologies that offer up-to-date, detailed descriptions of the DM and ML domains, following best practices in ontology engineering." Keywords included "Active Learning" and "CRISP-DM standard."

Step 2: Formulate Search Queries Using LLMs. Based on the refined search objectives and keywords, and using a carefully crafted prompt, LLMs generate targeted search queries (an API-based sketch of this step appears at the end of this section). These queries are fine-tuned through feedback or early search results to maximize relevance and accuracy. For example, for a DM ontology, the LLM generated queries such as "Data Mining ontology for semi-supervised machine learning," which were further refined before finalizing the query.
Step 3: Conduct Web Search. This step involves real-time browsing tools like Copilot in Microsoft Edge (GPT-4) and Perplexity AI to execute searches and identify relevant sources. Our study prioritized high-quality sources like ontology repositories (e.g., BioPortal, OBO Foundry) and academic databases (Google Scholar, IEEE Xplore, ACM Digital Library). It is important to acknowledge that LLM-driven web searches are generally confined to public repositories and a limited range of academic databases. As a result, proprietary or lesser-indexed ontologies may require manual exploration to ensure a more thorough search.

Step 4: Retrieve and Summarize Information. LLMs (Google Bard, Copilot (GPT-4), Perplexity AI) were employed to extract and summarize key information from ontology descriptions found in publications, technical papers, and repository documentation identified during the search. Using a specifically tuned prompt, LLMs extracted 11 characteristics for each of the 34 identified DM and ML ontologies. These characteristics included purpose, availability, ontology metrics, reused ontologies, software editors, representation language, and evaluation methodologies. This structured data, organized in table format, provided valuable insights into each ontology's scope, quality, and reusability. From these results, we selected 6 ontologies for further exploration, prioritizing comprehensive coverage of DM and ML concepts, adherence to ontology engineering best practices, and alignment with established standards in these domains.

Step 5: Analyze and Evaluate Ontologies. LLMs were further utilized to assess the relevance, content, and structure of the selected ontologies. In our study of DM and ML ontologies, LLMs such as GPT-4, which can process, explain, and generate OWL and RDF code, were used alongside ontology tools like Protege. This combination ensured that the ontologies addressed relevant concepts and aligned with frameworks like CRISP-DM. GPT-4 helped significantly in bridging the gap between textual descriptions and formal ontology representations.

Step 6: Cross-Reference and Compare Findings. LLMs with contextual understanding were employed to integrate and refine information from multiple sources. For this task, ChatGPT (GPT-4) categorized 65 manufacturing ontologies, assessing them for relevance to process planning, standardization, industry adoption, interoperability, and support for advanced manufacturing concepts. Further exploration of the top 8 LLM-scored ontologies showed strong alignment with expert evaluations, but domain-specific tasks required carefully crafted prompts and human oversight for effectiveness.

Step 7: Provide Recommendations for Further Exploration. LLMs generated recommendations for the most suitable ontologies or areas for additional research based on the previous step's results. This includes identifying underexplored concepts and areas needing further investigation.

Step 8: Validate and Document Findings. The findings were manually validated for accuracy and relevance, then systematically documented. ChatGPT (GPT-4) was used to summarize and structure the documentation.

Step 9: Iterate and Refine Search (if needed). When results were too broad or irrelevant (e.g., Active Learning misinterpreted as an educational method), we refined the search prompt by adding more context.

By using this LLM-based algorithm, we conducted comprehensive web searches and extracted relevant information to identify the most suitable ontologies for the HumAIne project. In the DM and ML domains, we selected the OntoDM suite (OntoDM-Core, OntoDM-KDD, and OntoDT). For the manufacturing domain, we identified the Industrial Ontologies Foundry Core (IOF Core) as the best fit.

Figure 1: Key steps of LLM-leveraging algorithm
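The authors worked through chat interfaces rather than APIs, but an API-based equivalent of Step 2 could look like the following sketch; the client, model name, and prompt wording are illustrative assumptions, not the setup used in the study:

    # A hypothetical API-based sketch of Step 2 (query generation).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    objective = ("Find ontologies that offer up-to-date, detailed descriptions "
                 "of the DM and ML domains, following best practices in "
                 "ontology engineering.")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Generate five targeted web-search queries for "
                              f"this objective: {objective} "
                              f"Keywords: Active Learning, CRISP-DM standard."}],
    )
    print(response.choices[0].message.content)  # queries, refined further before use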
3 LLM-Assisted Ontology Extension with Active Learning Concepts
Integrating Active Learning (AL) into an ontology requires extending it with new classes, properties, and relationships representing key AL concepts. While traditional methods of building and extending ontologies are well documented, we leveraged GPT-4 for this task using iteratively refined prompts (see the Discussion section). This section outlines how LLMs, particularly GPT-4, were used to extend the IOF Core ontology with AL concepts.

Step 1: Define the Problem and Objectives. Through iteratively refined prompts, LLMs formulated clear objectives, specifying the domain (e.g., manufacturing) and key concepts (e.g., Active Learning). These outputs were used to guide further steps, with LLMs leveraging contextual understanding, knowledge synthesis, and language generation to suggest relevant AL applications such as adaptive scheduling. Queries like "How can Active Learning improve adaptive scheduling in manufacturing?" generated valuable insights into potential use cases where AL would be most beneficial.

Step 2: Analyze the Ontology to be Extended. By combining Protege's visualization and navigation tools with GPT-4's ability to process textual and machine-readable data (e.g., OWL/RDF), we thoroughly examined the IOF Core ontology structure and identified areas for introducing AL concepts. For example, GPT-4 helped uncover key classes like "Process," "Resource," and "PerformanceMetric" within IOF Core, highlighting relevant properties for AL integration. Queries such as "What aspects of IOF Core can benefit from AL integration?" and "What key concepts are missing from the IOF Core ontology for integrating Active Learning in manufacturing?" guided us in identifying areas for improvement, including handling uncertainty and adjusting dynamic processes.

Step 3: Identify Active Learning Concepts. The main tasks of this step and the role of LLMs in supporting each task are summarized in Table 1:
Table 1: LLM applications for identifying AL concepts

Task | LLM Application | Example Output
1. Identify fundamental AL concepts | Use LLMs to generate a list of core AL strategies and techniques | Concepts like "Uncertainty sampling" and "Query-by-committee"
2. Extract domain-specific AL concepts | Query LLMs about AL in specific industrial contexts | Concepts like "Query Efficiency" in decision-making for manufacturing
3. Mine AL concepts from literature | Process academic papers and reports to extract relevant AL terms | Concepts like "Stream-based selective sampling" from papers on AL in manufacturing
4. Assign properties to new classes | Generate properties for AL ontology classes | QueryStrategy class properties: "hasUncertaintySampling", "queryByCommittee"
5. Refine and validate | Ensure definitions, resolve overlaps based on standards | Refined and validated domain-specific terminology

By prompting, LLMs generated nearly 200 fundamental AL concepts, structuring them into a hierarchy by leveraging their vast training data. Additionally, LLMs helped generate explicit definitions, assisting in verifying and refining concepts. However, after a point, LLMs began repeating concepts or producing less relevant terms. LLMs were also effective in generating domain-specific concepts through targeted queries. For instance, querying AL in manufacturing led to concepts like "uncertainty management" and "query efficiency." More specialized concepts required extraction from academic papers, which were cross-referenced with existing standards in DM, ML, and manufacturing (e.g., CRISP-DM, IEEE 7000 Series, ISA-95, ISO 15531). Ontology learning tools like Text2Onto and OntoLearn were combined with LLMs for cross-verification.

Step 4: Develop Ontology Extensions. LLMs helped create AL-related classes, properties, and relationships based on the identified concepts, using OWL-compliant syntax (see Fig. 2). By combining GPT-4's knowledge synthesis with Protege's structural reasoning and consistency checking, we improved the efficiency and accuracy of reviewing, debugging, and validating OWL code.

Figure 2: Screenshot of LLM-generated code defining the "LearningAlgorithm" class with properties "trainingData" and "validationData"

Step 5: Ensure Semantic Consistency. LLMs, such as GPT-4, assisted in ensuring semantic consistency by reviewing new and existing ontology elements and suggesting how new concepts could align with the existing framework. For example, an LLM suggested how an AL "QueryStrategy" class fits within the IOF Core ontology.

Example Prompt: "Review the new QueryStrategy class and suggest how it can align with the existing classes in IOF Core."

LLM Output: The QueryStrategy class aligns with decision-making aspects of the Process concept. Strategies such as "UncertaintySampling," "QueryByCommittee," "ExpectedModelChange," and "ExpectedErrorReduction" can be viewed as specialized decision-making processes within the broader process framework of IOF Core.

However, LLMs cannot guarantee logical consistency and face limitations in handling complex relationships, making it necessary to use ontology reasoners such as HermiT (run, for example, from Protege) to perform consistency checks.

Step 6: Map to Existing Ontologies. LLMs, such as GPT-4, assist in generating initial mapping suggestions by analyzing similarities in definitions, relationships, and properties between new and existing concepts. This involves creating relationships like "owl:sameAs", "owl:equivalentClass", and "owl:equivalentProperty".

Example LLM Output:
    :FeedbackMechanism a owl:Class ;
        owl:equivalentClass :ControlSystem ;
        rdfs:label "Feedback Mechanism" ;
        rdfs:comment "Mechanisms that provide feedback in Active Learning to control systems." .

While LLMs are effective in identifying high-level similarities, they may face challenges with complex or domain-specific relationships, requiring further refinement. Although we did not encounter these issues during our initial work extending IOF Core with AL concepts, we used Protege's alignment plug-ins to refine LLM-generated mappings. For more complex mappings, tools like AgreementMaker or COMA can further refine the suggestions.
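Pulling Steps 5 and 6 together, the suggested alignment could be serialized in Turtle roughly as follows. This is a sketch only: the iof: namespace and all IRIs are placeholders, not the actual IOF Core identifiers or the project's extension file:

    @prefix :     <http://example.org/al#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix iof:  <http://example.org/iof-core#> .   # placeholder namespace

    :QueryStrategy a owl:Class ;
        rdfs:subClassOf iof:Process ;   # aligned with the Process concept, per the LLM output
        rdfs:comment "Decision-making process that selects which instances to query." .

    :UncertaintySampling    a owl:Class ; rdfs:subClassOf :QueryStrategy .
    :QueryByCommittee       a owl:Class ; rdfs:subClassOf :QueryStrategy .
    :ExpectedModelChange    a owl:Class ; rdfs:subClassOf :QueryStrategy .
    :ExpectedErrorReduction a owl:Class ; rdfs:subClassOf :QueryStrategy .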
Step 5: Ensure Semantic Consistency. LLMs, such as GPT-4, assisted in ensuring semantic consistency by reviewing new and existing ontology elements and suggesting how new concepts could align with the existing framework. For example, an LLM suggested how an AL "QueryStrategy" class fits within the IOF Core ontology.

Example prompt: "Review the new QueryStrategy class and suggest how it can align with the existing classes in IOF Core."

LLM output: The QueryStrategy class aligns with decision-making aspects of the Process concept. Strategies such as "UncertaintySampling," "QueryByCommittee," "ExpectedModelChange," and "ExpectedErrorReduction" can be viewed as specialized decision-making processes within the broader process framework of IOF Core.

However, LLMs cannot guarantee logical consistency and face limitations in handling complex relationships, making it necessary to run ontology reasoners such as HermiT (e.g., from within Protege) to perform consistency checks.

Step 6: Map to Existing Ontologies. LLMs, such as GPT-4, assist in generating initial mapping suggestions by analyzing similarities in definitions, relationships, and properties between new and existing concepts. This involves creating relationships like "owl:sameAs," "owl:equivalentClass," and "owl:equivalentProperty".

Example LLM output:

    :FeedbackMechanism a owl:Class ;
        owl:equivalentClass :ControlSystem ;
        rdfs:label "Feedback Mechanism" ;
        rdfs:comment "Mechanisms that provide feedback in Active Learning to control systems." .

While LLMs are effective in identifying high-level similarities, they may face challenges with complex or domain-specific relationships, requiring further refinement. Although we did not encounter these issues during our initial work extending IOF Core with AL concepts, we used Protege's alignment plug-ins to refine LLM-generated mappings. For more complex mappings, tools like AgreementMaker or COMA can further refine the suggestions.

Step 7: Prototype and Test. LLMs, such as GPT-4, were prompted to generate validation scenarios, competency questions, and SPARQL queries based on the integrated AL concepts. For instance, a prompt like "Suggest validation scenarios for adaptive scheduling with Active Learning" helped us produce realistic test cases, including prototype code, descriptions of the initial setup, process flows, validation steps, and queries based on the newly integrated concepts. SPARQL queries generated by LLMs were executed in Protege with SPARQL plug-ins to assess the ontology's ability to retrieve relevant information and answer competency questions. However, some LLM-generated scenarios revealed limitations in domain-specific knowledge, resulting in generic outputs that required refinement. Additionally, LLMs struggled with modeling intricate relationships or complex data retrieval conditions, making human oversight essential for ensuring accuracy and thorough testing.
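As an illustration of how such a competency question can also be checked outside Protege, the sketch below runs an LLM-style SPARQL query over the extended ontology with rdflib. The file name, namespace, and query are hypothetical stand-ins for the paper's actual artifacts.

    # Illustrative sketch: executing a competency-question query with rdflib.
    # "iof_core_al.owl" and the namespace are placeholder assumptions.
    from rdflib import Graph

    g = Graph()
    g.parse("iof_core_al.owl")  # the extended IOF Core ontology (hypothetical file)

    # Competency question: which query strategies are defined as
    # specializations of QueryStrategy?
    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX al:   <http://example.org/iof-core-al#>
        SELECT ?strategy ?label WHERE {
            ?strategy rdfs:subClassOf al:QueryStrategy ;
                      rdfs:label ?label .
        }
    """
    for row in g.query(query):
        print(row.strategy, row.label)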
Step 8: Iterative Refinement. Following initial prototyping and testing, we gathered feedback from domain experts and users to further refine the ontology. Validation reports were uploaded to AskPDF Research Assistant (GPT-4), where LLMs reviewed the reports, extracted key improvement suggestions, and refined task lists. The LLM provided insights into areas where relationships or properties required adjustment and identified additional concepts that might have been overlooked.

Step 9: Document and Disseminate. LLMs like ChatGPT or Bard were instrumental in generating comprehensive documentation, including details on the ontology extensions. Additionally, LLMs contributed to drafting technical reports and research papers.

Using this methodology, we successfully extended the IOF Core ontology with Active Learning (AL) concepts. Future stages of the HumAIne project will focus on further validation and refinement, particularly during pilot case implementations.

4 Discussion

This study highlights LLMs' potential in ontology engineering by reducing manual effort and increasing efficiency. LLMs rapidly identified key ontologies like OntoDM and IOF Core and generated structured classes, properties, and relationships, reducing the need for manual OWL/RDF code generation and concept mapping. However, LLMs face challenges in domain-specific precision, requiring human oversight to refine outputs and address nuances in specialized fields. While tools like Protege excel at ensuring logical consistency, LLMs offer dynamic capabilities for generating new concepts and relationships. Despite these advantages, traditional tools like AgreementMaker and COMA are still necessary to refine and validate LLM-generated mappings.

One strategy to mitigate LLM limitations was iterative prompt engineering. We refined prompts for ontology search and extension tasks through multiple cycles of improvement. These cycles, with LLMs like GPT-4, involved clarifying questions, refining queries, and generating more focused outputs. An initial prompt for starting the cycle can be the following:

"Your role is my Prompt Creator. Your goal is to craft the best possible prompt for my needs. The prompt will be used by you, [LLM's name]. I want to write about: [keyword/topic]. Based on my input, you will now generate 3 sections: a) Revised prompt (clear, concise, and easily understood by you), b) Suggestions (on what details to include in the prompt to improve it), and c) Questions (ask any relevant questions to improve the prompt). We will continue this iterative process, with me providing additional information to you and you updating the prompt until it is complete."

After 4-5 cycles, the prompts were highly optimized, ensuring relevant outputs. This refinement process reduced inconsistencies and improved LLM-generated content across both the search and extension phases.

We integrated multiple LLMs, including Bing Chat (GPT-4), Google's Bard, and Perplexity AI, to cross-validate outputs, reducing errors and refining results. This ensured consistency in LLM-generated ontologies and mappings. To evaluate this multi-LLM approach, we propose the following metrics: Inter-Model Consistency (measures alignment between LLM outputs), Error Rate Reduction (tracks how often one LLM corrects another's errors), and Coverage of Relevant Concepts (assesses LLMs' ability to capture domain-specific concepts). Although these metrics provide a framework, formal measurements are yet to be implemented. Future stages will involve applying these metrics to validate ontology outputs and testing the extended ontologies in real-world applications. This hybrid method combines LLMs and traditional tools, ensuring both efficiency and accuracy in scalable ontology development.

5 Conclusions

This study demonstrates how LLMs can streamline ontology engineering by automating the search, analysis, and extension of domain-specific ontologies. Leveraging multiple LLMs, we successfully identified and extended key ontologies, including OntoDM and IOF Core, for the HumAIne project, improving efficiency in generating classes, properties, and relationships. While LLMs significantly enhance the process, they face challenges in domain-specific precision and require human oversight, particularly for complex relationships. Traditional tools like Protege and ontology reasoners remain critical for ensuring logical consistency and validation. Future work will focus on refining these extended ontologies through real-world pilot tests and applying evaluation metrics to LLM-generated outputs. This hybrid approach, combining LLM automation with traditional validation tools, offers a scalable solution that balances efficiency with the need for human expertise.

Acknowledgments

This work was supported by the European Commission under the Horizon Europe project HumAIne, Grant Agreement No. 101120218.

References

[1] Kommineni, Vamsi Krishna, Birgitta König-Ries and Sheeba Samuel. "From human experts to machines: An LLM supported approach to ontology and knowledge graph construction." ArXiv abs/2403.08345 (2024). DOI: https://doi.org/10.48550/arXiv.2403.08345
[2] Funk, Maurice, Simon Hosemann, Jean Christoph Jung and Carsten Lutz. "Towards Ontology Construction with Language Models." ArXiv abs/2309.09898 (2023). DOI: https://doi.org/10.48550/arXiv.2309.09898
On the Brazilian Observatory for Artificial Intelligence

Rafael Meira Silva, Luiz Costa, Alexandre Barbosa, Joao Paulo Candia Vieira, Joao Pita Costa, Cristina Godoy Oliveira
CETIC, OBIA, São Paulo, Brazil; CIAAM, C4AI, Univ. of São Paulo, São Paulo, Brazil; IRCAI, Quintelligence, Ljubljana, Slovenia
rafael@meirasilva.com.br, tuca@nic.br, alexandre@nic.br, candia@usp.br, Joao.pitacosta@quintelligence.com, cristinagodoy@usp.br

ABSTRACT
Artificial Intelligence (AI) is rapidly transforming industries and economies worldwide, with Brazil and South America emerging as significant players in this global shift. The fundamental need to monitor the impact of AI across verticals (for sustainable development, government engagement, investment, and society at large) motivated the Brazilian Artificial Intelligence Observatory (OBIA). OBIA is an integral part of the Brazilian Artificial Intelligence Plan (PBIA) and a former objective of the Brazilian Strategy of AI; it aims to become the leading platform for monitoring the uses of AI in the country. OBIA belongs to Axis 5 of the PBIA, focused on supporting the regulatory and governance process of AI. This paper explores the current state, challenges, and potential of AI development in the region, examining how technological advancements are influencing economic growth, societal change, and policy-making across South America, with a particular focus on Brazil as a leading hub of innovation. It also investigates common aspects of the research agendas shared with IRCAI's SDG Observatory, particularly regarding machine learning workflows and approaches complementing traditional and crowdsourced heterogeneous data collection and analysis.

KEYWORDS
Artificial Intelligence, Observatory, Survey Data Analysis, Complex Data Visualization, Multidisciplinary Collaboration.

1 Introduction

AI is increasingly shaping the economic landscape and societal dynamics across Brazil and South America, positioning the region as a growing hub for technological innovation. Despite challenges such as uneven infrastructure and regulatory hurdles, Brazil is making significant strides in AI research and development, contributing to the regulation and a better understanding of the impact of AI in South America. OBIA [5] answers this need, serving as a platform to support the strategy and other government actions with data on the uses and impacts of AI (see Figure 1).

Figure 1: Screenshot of OBIA showing some results on the preparedness of Brazilian industry to adopt AI workflows.

The objectives of OBIA include compiling, recording, and providing information related to Artificial Intelligence in Brazil, enabling analyses of its adoption and its main impacts on society. OBIA also has the mission of consolidating and disseminating knowledge about the repercussions of this technology, providing support to guide policies, strategies, and actions promoting the development and responsible use of AI. The observatory gathers Brazilian data on the use and adoption of Artificial Intelligence by different sectors, such as education, business, government, and health (see Figure 2). The currently available indicators rely mostly on traditional data sources, such as surveys and data sets made available to the team. The first product of OBIA is the book "Artificial Intelligence in Healthcare - Potentialities, Risks and Perspectives", published in July 2024. In a second line of action, OBIA functions as a repository of guiding documents in the area, originating from all parts of the world. In a third line, it acts as an "information exchange point" between the AI centers operating in Brazil: the IAX. All indicators collected will be public and can be accessed on the OBIA portal [4].
The Center for Artificial Intelligence (C4AI) at the University of São Paulo, funded by FAPESP (the public agency for research funding in the State of São Paulo) and IBM, participates in OBIA through its Humanities area. C4AI contributes qualitative research in the horizontal axes of "Legislation, Regulation, and Ethical Use" and "AI Governance," while also conducting studies across the various vertical axes to be monitored. The research group dedicated to this effort comprises scholars from the fields of law, computer science, electrical engineering, sociology, and political science, allowing for an interdisciplinary analysis of the key topics monitored by OBIA. This interdisciplinary approach will provide a comprehensive view of the current state of AI development and implementation in Brazil. Various reports, articles, and data will be provided to support OBIA in fulfilling its mission.

In addition to the participation of professionals from various departments, the Observatory has a network of external partners, including the Center for Management and Strategic Studies (CGEE), the São Paulo State System Data Analysis Foundation (SEADE), C4AI, CIAAM (Center of Artificial Intelligence and Machine Learning), and others. The following sections explore how C4AI contributes to OBIA through a complementary approach, focusing on the qualitative analysis of decisions by the São Paulo Court of Justice related to AI.

2 Data and Methodology

2.1 Legislation, Regulation, and Ethical Use: A Qualitative Analysis

The research presented in this paper is the basis of an action contributing to the implementation of the PBIA strategy [7], responsible for monitoring AI regulation and legislation. The research is divided into three main areas, the Executive, the Judiciary, and the Legislative branch, combining traditional and modern data collection methods. Regarding the Executive branch, monitoring is conducted through data scraping of government transparency websites based on a curated and continuously updated list of AI-related terms developed by the group. This monitoring aims to understand which AI systems are being purchased or contracted by public authorities. For the Judiciary, we have been analyzing court decisions from the São Paulo Court of Appeal (TJSP) related to AI, to understand judicial interpretations and rulings in the absence of specific AI legislation [2]. As of the latest data scraping in August 2024, more than 13,000 relevant decisions have been identified. Lastly, in relation to the Legislative branch, the group is closely following the progress of discussions on Bill 2338/2023, which focuses on AI regulation, by participating in public hearings and issuing technical notes to guide legislators. The goal is to expand this research to monitor AI-related legislation at the state and municipal levels, as many municipalities are legislating on the matter to prepare their cities to assume the role of "smart cities".

2.2 Monitoring and exploring the local data

To effectively monitor developments in AI, it is essential to establish a comprehensive list of AI-related terms that can guide data collection efforts. This list is derived from multiple sources, including scientific articles, standards like [3], and reports such as the OECD's [1]. The monitoring process involves monthly web scraping, on the 15th of each month, of court rulings from the TJSP (Judiciary) based on the AI-related terms list, and of data from the Brazilian Transparency Portal (Executive). For the Judiciary, the scrapes and data treatment are performed with scripts developed in the Python and R programming languages, based on the TJSP API by Jesus Filho (github.com/jjesusfilho/tjsp). For the Executive, a script was developed to scrape data from the Data Download section of the Brazilian Transparency Portal (portaldatransparencia.gov.br/download-de-dados); a sketch of this term-based filtering step is given below.
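The sketch below illustrates the kind of term-based filtering such a script performs: downloaded Transparency Portal records are kept only when they match an entry from the curated AI term list. It is a simplified stand-in, not the group's actual script; the file name, column name, and the three sample terms are hypothetical.

    # Illustrative sketch of term-based filtering over a downloaded CSV.
    # "contratos.csv", the "Objeto" column, and the sample terms are
    # placeholder assumptions, not the project's actual artifacts.
    import csv
    import unicodedata

    AI_TERMS = ["inteligencia artificial", "aprendizado de maquina", "reconhecimento facial"]

    def normalize(text: str) -> str:
        # Lowercase and strip accents so term matching is robust.
        stripped = unicodedata.normalize("NFKD", text.lower())
        return "".join(c for c in stripped if not unicodedata.combining(c))

    def matching_rows(path: str):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter=";"):
                description = normalize(row.get("Objeto", ""))
                if any(term in description for term in AI_TERMS):
                    yield row

    for hit in matching_rows("contratos.csv"):
        print(hit["Objeto"])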
Currently, we are developing an automation tool, based on NLP techniques, to enhance the qualitative analysis of these court rulings, allowing for more efficient identification and categorization of data relevant to AI research. The first approach for this automation tool uses a NER (Named Entity Recognition) model to automate the identification of relevant entities, including litigants and court judgments. The next step would be to apply a classification model, yet to be chosen, to filter out noisy data.
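As a concrete illustration of the planned NER step, the sketch below runs an off-the-shelf Portuguese spaCy model over a ruling excerpt to surface candidate entities such as parties and organizations. The model choice and the example sentence are our assumptions; the paper leaves the final NER and classification models open.

    # Illustrative sketch of the NER step, assuming spaCy's public
    # Portuguese model (python -m spacy download pt_core_news_lg).
    # The example text is invented; the project's model choice is open.
    import spacy

    nlp = spacy.load("pt_core_news_lg")

    excerpt = (
        "Apelacao interposta por Banco Exemplo S.A. contra sentenca que "
        "declarou nulo contrato de emprestimo firmado por biometria facial."
    )

    doc = nlp(excerpt)
    for ent in doc.ents:
        # Labels such as PER/ORG/LOC; litigants typically surface as PER or ORG.
        print(ent.text, ent.label_)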
The process of constructing the terms for web scraping is a critical step to ensure the relevance and accuracy of the data collected for AI research. This process begins with the development of a comprehensive list of AI-related terms, built from multiple authoritative sources. One primary source is the OECD report "Identifying and Measuring Developments in Artificial Intelligence," which offers a foundation of 226 AI-related terms identified through extensive analysis of scientific articles, open-source systems, and patents. Another source is the ISO/IEC 22989:2022 standard [3], which provides a framework for AI concepts and terminology. These terms are carefully selected, refined, and translated into Portuguese by experts working within the Brazilian Technical Standards Association (ABNT), to ensure that only terms that are highly relevant and specific to AI are included. Terms that are too general or contextually irrelevant, such as "transparency" (which could result in unrelated hits concerning Brazil's Access to Information Law), are excluded to avoid false positives in the scraping process. The final list, consisting of 103 terms in both English and Portuguese, is used to guide the web scraping data collection, allowing a focused and efficient retrieval of information that aligns with the specific research objectives.

Figure 2: Current dimensions of OBIA's monitoring topics

2.3 How to implement and classify repositories with reference documents and statistics?

As part of the data collection and structuring process for qualitative analysis, we are implementing and classifying repositories containing reference documents and statistics. These repositories will focus on key thematic areas, such as "Legislation, Regulation, and Ethical Use" and "AI Governance," and will be populated with data from sources like the TJSP, the Transparency Portal, and other relevant databases. By combining different methods, data retrieval becomes more efficient and targeted, ensuring the collection of relevant information. Web scraping supplements this process by capturing data unavailable through APIs, ensuring comprehensive coverage. The data is regularly updated, with documents classified by relevance to AI terms, creating a dynamic and organized repository (see Figure 3), as described in [6].

Figure 3: OBIA's guiding principles and expected results [6]

2.4 How to establish and maintain cooperation networks?

Establishing and maintaining cooperation networks requires fostering collaboration among interdisciplinary researchers from fields such as law, computer science, engineering, sociology, and political science. These networks are essential for sharing insights and methodologies related to AI monitoring. Using APIs and web scraping tools enables access to current data, supporting continuous knowledge exchange. Regular workshops, webinars, and joint research projects help keep participants engaged. Publishing reports, articles, and datasets strengthens the network and supports OBIA's mission to monitor AI developments comprehensively.

3 Discussion of initial results

As of June 28, 2024, a total of 13,064 decisions had been scraped from the São Paulo State Court of Justice based on AI-related terms. Out of the 103 terms searched, 45 returned at least one result. Figure 4 shows the monthly distribution of all results, while Figure 5 (logarithmic scale) displays the distribution of results by AI term. Both Portuguese and English terms were used for scraping. The top 15 terms with the most occurrences were analyzed over time, and Figure 6 presents the temporal evolution of these results by publication date.

Figure 4: Number of decisions per month, from January 2018 to June 2024.
Figure 5: Number of results per AI term (logarithmic scale).
Figure 6: Evolution of results by year for the top 15 terms.
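The aggregations behind Figures 4 and 5 can be reproduced with a few lines of pandas once the scraped rulings are in tabular form. The sketch below is illustrative; the file and column names are hypothetical assumptions about the data layout.

    # Illustrative sketch of the aggregations behind Figures 4 and 5.
    # "rulings.csv" with columns "publication_date" and "term" is a
    # placeholder assumption about the scraped data layout.
    import pandas as pd

    df = pd.read_csv("rulings.csv", parse_dates=["publication_date"])

    # Figure 4: number of decisions per month.
    per_month = df.set_index("publication_date").resample("MS").size()

    # Figure 5: number of results per AI term (plotted on a log scale
    # because counts range from 1 to several hundred).
    per_term = df["term"].value_counts()

    print(per_month.tail())
    print(per_term.head(15))  # the top 15 terms tracked over time in Figure 6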
The analysis Learning" in commercial disputes and credit issues rather than used analytical, comparative, and monographic methods, with solely technological matters. The rulings analyzed represent 79 decisions, rendered by collegiate bodies composed of multiple study when, e.g., capturing the attention of media on the terms magistrates. Each ruling follows a structured format: “criminal law” and “AI” in “Brazil” in the past 12 months, where Description and Qualification, covering aspects such as appeal, 1.4% exhibits discussions on Human Rights, and terms like case number, judicial district, presiding judge, and parties “democracy” and “discrimination” are within the top 30. When involved; Summary of the ruling; Report, offering a brief performing sentiment analysis over these results we can see description of the facts; Majority Opinion; and Dissenting large variations after the summer of 2022 with a Opinion (if applicable). The analysis was conducted with each predominantly negative sentiment regarding this search topic. of the 14 subcategories corresponding to columns in a single row: case number; type of appeal; reporting judge; district; judicial body; subject matter; judgment date; publication date; summary; parties; reasoning; final decision; context of term usage in the full text; and relevant jurisprudence. While the first nine categories were predefined based on the complete jurisprudence search, the remaining five were more subjective, created to enhance the understanding of the rulings' content and improve data visualization. Significant findings were noted in cases involving "Artificial Intelligence" and "Machine Learning," where the terms were often associated with commercial disputes, service contracts, or credit-related issues rather than purely technological applications. A recurrent theme in cases involving "Facial Biometrics" was the legality and validity of loan contracts signed through biometric recognition. The majority of decisions upheld the legality of such contracts, highlighting issues of consent and the technical reliability of biometric systems [1]. However, inconsistencies in judicial reasoning were identified, where similar cases had varying outcomes depending on the presiding judge. Overall, Figure 7: Significance of criminal law and AI in the news. the analysis highlighted several gaps and challenges in the legal treatment of AI-related technologies, particularly concerning ACKNOWLEDGMENTS transparency, fairness, and consumer protection. The study We would like to express our sincere gratitude to the Center for underlined the need for more consistent legal standards and Artificial Intelligence (C4AI) at the University of São Paulo better understanding among judges of the technical nuances (USP), supported by FAPESP and IBM, for their invaluable involved in AI applications to ensure fair and equitable rulings. support to the AI Observatory team. We thank to the CIAAM for their continued collaboration and contributions to this 4 Conclusions and further work research. We thank the support of the European Commission The qualitative research findings from the analysis of court project ELIAS - Lighthouse of AI for Sustainability (10080425). decisions related to AI reveal several key conclusions. AI- related terms such as "Facial Recognition," "Voice Recognition," REFERENCES and "Autonomous Systems" are frequently used in judicial [1] Baruffaldi, Stefano, et al. 
4 Conclusions and further work

The qualitative research findings from the analysis of court decisions related to AI reveal several key conclusions. AI-related terms such as "Facial Recognition," "Voice Recognition," and "Autonomous Systems" are frequently used in judicial contexts that extend beyond their traditional technological meanings, intersecting with areas like consumer protection, contract law, and fraud. The inconsistency in judicial reasoning and the varying outcomes in similar cases highlight the need for clearer legal frameworks and a deeper understanding of AI's technological implications among judges. Moving forward, the incorporation of NLP techniques into the analysis will help extract key arguments from judicial decisions, providing deeper insights into the legal discourse on AI. This will enhance the robustness of future research on AI regulation and its implications for public policy.

Furthermore, a preliminary analysis of news using the NLP capabilities of the Eventregistry.org system (see Figure 7) shows how this source can provide results complementary to the study: when capturing media attention on the terms "criminal law" and "AI" in Brazil over the past 12 months, 1.4% of the coverage exhibits discussions of human rights, and terms like "democracy" and "discrimination" appear within the top 30. Sentiment analysis over these results shows large variations after the summer of 2022, with a predominantly negative sentiment on this search topic.

Figure 7: Significance of criminal law and AI in the news.

ACKNOWLEDGMENTS

We would like to express our sincere gratitude to the Center for Artificial Intelligence (C4AI) at the University of São Paulo (USP), supported by FAPESP and IBM, for their invaluable support to the AI Observatory team. We thank CIAAM for their continued collaboration and contributions to this research. We also thank the support of the European Commission project ELIAS - Lighthouse of AI for Sustainability (10080425).

REFERENCES

[1] Baruffaldi, Stefano, et al. (2020) Identifying and measuring developments in artificial intelligence: Making the impossible possible. OECD.
[2] Cristina Godoy B. de Oliveira, Otávio de Paula Albuquerque, Emily Liene Belotti, Isabella Ferreira Lopes, Rodrigo Brandão de A. Silva, Glauco Arbix. Intelligent Systems: 12th Brazilian Conference, BRACIS 2023, Belo Horizonte, Brazil, September 25-29, 2023, Proceedings, Part I, pp. 18-32.
[3] ISO (2022) Information technology - Artificial intelligence - Artificial intelligence concepts and terminology. ISO/IEC 22989:2022. Available: https://www.iso.org/standard/74296.html [27 Aug 2024]
[4] Luiz Costa et al. (2024) The Brazilian Artificial Intelligence Observatory (OBIA). Available: https://www.obia.nic.br/ [27 Aug 2024]
[5] MCTI (2021). Brazilian Strategy of Artificial Intelligence. Available: ebia-documento_referencia_4-979_2021.pdf (www.gov.br) [7 Sep 2024]
[6] MCTI (2023). OBIA: Observatório Brasileiro de Inteligência Artificial. Available: https://www.gov.br/mcti/pt-br/acompanhe-o-mcti/transformacaodigital/arquivosinteligenciaartificial/1_ebia-reuniao-ro_7_24_05_2023_anexo_2_eixo2-pdf.pdf [27 Aug 2024]
[7] PBIA (2024). Brazilian Artificial Intelligence Plan. Available: https://www.gov.br/mcti/pt-br/acompanhe-o-mcti/noticias/2024/07/plano-brasileiro-de-ia-tera-supercomputador-e-investimento-de-r-23-bilhoes-em-quatro-anos/ia_para_o_bem_de_todos.pdf/view [7 Sep 2024]

The Occurrence of Incidents in the Use of Artificial Intelligence (Pojavljanje incidentov ob uporabi Umetne Inteligence)

Marko Grobelnik (marko.grobelnik@ijs.si), Besher M. Massri (m.besher.massri@gmail.com), Alenka Guček (alenka.gucek@ijs.si), Dunja Mladenić (dunja.mladenic@ijs.si)
Department for Artificial Intelligence, Jozef Stefan Institute, Ljubljana, Slovenia

Abstract
This paper presents the first results of using a system designed and developed in collaboration with the OECD for monitoring AI-related incidents. The main motivation behind these efforts is to support AI-related legislation and effective policymaking, as the system provides insights based on the collected data. The OECD AI Incidents Monitor documents AI incidents and hazards to help policymakers, AI practitioners, and all stakeholders worldwide gain valuable insights into the risks and harms of AI systems. The idea is that over time the system will help raise public awareness and establish a collective understanding of AI incidents and hazards, thus contributing to trustworthy AI.

Keywords
Artificial intelligence, data analysis, policy making, AI incidents

1 Introduction

With the ever wider use of artificial intelligence (AI), incidents related to its use are also occurring. Monitoring these incidents is essential for ensuring transparency and oversight and for developing policies that can prevent, or at least reduce, such incidents. The presented system acts as a tool that helps users track actual AI-related incidents in real time and provides an evidence base for shaping an incident-reporting framework and the related AI policy debates. By collecting detailed insights into each incident, it enables learning from past mistakes and promotes a safer and more responsible development and use of AI. It benefits the AI community by highlighting trends and areas that need attention or regulatory intervention.

An advantage of the system is that its data collection is automated, in contrast with similar repositories that are curated manually, such as the AIAAIC Repository [2]. The repository is freely accessible and intended for policymakers as well as AI developers, researchers, lawyers, and public organizations.

In the following, we present the methodology for monitoring incidents, demonstrate the operation of the system on a few real examples, present the stakeholders, and give some conclusions.
2 Methodology

The OECD methodology for monitoring AI incidents focuses on the identification and classification of incidents, providing insight into real-world developments and supporting the development of an incident-reporting framework. The starting point is the identification and classification of incidents reported in reputable international media, with the help of machine learning models, which enables building a reliable database (incidents are captured from 2014 onward). Despite these efforts, the captured incidents represent only a subset of all AI incidents globally. Incidents are classified by severity, industry, related AI principles (the OECD AI Principles [3]), types of harm, and affected stakeholders. The analysis is based on the titles, summaries, and first paragraphs of news articles, and the extracted data is used to build a reliable, objective, and high-quality database of AI-related incidents. The Event Registry system [4] serves as the news source.

The development of the system, to which we contributed, builds on the work of an international group of experts (the OECD Expert Group), which is developing a theoretical framework for incident reporting, defining the notion of an AI incident, and shaping the related terminology, such as an AI hazard and its potential consequences. The detailed methodology and definitions are explained on the OECD website: https://oecd.ai/en/incidents-methodology.
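To make the ingestion-and-classification idea concrete, the sketch below pulls candidate news articles through the Event Registry Python client and assigns tentative severity labels with an off-the-shelf zero-shot classifier. This only illustrates the pipeline shape under our own model choices; it is not the OECD monitor's actual implementation, and the API key is a placeholder.

    # Illustrative pipeline sketch, not the OECD monitor's implementation.
    # Assumes the eventregistry client and a generic zero-shot model; the
    # API key and label set are placeholders.
    from eventregistry import EventRegistry, QueryArticlesIter
    from transformers import pipeline

    er = EventRegistry(apiKey="YOUR_API_KEY")
    query = QueryArticlesIter(keywords="artificial intelligence incident", lang="eng")

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    severities = ["death", "injury", "hazard", "non-physical hazard"]

    for article in query.execQuery(er, maxItems=20):
        # Classify on the title plus opening text, mirroring the methodology's
        # use of titles, summaries, and first paragraphs.
        text = article["title"] + ". " + (article.get("body") or "")[:500]
        result = classifier(text, candidate_labels=severities)
        print(article["title"], "->", result["labels"][0])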
Figure 1: The landing page of the OECD AI Incidents Monitor (https://oecd.ai/en/incidents), showing the concept search interface, a visualization of incidents over time (bottom left; y-axis: number of incidents; x-axis: time, 2014 to today), and a statistical summary of incidents for the selected area (12,883 incidents and hazards reported on by 70,612 news articles).

3 AI Incidents Monitor

By the end of August 2024, the AI Incidents Monitor had detected over 12,000 AI-related incidents and hazards, as shown in Figure 1. The system is fully automatic: it detects incidents by scanning large amounts of data published in the news and then uses AI to determine what is labeled as an incident or a hazard. The landing page (Figure 1) shows a line chart of the growth of incidents over time (left) and the corresponding statistics (right). The user can choose between an absolute view of incidents (as in Figure 1) or select sub-areas in the corresponding menu. Looking more closely at Figure 1, the cumulative incidents (purple) and their three-month average (blue) are marked in different colors. The statistics on the right show the absolute number of incidents, the statistics for the last month, and the months with the highest values (February 2024). From the statistics on month-over-month, quarter-over-quarter, and year-over-year changes, we can see a drop in the number of incidents and hazards reported by the media in the last month compared to the previous month and the previous quarter.

3.1 An example analysis of incident occurrence

The system enables advanced filtering of AI incidents by the following categories: time, country, industry, AI principle, severity, harm type, affected stakeholders, and type of content search (see Figure 1). For example, the possible values for severity are: death, injury, hazard, and non-physical hazard, while the possible harm types are: physical, psychological, economic, reputational, public interest, human rights, and unknown.
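A minimal sketch of what such categorical filtering amounts to is given below, over toy in-memory incident records; the field names and sample records are invented for illustration and do not reflect the monitor's internal data model.

    # Minimal filtering sketch over invented incident records; field names
    # and values are illustrative, not the monitor's internal schema.
    from dataclasses import dataclass

    @dataclass
    class Incident:
        title: str
        country: str
        severity: str       # e.g. "death", "injury", "hazard", "non-physical hazard"
        harm_types: tuple   # e.g. ("economic", "reputational")

    incidents = [
        Incident("Chatbot leaks user data", "Slovenia", "non-physical hazard", ("economic",)),
        Incident("Autonomous forklift injury", "Germany", "injury", ("physical",)),
    ]

    def advanced_search(records, country=None, severity=None, harm=None):
        for r in records:
            if country and r.country != country:
                continue
            if severity and r.severity != severity:
                continue
            if harm and harm not in r.harm_types:
                continue
            yield r

    for hit in advanced_search(incidents, country="Slovenia"):
        print(hit.title)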
The system supports advanced concept search: for generative AI, for example, it reports statistics showing 2,302 incidents and hazards. One of the detected incidents concerns Apple and the development of an "AI personality" intended to replace Apple's existing Siri.

Beyond concepts, the user can further use the advanced search to pinpoint a desired subset of incidents, for instance by selecting the country associated with the reporting of AI incidents and hazards. Figure 2 shows an example search by the country category, for Slovenia. The system finds two incidents connected to Slovenia. The first concerns Microsoft's increased contribution to CO2 emissions. At first glance the connection to Slovenia is not obvious, but looking at the related news we find a mention of Slovenia: "…But the tech giant's electricity consumption last year rivaled that of a small European country—beating Slovenia easily." [6].

Each case is also semantically annotated. In Figure 2, the first case is marked as related to the AI principles of efficiency and sustainable development. Microsoft's actions can affect several stakeholders: the general public, businesses, workers, and governments (Affected Stakeholders, Figure 2). In addition, they pose a hazard to the environment, public interests, and human rights (Harm Type, Figure 2), and the case is classified as a non-physical hazard (Severity, Figure 2).

Detailed analyses collected in the recent report "Observatory of the social and ethical impact of artificial intelligence" [5] show that most incidents (96%) fall into the non-physical hazard category, yet they can have very serious psychological and financial consequences, including harassment, addiction, and reputational damage to individuals as well as institutions.

4 Stakeholders

The OECD AI Incidents Monitor (AIM) is a valuable tool designed for the various stakeholders involved in the development, regulation, and use of artificial intelligence. Potential users of the tool include policymakers, AI developers, researchers, legal experts, and public organizations.

Policymakers can use AIM to track and analyze real-time data on AI-related incidents worldwide, helping them design informed, evidence-based regulation.
The tool's ability to categorize incidents by severity, industry, and harm type is key to understanding the broader consequences of AI technologies and to designing policies that reduce risks.

AI developers and researchers can benefit from AIM by identifying common problems associated with AI systems. By studying the incidents recorded in AIM, they can improve their models to avoid similar problems and increase the safety and reliability of AI applications.

Legal experts can use AIM to gain insight into the evolving landscape of AI-related risks, which could be useful in legal cases or compliance assessments. Understanding past incidents and their legal consequences can guide the development of robust AI governance frameworks.

Finally, public organizations and advocacy groups can use AIM to monitor the societal impacts of artificial intelligence, ensuring that the public interest is protected. This can include analyzing patterns of AI incidents to advocate for better consumer protection and ethical standards in AI deployment.

5 Discussion

In this paper we presented the OECD AI Incidents Monitor, in whose development we participated. The system serves as a good resource for a wide range of users who want to understand and manage the risks associated with AI technologies. The system is being extended with additional data sources.

In the future, an open data-submission process is planned, which will complement the incident information obtained from the current sources. Further work also includes automatic analysis of the incident data for more comprehensive insight, including the automatic discovery of patterns such as chain reactions or effects across several industries at once. To verify the veracity of reported incidents, the system could combine information from several independent sources and use fake-news detection algorithms, as well as manual verification.

Acknowledgements

The described work was supported by the OECD and many of its international experts, the Slovenian Ministry of Digital Transformation, and the Slovenian Research and Innovation Agency under CRP V2-2272 and V5-2264.

References

[1] OECD AI Incidents Monitor (AIM). https://oecd.ai/en/incidents. August 2024.
[2] AIAAIC Repository. https://www.aiaaic.org/aiaaic-repository. August 2024.
[3] OECD AI Principles for trustworthy AI. https://oecd.ai/en/ai-principles. August 2024.
[4] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107-110.
[5] Richard Benjamins. Another Inconvenient Truth: The Societal Emergency of AI Incidents - We Should Do Something About It. https://www.odiseia.org/post/another-inconvenient-truth-the-societal-emergency-of-ai-incidents-we-should-do-something-about-it
[6] Microsoft's AI Push Imperils Climate Goal as Carbon Emissions Jump 30%. https://tanaka-preciousmetals.com/en/elements/news-cred-20240821/

Figure 2: Advanced search on the OECD AI Incidents Monitor (https://oecd.ai/en/incidents), filtered by country for Slovenia. Statistics are given for two incidents reported on by 25 news articles, and both incidents are shown below the statistics.

Perception of AI in Slovenia

Abdul Sittar (abdul.sittar@ijs.si), Alenka Guček (alenka.gucek@ijs.si), Dunja Mladenić (dunja.mladenic@ijs.si)
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia (Dunja Mladenić also Jožef Stefan International Postgraduate School)

Abstract
This paper introduces the AI News Monitor system, developed for real-time monitoring and analysis of artificial intelligence (AI) perception in global and local news media. Leveraging data from the Event Registry platform, the AI News Monitor tracks AI-related news articles across multiple dimensions, providing insights through three key views: a global historical overview, current global trends, and local trends specific to Slovenian media. The system facilitates both passive observation of AI discourse and active exploration of specific AI-related events. Our illustrative analysis reveals significant global trends, including heightened media focus on deep learning, generative AI, and robotics, and examines the implications of these trends for public trust in AI. Additionally, the paper discusses the practical applications of the AI News Monitor for stakeholders such as policymakers, journalists, business leaders, and researchers. We conclude with a discussion of the impact of media coverage on public perception of AI and propose possible future enhancements of the system, including broader language and source coverage.

Keywords
datasets, artificial intelligence, media monitoring, perception
1 Introduction

Artificial Intelligence (AI) is increasingly becoming an integral part of society, influencing various aspects of daily life and industries [4]. As AI continues to evolve, so does its portrayal in the media, which plays a critical role in shaping public perception and trust. Understanding how AI is perceived globally and locally is essential for policymakers, businesses, and researchers to ensure that AI technologies are developed and deployed in ways that are socially acceptable and trustworthy [3, 4].

In response to this need, we have developed the AI News Monitor, a system designed for real-time monitoring and exploratory analysis of AI-related news coverage. The AI News Monitor offers a comprehensive view of how AI is discussed in the media, capturing data from the Event Registry platform on a monthly basis [7]. The system is structured around three main views: a global overview that presents historical data from the past year, global trends that highlight recent AI-related events, and local trends focusing on mentions of AI by Slovenian news sources. These views allow users either to passively monitor ongoing developments in AI or to actively explore specific events and trends that may influence public opinion.

The main scientific contributions of this paper are the following:
(1) We present a methodology for understanding public perception of AI in the news.
(2) We analyse some trends in the perception of AI.

The remainder of the paper is structured as follows. Section 2 describes the methodology for collecting historical data and AI news categories and for gaining insight into public perception of AI in the news. Section 3 presents the analysis of trends in AI's perception. We present different user scenarios and possible applications of AI news monitoring in Section 4 and a discussion in Section 5. Section 6 concludes the paper and outlines possible areas of future work.

2 Methodology

The proposed approach to creating a web service for analyzing public perception involves two key steps: 1) identifying AI-related categories and gathering news within these categories, and 2) developing a web service that displays trends across these categories and news publishers and highlights current trends among both global and local (Slovenian) news sources (see Figure 1).

Figure 1: Architecture for real-time AI news monitoring and visualization, based on Event Registry and implemented using Flask and Plotly (front-end, back-end, and database components; BERT topic modeling; Plotly graphs serving the global overview, global trends, and local trends views).
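A minimal sketch of the Figure 1 architecture is shown below: a Flask back end that serves a Plotly figure as JSON for the front end to render. The route name and the in-memory sample data are our own placeholders, not the deployed service.

    # Minimal Flask + Plotly sketch of the Figure 1 architecture.
    # The route and sample data are placeholders, not the deployed service.
    from flask import Flask
    import plotly.express as px

    app = Flask(__name__)

    @app.route("/api/global-overview")
    def global_overview():
        # In the real system these counts would come from the database of
        # Event Registry articles; here they are hard-coded for illustration.
        months = ["2024-01", "2024-02", "2024-03"]
        counts = [950, 1200, 1800]
        fig = px.line(x=months, y=counts,
                      labels={"x": "month", "y": "AI-related articles"})
        return app.response_class(fig.to_json(), mimetype="application/json")

    if __name__ == "__main__":
        app.run(debug=True)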
Firstly, we selected AI-related categories based on the Slovenian AI observatory (http://siai.ijs.si/dashboards/Main/SlovenianObservatoryIntro?globalCountry=SVN) and Wikipedia (http://country-dashboards.ijs.si/dashboards/Main/Index?). The key categories associated with Artificial Intelligence include 'Generative AI', 'Artificial Intelligence', 'NLP', 'Chat-GPT', 'Deep Learning', 'Robotics', 'Computer Vision', 'Neural Networks', 'Graph Neural Networks', 'Self-supervised Learning', and 'Zero-shot Learning'.

Next, we collected news articles from the last year related to these categories. These articles were classified into the appropriate categories based on Wikipedia concepts, and we also obtained sentiment data from Event Registry. The portrayal of AI-related news significantly impacts public perception: the emphasis on risks, benefits, or ethical concerns shapes public opinion and drives narratives that can either build trust or instill fear [8], [12], [1].

To understand global trends, we retrieved news events published globally in the last month. For local trends, we focused on news articles published by the top 50 Slovenian news publishers. Finally, we employed topic models to analyze the corpus of news articles and extract the underlying themes [9], [2].
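The topic-extraction step can be reproduced in outline with the BERTopic library that the architecture diagram refers to. The sketch below is illustrative, assuming the articles are already available as a list of strings; the toy corpus and model settings are our defaults, not the system's.

    # Illustrative BERTopic sketch for the topic-extraction step; the
    # sample documents and settings are our own assumptions.
    from bertopic import BERTopic

    base = [
        "Runway launches Gen-2, a generative video model for short clips.",
        "Researchers use generative AI to build more versatile robots.",
        "Deepfakes raise concerns about elections and voter trust.",
    ]
    articles = base * 20  # toy corpus; in practice, the monthly article texts

    topic_model = BERTopic(language="english", min_topic_size=2)
    topics, probabilities = topic_model.fit_transform(articles)

    # Inspect the extracted themes and their top keywords.
    print(topic_model.get_topic_info())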
Figure 2: Time series of the number of news articles by specific areas (in colors, at the top), detailed view upon precise exploration (middle), and corresponding sentiment of news from specific areas (at the bottom).

3 Analysis of trends in AI's perception

3.1 Global Overview

The global overview provides a historical review of global AI-related news (see Figure 2). Users can explore the number of news articles across 13 AI fields (Generative AI, Chat-GPT, Deep Learning, Robotics, Computer Vision, Neural Networks, Graph Neural Networks, Artificial Intelligence, Federated Learning, Few-shot Learning, Meta Learning, Self-supervised Learning, and Zero-shot Learning) or by news provider, and get an overview of the sentiment of the news. Global trends allow for the review and exploration of global AI-related trends based on events captured in the last month. Figure 3 shows a detailed view of the global trends: a written report of the number of news articles and events, a histogram of the number of AI-related news articles over time, and the ability to explore the last 10 events in a selected field.

Figure 3: A detailed view of Global Trends, showing the option to select news events based on chosen AI fields.

3.2 Local Trends

Local trends allow for the review of news from Slovenian news providers for the last month. The local trends view shows a written report of the number of news articles and events, a histogram of the number of AI-related news articles over time, and the ability to explore further (see Figure 4).

Figure 4: A detailed view of Local Trends, showing the option to select news events based on chosen AI fields.

3.3 Examples of trends

3.3.1 Global Overview. In the historical overview of AI trends in March 2024 (Figure 2), there was a significant increase in the number of news articles and in interest in deep learning, generative AI, and robotics. Specifically, on March 18th there were 1,800 news articles about generative AI, 970 about robotics, and 274 about deep learning. This spike highlights several key events: one of the standout stories was the launch of Gen-2 by Runway, a generative video model capable of creating high-quality short clips. Another important topic was the use of AI in political campaigns, particularly the creation of deepfakes and misinformation, which raised concerns about AI's impact on elections and voter trust. In the field of robotics, researchers were inspired by advancements in generative AI to develop more versatile robots: these new robots can perform various tasks using a single, comprehensive model, demonstrating significant progress in robotic capabilities. Overall, the sentiment in March 2024 was positive (as seen from the sentiment analysis), reflecting enthusiasm and optimism regarding this technological progress. The increased media attention highlights the rapid development and growing importance of AI in various fields.
3.3.2 Global Trends. In our examination of global trends, we selected the news story "AI and heat waves pose dual threats to the power grid" and found that two specific newspapers published more articles on this topic than the others. The sentiment of these articles, as shown in the middle graph (Figure 4), fluctuates between positive and neutral. Delving into the content of these publications, we found that Forbes focused on the issue of fake news generated by AI, while Lexology explored future AI applications in various fields.

3.3.3 Local Trends. In the last month (at the time of writing, June 2024), there was an increase in AI-related news from Slovenian news providers, particularly from Delo.si and Sta.si (Figure 5). When analyzing the sentiment of these articles, most were neutral, with a few expressing positive opinions about AI. Delo.si focused on the growing adoption of AI by companies in Slovenia, highlighting discussions on the potential of quantum computing and recent advancements in AI technology; this coverage indicates a balanced view of AI's impact and potential. Sta.si reported on the construction of a state-of-the-art data center in Maribor, which will also house a supercomputer, a major development in Slovenia's technological infrastructure. Additionally, Sta.si wrote about AI trends that benefit semiconductor manufacturers, reflecting a positive outlook on the economic impact.

4 User Scenarios and Applications

The AI News Monitor can cater to a range of stakeholders with varying use case objectives [10], [6], [5]. Policymakers can use the system to track global and local trends in AI-related topics, enabling them to craft data-driven policies that balance innovation with societal concerns. Journalists can leverage the system to gather comprehensive insights into public sentiment and media coverage, enriching their reporting with accurate and timely information [11]. Other potential stakeholders are business executives, NGOs, researchers, and educators. Detailed scenarios for both policymakers and journalists are given below, illustrating how the AI News Monitor can support their specific goals.

Policy makers. Scenario: a policymaker uses the AI News Monitor to track trends in robotics.
Background: Jure, a decision-maker at a government agency for technology and innovation, is tasked with drafting new guidelines for the development and implementation of robotics in Slovenia. To understand the broader context and local trends, he needs to explore the global perception of robotics and compare it with local perspectives.
Steps: Step 1: Searching for a global overview. Jure logs into the AI News Monitor and searches for "robotics" under the global overview section. The system displays a line chart showing how robotics has been mentioned over time, along with a sentiment graph for the past year. He finds that robotics is globally discussed with mostly positive sentiment, particularly in Asia and North America. Step 2: Global trends. Jure selects "robotics" among the topics and reviews recent events on this subject. He chooses an event focusing on robotics in the EU and examines the sentiment of the publications and the main themes. In his browser, he looks at the specific articles and discovers that discussions predominantly revolve around automation and industrial robotics. Step 3: Local trends in Slovenia. Next, Jure is interested in a review for Slovenia, to understand how robotics is perceived at the local level. The dashboard for the selected topic displays an analysis of recent articles from Slovenian media. Using the browser, he discovers that discussions mainly focus on the impact of robotics on employment and the potential use of robots in healthcare. Jure finds that local concerns are more focused on social and economic impacts, and he includes these insights in his preparatory documents for the new guidelines. Step 4: Compiling the report and recommendations. Finally, Jure exports key data, including sentiment graphs and media summaries, from the AI News Monitor. He compiles a report that summarizes global trends and local concerns and proposes balanced guidelines that promote innovation in robotics while addressing social impacts.

Journalists. Scenario: a journalist uses the AI News Monitor to track trends in generative AI.
Background: Ana, a journalist at a technology magazine, is tasked with writing an article on the growing trend of using generative AI to create videos. She needs to explore both global trends and local perspectives in Slovenia to provide a comprehensive overview.
Steps: Step 1: Searching for a global overview. Ana searches for "generative AI" under the global overview section. The system displays a line chart showing that this topic is on the rise, identifies the media outlets reporting on generative AI, and provides a sentiment graph for the past year. Step 2: Global trends. Ana selects "generative AI" and reviews recent events on this topic. She focuses on deepfake video generation, checking who has written about it and what the main themes are, and then looks up these articles in her browser. Step 3: Local trends in Slovenia. Ana shifts her focus to Slovenia to understand local views. The dashboard reveals that Slovenian media coverage is largely positive, particularly for certain providers. However, Ana realizes the need to include concerns about authenticity and misinformation to provide a balanced perspective. Step 4: Compiling and writing. Ana exports key data, including sentiment graphs and media summaries, from the AI News Monitor. She drafts her article, starting with global trends and then delving into specific concerns in Slovenia, enriched with visual data.
Figure 5: Time series of the number of news articles by news provider in Slovenia (at the top), sentiment analysis (in the middle), and frequency of topics for this period (at the bottom).

5 Discussion

Services like the AI News Monitor can play a role in fostering greater transparency around AI by offering detailed insights into how AI is being discussed across various media platforms. By tracking public sentiment and highlighting both positive and negative trends, it helps ensure that the development and deployment of AI technologies are aligned with public concerns and expectations.

While the AI News Monitor offers valuable insights, it has limitations, such as its reliance on media reporting, which may not capture the full spectrum of public opinion. Additionally, potential biases in the media sources or in the algorithms used for sentiment analysis could skew the results, presenting challenges in ensuring a fully accurate and balanced representation of public perception.

6 Conclusions

The AI News Monitor was developed to understand and track public sentiment around AI, offering policymakers, journalists, and other stakeholders the insights needed to make informed decisions. AI perceptions can be monitored globally and locally, here in the context of Slovenia. However, there are opportunities for future work to enhance its capabilities: expanding its coverage to include more languages and diverse sources would provide a more global perspective, while refining the sentiment analysis techniques could improve accuracy and reduce potential biases.
7 Acknowledgments

This work was supported by the European Union through the AI4Gov (101094905) and TWON (101095095) EU Horizon Europe projects, the Ministry of Digital Transformation, and the Slovenian Research and Innovation Agency under CRP V2-2272.

References
[1] Iyad AlAgha. 2021. Topic modeling and sentiment analysis of Twitter discussions on COVID-19 from spatial and temporal perspectives. Journal of Information Science Theory and Practice, 9, 1, 35–53.
[2] David Alvarez-Melis and Martin Saveski. 2016. Topic modeling in Twitter: aggregating tweets by conversations. In Tenth International AAAI Conference on Web and Social Media.
[3] Stephen Cave, Kate Coughlan, and Kanta Dihal. 2019. "Scary robots": examining public responses to AI. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 331–337.
[4] Ethan Fast and Eric Horvitz. 2017. Long-term trends in the public perception of artificial intelligence. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, No. 1.
[5] Fabian Gilson, Matthias Galster, and François Georis. 2020. Generating use case scenarios from user stories. In Proceedings of the International Conference on Software and System Processes, 31–40.
[6] Debasish Kundu and Debasis Samanta. 2007. A novel approach of prioritizing use case scenarios. In 14th Asia-Pacific Software Engineering Conference (APSEC'07). IEEE, 542–549.
[7] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107–110.
[8] Kalle Lyytinen, Heikki Topi, and Jing Tang. 2021. Information systems curriculum analysis for the MaCuDE project. Communications of the Association for Information Systems, 49, 1, 38.
[9] Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. 2013. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 889–892.
[10] Frank Moisiadis. 2000. Prioritising use cases and scenarios. In Proceedings of the 37th International Conference on Technology of Object-Oriented Languages and Systems (TOOLS-Pacific 2000). IEEE, 108–119.
[11] Abdul Sittar, Daniela Major, Caio Mello, Dunja Mladenić, and Marko Grobelnik. 2022. Political and economic patterns in COVID-19 news: from lockdown to vaccination. IEEE Access, 10, 40036–40050.
[12] Abdul Sittar, Dunja Mladenić, and Marko Grobelnik. 2022. Analysis of information cascading and propagation barriers across distinctive news events. Journal of Intelligent Information Systems, 58, 1, 119–152.

What will happen tomorrow? Predicting future event types for businesses

Tesia Šker, Jožef Stefan Institute, Ljubljana, Slovenia, tesia.sker@gmail.com
Jože M. Rožanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, joze.rozanec@ijs.si
Gregor Leban, Event Registry d.o.o., Ljubljana, Slovenia, gregor@eventregistry.org
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si

Jože M. Rožanec and Tesia Šker are co-first authors with equal contribution and importance. Corresponding author: Jože M. Rožanec, joze.rozanec@ijs.si.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia
© 2024 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2024.sikdd.24

ABSTRACT

Strategic foresight helps organizations anticipate future challenges and opportunities, allowing them to handle uncertainty better. While strategic foresight is becoming more widely adopted across organizations, the process still heavily relies on expert knowledge, and little of it has been automated through artificial intelligence. In this research, we explore how media news events can be analyzed to forecast event types that will take place in the near future. In particular, we consider it a supervised machine learning problem with a well-defined set of event types and leverage a graph representation of the media news events to create graph embeddings, train a classifier, and predict event types that will likely occur one day ahead. We validated our approach on a real-world dataset of an American multinational conglomerate operating in industry, worker safety, healthcare, and consumer goods.

KEYWORDS

strategic foresight, event prediction, machine learning, graphs

1 INTRODUCTION

Strategic foresight helps organizations anticipate future challenges and opportunities, allowing them to handle uncertainty better [9]. Therefore, predicting future event types as part of strategic foresight has become necessary for businesses to manage their operations without significant losses. Various events on a major scale, such as floods, earthquakes, internet failures, or pandemics, as we have witnessed recently, or on a minor scale, such as road closures due to sports events or promotions at fairs, can have a major impact on business operations. By predicting the next event type, businesses can adjust prices, reschedule staff, manage stocks, reroute transportation to avoid delays, and more, and thus reduce losses or increase their sales and profits.

There is currently a massive number of articles written on future event prediction. Based on Zhao [11], event prediction methods can be classified in terms of goals into time prediction, location prediction, semantics prediction, and a combination of these. Each goal is divided into subgoals for which various techniques can be applied. According to the classification provided by Zhao, our technique can be classified as semantic prediction.

In this research, we explore how graphs can be used to model media news events and to forecast event types in the near future. By doing so, we provide a valuable tool for decision-makers, offering them a clearer view of potential outcomes. Specifically, our research focuses on using a JSON dataset containing a variety of articles about a particular business company. We create a graph representation of the articles and use Graph2Vec to create embeddings that can be used downstream to fit other machine-learning models. Using this information, we apply a Random Forest classifier to predict the categories of articles about the company for the following day. In particular, we expect this to be useful for giving organizations a competitive advantage in fast-changing markets [5]. While human expertise is valuable, it varies from person to person, leading to inconsistent predictions. Manually analyzing large datasets is also time-consuming and prone to errors. AI, however, can process vast amounts of data, spot patterns, and predict future event types more accurately.

This work is structured as follows. Section 2 presents related work relevant to this paper. Section 3 describes the data in the dataset and the data extraction process. Section 4 introduces a new approach to predicting future event types.
Section 5 presents the results of this research. Section 6 concludes this work and proposes future improvements.

2 RELATED WORK

In recent decades there has been an increasing interest in strategic foresight in the academic field. According to Fergnani (2020) [2], this is because by "using corporate foresight, organisations can reconfigure their strategy based on the analysis of business opportunities suggested by future possibilities". Even in academia, "one of the domains heavily impacted by Artificial Intelligence is innovation management and in this context especially the area of Strategic Foresight (SF)", as per Brandtner et al. (2021) [1]. However, it seems that strategic foresight methods related to AI only end up being used by bigger companies with a larger number of resources. As noted by Kim and Seo (2023) [6], "except for AI start-ups and players in the consumer electronics and information and communication industry, small- and medium-sized enterprises (hereafter SMEs) in other industries do not demonstrate competence in AI." Therefore, effective implementation of AI solutions for strategic foresight in small and medium-sized companies would be one of the topics to be explored in future research.

In this research, however, we focus on the general implementation of strategic foresight by means of next event prediction. Exploring similar fields, we found existing research on event prediction that, rather than focusing on businesses, focused on other domains. In the field of sequential event prediction, several researchers are exploring diverse methods. Although these methods share some conceptual similarities with our research, they differ significantly in methodology and focus.
Letham, Rudin, and Madigan (2013) [7] developed a model that predicts the next event using an ERM-based approach with logistic regression, focusing on the presence of events rather than their order. In contrast, our work uses labeled article databases and considers the sequence of past events, using techniques like graph construction, random walks, and random forests. Yeon, Kim, and Jang (2015) [10] focus on predicting event flow through visual analytics, using LDA for topic extraction and emphasizing specific keywords, while our approach is entirely text-based and relies on graphs. Hu et al. (2017) [4], on the other hand, use LSTM networks for predicting future subevents, which offers an alternative to our non-LSTM-based text analysis.

Although these studies provide useful insights and have offered significant improvements in sequential event prediction, they face certain challenges. For instance, Letham, Rudin, and Madigan (2013) [7] emphasize event presence over sequence, potentially missing key temporal relationships, while Yeon, Kim, and Jang (2015) [10] depend heavily on keywords, overlooking broader context. Additionally, LSTM-based models like those used by Hu et al. (2017) [4] are powerful but require significant computational power. In contrast, our work addresses these limitations by employing a graph-based approach that prioritizes event sequences and leverages standardized data from sources like DMOZ and Wikipedia. This enables us to make more accurate and efficient predictions, offering a practical and scalable solution.

3 DATASET

3.1 Data Extraction Pipeline

The event detection pipeline processes about 300,000 English news articles per day. Each news article is first annotated using tools like entity linking, topic classification, and sentiment detection. Each article is then split into sentences, where each sentence retains its annotations and other metadata. For each pair of entities in a sentence, an event classifier then determines whether a particular relation of interest is expressed in the sentence between the two entities. The predefined taxonomy currently includes 133 event types of interest, ranging from security, environment, natural disasters, accidents, and politics to other areas. To classify the events, a neural network transformer architecture with a pretrained encoder is used. The entire network, including the encoder, is trained on our supervised dataset using best practices like online hard example mining, class balancing, dropout, and consistency regularization. The sentences in which the classifier finds a relation of interest are then stored in a database, together with the pair of associated entities and other available metadata.

Figure 1: Sample of relevant data considered when parsing an event type to build the dataset.

3.2 Data Description

For our research, we used a dataset of events provided by Event Registry, with media events encoded in JSON format. Specifically, we analyzed 4,216 events related to the company 3M, recorded between June 23, 2021, and July 23, 2024. We used a URI to classify each event, drawing from DMOZ and Wikipedia categories (Fig. 1). These were selected because they provide standardized descriptions of the events being reported, which makes the data consistent and reliable. The events are categorized into 94 distinct types, which are further grouped into three primary domains: business, environment, and society. The business domain makes up the largest proportion of events, accounting for 65 types (69% of the total), while the environment and society domains contain 13 types (14%) and 16 types (16%), respectively. Within these domains, the event types are further divided into smaller subdomains, which can be aggregated into larger subdomain units, as demonstrated in the event type taxonomy (Fig. 2).

Figure 2: Event Type Taxonomy
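To make this dataset description concrete, the following minimal sketch tallies event types and their top-level domains from such a JSON export. The file name, the "eventType" field, and the domain-prefix convention are illustrative assumptions, since the exact Event Registry schema is not reproduced in the paper.

```python
import json
from collections import Counter

# Load a JSON export of media events and tally event types per domain.
# "events_3m.json" and the "eventType" field are hypothetical names.
with open("events_3m.json", encoding="utf-8") as f:
    events = json.load(f)

type_counts = Counter(e["eventType"] for e in events)
# Assume type URIs carry their domain as a prefix, e.g. "business/...".
domain_counts = Counter(e["eventType"].split("/")[0] for e in events)

print(f"{len(type_counts)} distinct event types")  # 94 in the paper's dataset
for domain, n in domain_counts.most_common():
    print(f"{domain}: {n} events ({100 * n / len(events):.1f}%)")
```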
4 METHODOLOGY

This study uses graph-based techniques to predict future event types from news articles about a specific company. The process starts by building a graph that maps relationships between event types and concepts from Wikipedia and DMOZ. Random walks are then performed on this graph to extract key information such as URIs, dates, and event types, which are then transformed into embeddings using Graph2Vec [8]. Next, the event types are encoded and adjusted through a process called target shifting. This step aligns the features to better forecast future outcomes based on previous data. The predictions are made using a Random Forest classifier, which is then validated through stratified k-fold cross-validation for higher accuracy. The following sections present each step of this process in more detail (see Fig. 4).

4.1 Graph Construction

For each article in the JSON dataset, a detailed graph G is generated using the NetworkX library [3]. The graph construction process starts by extracting key information such as the article's URI (unique identifier), as well as the date associated with the article and the event types, which are represented by specific URIs. In addition to these elements, each article also includes two important lists: 'slots' and 'categories'. The 'slots' list contains wiki and dmoz addresses that are directly related to the event described in the article, while the 'categories' list includes various classifications of the event. To complete the graph, labels are created by extracting URIs from the 'slots' list and filtering the 'categories' to focus on those with the "dmoz" prefix.
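The sketch below illustrates this per-article graph construction with NetworkX. The JSON field names ("uri", "date", "eventTypes", "slots", "categories") are assumptions based on the description above, not the authors' actual schema.

```python
import networkx as nx

def build_article_graph(article: dict) -> nx.Graph:
    """Build the per-article graph described in Section 4.1: the article's
    URI is linked to its event types, 'slots' (wiki/dmoz addresses), and
    dmoz-prefixed categories; the date is kept as a node attribute."""
    g = nx.Graph()
    uri = article["uri"]
    g.add_node(uri, date=article["date"])
    for event_type in article["eventTypes"]:
        g.add_edge(uri, event_type)
    for slot in article["slots"]:              # wiki and dmoz addresses
        g.add_edge(uri, slot)
    for category in article["categories"]:
        if category.startswith("dmoz"):        # keep only dmoz labels
            g.add_edge(uri, category)
    return g

# One graph per article, keyed by URI, as assumed by the later steps;
# `articles` would be the parsed JSON dataset:
# graphs = {a["uri"]: build_article_graph(a) for a in articles}
```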
4.2 Random Walks for Feature Extraction

Once the graphs for each article are constructed, random walks are performed, starting at a given node (event type) and moving to adjacent nodes based on specific probabilities. Several random walks are generated for each node, forming the foundation for feature extraction. A single random walk begins by initializing the path with the starting node and iterating over a specified path length. At each step, a random number is compared with a probability p. If the number is less than p, the walker stays at the current node; otherwise it moves to a random neighbor. If no neighbors are available, the walk ends. Generating multiple random walks for every node follows a similar approach, using p as the probability of staying at the current node (set at 0.05). The process involves creating an empty list to store all random walks and iterating through each node in the graph. For each node, the specified number of random walks is generated, and each walk is appended to the list.

4.3 Embedding Generation Using Graph2Vec

The random walks from the graphs are processed similarly to word sequences in a document. The 'embedding_data' function generates vector embeddings for graph data using the Doc2Vec model. It begins by converting each random walk into a TaggedDocument, storing these in 'documents_gensim'. The Doc2Vec model, with a vector size of 5, is trained on these documents, creating a vector space where similar sequences are positioned close together. The function then processes each graph in the graphs dictionary, extracting the URI, date, and event type, and generating additional random walks. These walks are converted into embeddings using the 'infer_vector' method, and the resulting vectors are averaged into one final embedding. This embedding is stored in a dictionary across 'embedding1' to 'embedding5', alongside the graph's metadata. A sketch of the walk and embedding steps is given below.
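This sketch combines Sections 4.2 and 4.3: a lazy random walk with stay probability p = 0.05, walks pooled into TaggedDocuments to train a Doc2Vec model with vector size 5, and per-graph embeddings obtained by averaging inferred walk vectors. The `graphs` dictionary comes from the construction step above; walk length and walks per node are illustrative defaults not specified in the paper.

```python
import random
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

STAY_P = 0.05  # probability of staying at the current node (Section 4.2)

def random_walk(g, start, length=10):
    """One lazy random walk: with probability STAY_P stay at the current
    node, otherwise move to a random neighbor; end early without neighbors."""
    path, node = [str(start)], start
    for _ in range(length):
        neighbors = list(g.neighbors(node))
        if not neighbors:
            break
        if random.random() >= STAY_P:
            node = random.choice(neighbors)
        path.append(str(node))
    return path

# Train Doc2Vec on walks pooled over all graphs (vector size 5, Section 4.3).
all_walks = [random_walk(g, n) for g in graphs.values() for n in g.nodes]
documents = [TaggedDocument(words=w, tags=[i]) for i, w in enumerate(all_walks)]
model = Doc2Vec(documents, vector_size=5, min_count=1)

def embed_graph(g, walks_per_node=5):
    """Average freshly inferred walk vectors into one graph embedding."""
    vecs = [model.infer_vector(random_walk(g, n))
            for n in g.nodes for _ in range(walks_per_node)]
    return np.mean(vecs, axis=0)

embeddings = {uri: embed_graph(g) for uri, g in graphs.items()}
```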
Figure 3: Event Type Graphs (panels a, b, c)

4.4 One Hot Encoding & Target Shifting

To transform the categorical event types into binary vectors, one-hot encoding is applied. This allows the model to treat each event type as a separate class. After extracting the relevant column names, the encoded target data is concatenated with the feature embeddings, creating a dataset for model training and evaluation. The dataset is then aggregated by averaging the embeddings and taking the maximum value of the encoded target columns for a given day. Finally, the 'target' data is shifted by one day, which allows the embeddings to forecast the event types for the following day.

4.5 Random Forest Classification & Stratified K-Fold Cross Validation

To ensure effective classification and prediction, a Random Forest classifier is created. When employing this method, the embeddings are used as features and the one-hot encoded event types are used as labels. The data is split into training and testing sets, followed by the incorporation of stratified k-fold cross-validation. This technique splits the data into 10 folds while ensuring that the event type proportion in each fold remains equal. The model is then trained on 9 folds, with the remaining fold being used for validation. This ensures balanced representation of each class across the folds, resulting in more reliable performance estimates.

5 RESULTS

As mentioned above, the model was trained on a training set and then evaluated on a test set. The training set included approximately 508 samples for each fold, and the test set included about 10% of the whole set, which amounted to 56 samples per fold. Using this, the model predicted the probabilities of event types for each set. When training the model for each class, we noticed that certain classes did not have enough occurrences to have at least one entry per dataset fold; these classes were skipped. We therefore trained the model and made predictions for a total of 45 classes.

To evaluate the discriminative performance of the model, the ROC AUC score was used. The results show how well the model distinguishes between different classes, as well as the model's ability to predict future event types. The average ROC AUC of the model was around 0.5674, and the median was close to it at 0.5559, with the highest score reaching 0.8194 and the lowest 0.3338. While the best scores demonstrate that we can effectively forecast some event types ahead of time, further work is required to enhance the results, which in most cases remain close to 0.5.

6 CONCLUSIONS

This study developed a graph-based approach to predicting event types in articles. In the process, we utilized random walks for feature extraction and Doc2Vec for embedding generation. Then, we trained a Random Forest classifier on the resulting embeddings and evaluated it with stratified k-fold cross-validation. The model demonstrated modest overall performance, with an average ROC AUC score of around 0.5674, reaching a peak of approximately 0.8194. This indicates some ability to capture relationships within the data and predict future event types. However, occasional fluctuations in accuracy suggest room for further improvement. We are currently striving to find ways to make the graphs more informative. In future work, we could refine the feature extraction process by incorporating larger datasets, with a wider variety of samples and a larger number of companies.

Figure 4: Data Extraction Pipeline

ACKNOWLEDGMENTS

The Slovenian Research Agency supported this work. This research was developed as part of the Graph-Massivizer project, funded under the Horizon Europe research and innovation programme of the European Union under grant agreement 101093202.

REFERENCES
[1] Patrick Brandtner and Marius Mates. 2021. Artificial intelligence in strategic foresight: current practices and future application potentials. In Proceedings of the 2021 12th International Conference on E-business, Management and Economics, 75–81.
[2] Alex Fergnani, Andy Hines, Alessandro Lanteri, and Mark Esposito. 2020. Corporate foresight in an ever-turbulent era. European Business Review 25 (2020), 26–33.
[3] Aric Hagberg, Pieter J. Swart, and Daniel A. Schult. 2008. Exploring network structure, dynamics, and function using NetworkX. Technical Report. Los Alamos National Laboratory (LANL), Los Alamos, NM (United States).
[4] Linmei Hu, Juanzi Li, Liqiang Nie, Xiao-Li Li, and Chao Shao. 2017. What happens next? Future subevent prediction using contextual hierarchical LSTM. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[5] Jon Iden, Leif B. Methlie, and Gunnar E. Christensen. 2017. The nature of strategic foresight research: a systematic literature review. Technological Forecasting and Social Change 116 (2017), 87–97. https://www.sciencedirect.com/science/article/pii/S0040162516306035
[6] Jong-Seok Kim and Dongsu Seo. 2023. Foresight and strategic decision-making framework from artificial intelligence technology development to utilization activities in small-and-medium-sized enterprises. Foresight 25, 6 (2023), 769–787.
[7] Benjamin Letham, Cynthia Rudin, and David Madigan. 2013. Sequential event prediction. Machine Learning 93 (2013), 357–380.
[8] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005 (2017).
[9] Freija van Duijne and Peter Bishop. 2018. Introduction to strategic foresight. Future 1 (2018), 67.
[10] Hanbyul Yeon, Seokyeon Kim, and Yun Jang. 2015. Visual analytics using topic composition for predicting event flow. KIISE Transactions on Computing Practices 21, 12 (2015), 768–773.
[11] Liang Zhao. 2021. Event prediction in the big data era. Comput. Surveys 54, 5 (2021), 1–37.
Generating Non-English Synthetic Medical Data Sets

Lenart Dolinar, University College London, London, United Kingdom
Erik Calcina, Jožef Stefan International Postgraduate School / Jožef Stefan Institute, Ljubljana, Slovenia
Erik Novak, Jožef Stefan International Postgraduate School / Jožef Stefan Institute, Ljubljana, Slovenia

Abstract

Using synthetic datasets to train medicine-focused machine learning models has been shown to enhance their performance; however, most research focuses on English texts. In this paper, we explore generating non-English synthetic medical texts. We propose a methodology for generating synthetic medical data, showcasing it by generating Greeklish medical texts relating to hypertension. We evaluate our approach with seven different language models and assess the quality of the datasets by training a classifier to distinguish between original and synthetic examples. We find that Llama-3 performs best for our task.

Keywords

Synthetic data, healthcare data, multilingual data, large language models, classification

1 Introduction

The healthcare domain produces a lot of medical data that can be used to train machine-learning models to help medical personnel. For example, a machine-learning model designed to perform Named Entity Recognition (NER) on electronic health records (EHRs) needs extensive labeled datasets to accurately identify medical terms like diseases, treatments, and patient details. However, the data contains a lot of personal information, and hospitals cannot share it freely due to data protection. In addition, there are not enough examples to train the models for some problems, such as those relating to rare diseases. Because of this, synthetic data is being used as a substitute to train the models.

Recently, synthetic medical data generated using LLMs has been used to enhance the performance of models for solving different natural language processing tasks in medicine. However, there are few examples of using them to generate non-English texts. Furthermore, language models have difficulties generating texts that do not reflect the distributions found in the training sample. This includes medical texts, which are usually not accessible to the general public.

This paper proposes a methodology for generating synthetic medical data using open-source large language models. We apply the methodology to a medical data set written in Greeklish, a combination of Greek and English scripts. We test it with seven large language models and assess performance by training a classifier to distinguish original examples from synthetic ones. Using the same prompt, we find that the open-source Llama-3 model best generates synthetic data that reflects the original data set.

The remainder of the paper is as follows: Section 2 presents the related work on generating synthetic data using large language models. Next, the proposed methodology is described in Section 3. The experiment setting is presented in Section 4, followed by the experiment results in Section 5. We discuss the results in Section 6 and conclude the paper in Section 7.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2024, 10–14 October 2024, Ljubljana, Slovenia
© 2024 Copyright held by the owner/author(s).
https://doi.org/10.70314/is.2024.sikdd.4
2 Related Work

This section describes the related work, focusing on large language models and methods for generating synthetic data.

2.1 Large language models

Large Language Models (LLMs) are models trained to generate human-like texts through an extensive training process over vast amounts of data. Models such as Llama 3 [2], GPT-4 [9], Aya 23 [3], and Mistral [7] are often easy to work with: the user provides an input textual prompt, based on which the model responds. LLMs are helpful in specialized fields such as medicine, since they can be fine-tuned on extensive data sets containing medical terms and concepts. This enables them to perform well in tasks such as medical synthetic data generation [12]. Despite that, they are sometimes unable to follow the instructions in the prompt accurately, leading them to hallucinate, i.e., confidently produce wrong responses [5]. In our experiments, we investigate the LLMs' performance in generating synthetic medical data given specific constraints and detailed prompts to simulate the original data set as closely as possible.

2.2 Synthetic medical data generation

Most synthetic data generation approaches focus on generating English texts. These usually utilize large language models trained on predominantly English documents retrieved from the web. One work focuses on generating a synthetic dataset of electronic health records of Alzheimer's Disease (AD) patients based on a provided label [8]. They find that the performance of their system for detecting AD-related signs and symptoms from EHRs improves vastly when trained on synthetic and original data sets, as opposed to training the system only on the original one. Another work investigated using LLMs for extracting structured information from unstructured healthcare text [13]. By generating synthetic data using LLMs and fine-tuning the model, they significantly improved the models' performance on medical named entity extraction and relation extraction tasks. Most related works focus on English synthetic data due to scarce non-English training data and the dominance of English in medical terminology [6]. This paper focuses on generating non-English texts, specifically medical texts written in Greeklish about hypertension.

3 Methodology

This section outlines our research methodology. We first present the pre-processing of the data set, followed by a description of the synthetic data generation process. Finally, we present the evaluation of the synthetic dataset using a classifier. Figure 1 shows a diagram giving an overview of the proposed methodology.

Figure 1: An overview of the methodology (patient records are translated to Latin script, synthetic patient records are generated with an LLM and prompt choice via DataDreamer, and the output is evaluated with a classifier and used to train an entity extraction LLM). The image was designed using resources from flaticon.
3.1 Data pre-processing

The data set used consisted of 1,299 examples of medical history in Greeklish, where the Latin and Greek scripts were used interchangeably. It also contained 1,495 labels, most of which were in English. The labels consisted of drugs, medical events, and measurements. To translate the labels into Greek, we used the NLLB-200 [14] translation model (https://huggingface.co/facebook/nllb-200-distilled-600M). Since LLMs are predominantly trained on texts written in Latin script, we decided to transliterate both the labels and the examples from Greek to Latin script. This allowed the LLMs to generate longer tokens with richer information. We split the original data set into two subsets to ensure no data leakage. The first one, consisting of 930 examples, was used for synthetic data generation. The second one, containing the remaining 369 examples, was used for evaluation.

3.2 Synthetic data generation

We utilized the datadreamer library [10] to generate the synthetic data set. The library enables open-source models to create synthetic data sets and was developed to work in research settings, supporting prompt templates and few-shot learning. We developed a prompt containing the instructions and restrictions on generating the examples. To better showcase the structure of the generated text, we also provided five random examples from the original data set as few-shot examples. Next, using datadreamer, we sent the prompt to the chosen LLM. We experimented with multiple LLMs, and about 800 examples were generated for each LLM. When experimenting with LLMs that required calling an external provider (e.g., OpenAI), we provided five static few-shot examples that did not include any patient personal data, due to data privacy concerns.

To ensure the quality of the generated data, we implemented a post-processing step. This included formatting the generated text into one line and excluding examples that were too long or where the model started repeating words meaninglessly. This ensured that all generated examples followed the same format and could be used for evaluation. Table 2 presents generated examples for the label "OSTEOPOROSH". Similarities in the examples highlight the need for rigorous methods to evaluate how closely they resemble the original data set; these methods are explained in Section 4.1.

3.3 Technical details

In this section, we describe the models and the parameters used in the experiment. All models used are available via HuggingFace's transformers library [15]. We tested five open-source models to generate the synthetic data sets, all of which can be run on a 32 GB GPU: Llama-3 [2] (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) only has support for the English language but has been fine-tuned to understand user prompts, a feature we expected would help a lot with the synthetic data generation. Aya-23 [3] (https://huggingface.co/CohereForAI/aya-23-8B) is a multilingual language model and offers support for 23 languages, including Greek. Mistral [7] (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) supports a variety of languages but omits Greek. The models Gemma-2 [4] (https://huggingface.co/google/gemma-2-9b-it) and Phi-3 [1] (https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) were also tested and compared in the experiments. In addition, we experimented with GPT-4o [9] and GPT-3.5-Turbo, which are accessible via the OpenAI API.

All models were given the same prompt, containing instructions that included: (1) generating Greek texts written in Latin script; (2) containing a label randomly selected from the original data set; (3) examples should be at most 6 words long; (4) responses should be concise; and (5) a structured format (all text must be in a single line, must use // and commas as separators, and must be similar in format to the provided few-shot examples). To stress the more important instructions, some were given in capital letters and were also repeated.
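As a rough illustration of Sections 3.2 and 3.3, the snippet below assembles such a prompt and generates one example with a locally run model via HuggingFace's transformers. This is a simplified stand-in for the authors' datadreamer workflow, and the instruction wording paraphrases the constraints above rather than reproducing the exact prompt; `original_examples` and `label_set` are assumed to come from the pre-processing step.

```python
import random
from transformers import pipeline

# Simplified stand-in for the datadreamer-based generation in Section 3.2.
generator = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3-8B-Instruct")

def build_prompt(original_examples, label_set):
    label = random.choice(label_set)                 # instruction (2)
    shots = "\n".join(random.sample(original_examples, 5))  # five few-shots
    return (
        "Generate ONE synthetic medical history entry in Greek written in "
        f"Latin script. It MUST contain the label '{label}'. Use AT MOST "
        "6 words, keep everything on a single line, and use // and commas "
        "as separators, matching the format of these examples:\n"
        f"{shots}\n"
    )

result = generator(build_prompt(original_examples, label_set),
                   max_new_tokens=40, do_sample=True)
print(result[0]["generated_text"])
```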
4 Experiment Setting

This section describes the experiment setting, which consists of the evaluation process and the metrics used to measure the approach's performance.

4.1 Evaluation approach

The quality of the generated synthetic data was measured in two parts. The first consisted of statistical measurements, such as calculating the average length of the generated examples and finding the proportion of examples that included the required labels. These statistics were then compared to the original data set. The second part consisted of training a classifier to discern whether the input text was from the original or the synthetic data set. The data set used to train and evaluate the classifier involved 369 randomly selected synthetic examples and 369 examples from the original data set, transliterated into Latin script. We chose 5-fold validation as our classification procedure and calculated the mean performance across all trials. The classifier was trained using the BERT [11] language model, specifically the bert-base-multilingual-cased variant (https://huggingface.co/google-bert/bert-base-multilingual-cased), with the following parameters: batch size = 16, epochs = 3, and learning rate = 2e-5. The same parameters were used for all synthetic data sets.

4.2 Metrics

To assess the quality of the generated synthetic data sets, we used the F1 score as our main metric for evaluating the classifier's performance. The target value was 0.5: if the performance is greater than 0.5, the classifier can discern the original from the synthetic examples, and hence the synthetic data does not reflect the original data set. If the performance is less than 0.5, the classifier has difficulties separating the synthetic from the original data, which can be because the synthetic data contains copies of the original examples. In addition to the F1 score, we measured the classifier's accuracy, precision, and recall, which are also reported.
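A condensed sketch of this evaluation protocol is shown below, assuming `texts` and `labels` (0 = original, 1 = synthetic) hold the 738 prepared examples. It fine-tunes the multilingual BERT variant with the stated hyperparameters inside a 5-fold loop; this is an outline of the described setup, not the authors' exact code.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL = "google-bert/bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(MODEL)

def to_dataset(idx):
    ds = Dataset.from_dict({"text": [texts[i] for i in idx],
                            "label": [labels[i] for i in idx]})
    return ds.map(lambda b: tok(b["text"], truncation=True), batched=True)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(texts, labels):
    model = AutoModelForSequenceClassification.from_pretrained(MODEL,
                                                               num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="clf",
                               per_device_train_batch_size=16,  # batch size 16
                               num_train_epochs=3,              # 3 epochs
                               learning_rate=2e-5),             # lr 2e-5
        train_dataset=to_dataset(train_idx),
        data_collator=DataCollatorWithPadding(tok))
    trainer.train()
    preds = trainer.predict(to_dataset(test_idx)).predictions.argmax(axis=-1)
    scores.append(f1_score([labels[i] for i in test_idx], preds))

# Mean F1 near 0.5 means the synthetic data is hard to tell apart from
# the original; well above 0.5 means it is easily detected.
print(f"mean F1: {np.mean(scores):.3f}")
```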
5 Results

In this section, we present the results of our experiment. We first present the statistical results, followed by the classifier's evaluation.

5.1 Statistical analysis

Table 1 compares the synthetic data sets and the original one regarding label occurrence and average example length. The label occurrence is 1.000 in the original data set, as all examples from the original data set are assumed to include the relevant labels and information. The most aligned synthetic data set regarding label occurrence was generated using GPT-4o, followed by Llama-3. However, in terms of average example length, the data set generated using Gemma-2 performed best, followed by Llama-3. The worst-performing models in terms of label occurrence were Mistral and Phi-3, which in about 25% of cases did not include the selected label. The data set generated using Aya-23 had the largest difference in terms of average example length, on average generating examples with three extra words. Looking at both statistics, we can conclude that Llama-3 had the best alignment to the original data set in terms of label occurrence and example length, closely followed by GPT-4o. To better illustrate the differences between the generated examples, we handpicked an example from each synthetic data set related to the label "OSTEOPOROSH", shown in Table 2.

Table 1: Statistical comparison between the original and synthetic data sets. The bold and underlined values represent the best and second-best statistics, respectively.

LLM               Label occurrence   Avg example length
original dataset  1.000              4.682
Llama-3           0.990              5.330 (+0.648)
Aya-23            0.949              8.040 (+3.358)
Mistral           0.740              6.376 (+1.694)
Gemma-2           0.988              4.207 (-0.475)
Phi-3             0.782              6.071 (+1.389)
GPT-4o            0.996              3.691 (-0.991)
GPT-3.5-Turbo     0.867              6.764 (+2.082)
5.2 The classifier evaluation

Table 3 shows the F1, precision, recall, and accuracy of the trained classifier on the different synthetic data sets. The best performance was achieved by Mistral, with approximately 0.85 in all four metrics, followed by Llama-3, with approximately 0.88 in all metrics. The worst performances were on the data sets generated by the Aya-23 and GPT-3.5-Turbo models. Surprisingly, Aya-23 is a language model supporting Greek; thus, it was expected to generate better examples.

Table 2: Generated examples for label "OSTEOPOROSH".

LLM               Examples
original dataset  APO 2O ETON YPERTASH ME AGOGI// OSTEOPOROSH // YPOTHYROIDISMOS
Llama-3           YPOTHYROEIDISMOS, OSTEOPOROSH, APO//
Aya-23            CA ORTHOU, ANEYRISMA KOILAKHS AORTHOU, OSTEOPOROSH.
Mistral           OSTEOPOROSH, APO 60 ETOS, APO 2 MHNES KAI APO 10 GRAMM
Gemma-2           OSTEOPOROSH, ARTHROSITIS, ETOVIR
Phi-3             OSTEOPOROSH, XAROSTHROMA, ALPHA-BISFIOVITINI, 2018, DIATHRHSH, DIA
GPT-4o            OSTEOPOROSH, ANEMIA
GPT-3.5-Turbo     OSTEOPOROSH, GASTREKTOMH, EMFISIMA, YDRONERFOSI, PSIXROS.

Table 3: Mean performance metrics of the classifier for synthetic data sets, with standard deviation. Performances closer to 0.5 are considered better. The bold and underlined values represent the best and second-best performances, respectively.

LLM            F1              Precision       Recall          Accuracy
Llama-3        0.875 ± 0.021   0.881 ± 0.020   0.875 ± 0.020   0.875 ± 0.020
Aya-23         0.945 ± 0.005   0.947 ± 0.004   0.945 ± 0.005   0.945 ± 0.005
Mistral        0.848 ± 0.012   0.856 ± 0.001   0.849 ± 0.011   0.849 ± 0.011
Gemma-2        0.928 ± 0.005   0.930 ± 0.005   0.928 ± 0.005   0.928 ± 0.005
Phi-3          0.927 ± 0.009   0.932 ± 0.008   0.927 ± 0.009   0.927 ± 0.009
GPT-4o         0.906 ± 0.014   0.912 ± 0.012   0.907 ± 0.014   0.907 ± 0.014
GPT-3.5-Turbo  0.940 ± 0.013   0.944 ± 0.011   0.940 ± 0.013   0.940 ± 0.013

6 Discussion

This section discusses the synthetic data generation performance, outlines our methodology's limitations and drawbacks, and proposes potential improvements to the approach.

6.1 LLM performance

The results in Table 1 show significant quality differences among the synthetic datasets from different LLMs, with label occurrence ranging from 0.740 for Mistral to 0.996 for GPT-4o, and average example length from 3.691 for GPT-4o to 8.040 for Aya-23. However, Table 3 indicates no significant performance differences within a single synthetic dataset, with the maximal standard deviation of the metrics being 0.021, for the Llama-3 dataset. We can also notice that the F1 and accuracy scores are very close for all synthetic data sets. This means the classifier was likely performing similarly on both classes (synthetic and original) without significant bias toward either class. We can observe much better performance on the Llama-3 data set, which is primarily trained on English data, than on the Aya-23 data set, which is also trained on Greek data. This shows that a model does not need to be extensively trained on Greek texts to generate this type of synthetic medical data well.

6.2 Limitations

Due to limited computing power, only one GPU with 32 GB of memory was available, restricting the testing of larger LLMs. To address these challenges, cloud-based resources or distributed computing could help run larger models and improve the variety of the generated synthetic data. Due to privacy concerns, when using the GPT-4o and GPT-3.5-Turbo models, which are not locally run, we had to use five fixed examples when generating synthetic data instead of a larger variety. This potentially led to larger similarities of the GPT synthetic datasets to those fixed examples instead of the original dataset and, consequently, worse performance.

6.3 Potential improvements

The prompt was the same for all seven LLMs and was primarily tested on Llama-3. Hence, the performance might be biased towards that model. The method could be improved by tailoring the prompts to each model individually. The evaluation of the synthetic datasets could be further extended by checking for repeated examples in the synthetic dataset or by checking how different each generated example is from the five provided examples. The evaluation could also be improved by checking for overfitting to the original data set.

7 Conclusion and Future Work

This paper presents a method for generating Greek synthetic medical data sets. To synthetically create datasets similar to the original, we carefully craft a prompt and perform pre-processing and post-processing of the data to increase performance and eliminate the effect of hallucinations. Using a classifier, and considering the inclusion of labels and the generated text length, we conclude that Llama-3 is best for generating examples that most closely resemble the original dataset. In the future, we plan to explore the underlying architectures of the models to understand their performance differences in multilingual contexts. This will allow us to further refine our methods and create more accurate data sets. Furthermore, we intend to use the synthetic dataset to train a named entity recognition (NER) system to recognize medical labels in medical history examples. Measuring the performance of a NER system trained on synthetic datasets will give us another way of evaluating their quality. We also intend to create a more general pipeline enabling the code to generate synthetic medical data in a wider variety of languages and formats.
Acknowledgments

This work was supported by the Slovenian Research Agency. Funded by the European Union. UK participants in Horizon Europe Project PREPARE are supported by UKRI grant number 10086219 (Trilateral Research). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency (HADEA) or UKRI. Neither the European Union nor the granting authority nor UKRI can be held responsible for them. Grant Agreement 101080288 PREPARE HORIZON-HLTH-2022-TOOL-12-01.

References
[1] Marah Abdin et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. 2024. arXiv: 2404.14219 [cs.CL]. url: https://arxiv.org/abs/2404.14219.
[2] AI@Meta. "Llama 3 Model Card". In: (2024). url: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[3] Viraat Aryabumi et al. Aya 23: Open Weight Releases to Further Multilingual Progress. 2024. arXiv: 2405.15032 [cs.CL].
[4] Google DeepMind Gemma Team. Gemma 2: Improving Open Language Models at a Practical Size. 2024. url: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf.
[5] Xu Guo and Yiqiang Chen. Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. 2024. arXiv: 2403.04190 [cs.LG]. url: https://arxiv.org/abs/2403.04190.
[6] Rainer Hamel. "The dominance of English in the international scientific periodical literature and the future of language use in science". In: AILA Review 20 (Dec. 2007), pp. 53–71. doi: 10.1075/aila.20.06ham.
[7] Albert Q. Jiang et al. Mistral 7B. 2023. arXiv: 2310.06825 [cs.CL]. url: https://arxiv.org/abs/2310.06825.
[8] Rumeng Li, Xun Wang, and Hong Yu. "Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data". In: Findings of the Association for Computational Linguistics: EMNLP 2023. 2023, pp. 7129–7143. doi: 10.18653/v1/2023.findings-emnlp.474.
[9] OpenAI et al. GPT-4 Technical Report. 2024. arXiv: 2303.08774 [cs.CL]. url: https://arxiv.org/abs/2303.08774.
[10] Ajay Patel, Colin Raffel, and Chris Callison-Burch. DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. 2024. arXiv: 2402.10379 [cs.CL]. url: https://arxiv.org/abs/2402.10379.
[11] Telmo Pires, Eva Schlinger, and Dan Garrette. "How Multilingual is Multilingual BERT?" In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 4996–5001. doi: 10.18653/v1/P19-1493.
[12] Karan Singhal et al. "Large language models encode clinical knowledge". In: Nature 620 (2023), pp. 172–180. doi: 10.1038/s41586-023-06291-2.
[13] Ruixiang Tang et al. Does Synthetic Data Generation of LLMs Help Clinical Text Mining? 2023. arXiv: 2303.04360 [cs.CL]. url: https://arxiv.org/abs/2303.04360.
[14] NLLB Team et al. No Language Left Behind: Scaling Human-Centered Machine Translation. 2022. arXiv: 2207.04672 [cs.CL]. url: https://arxiv.org/abs/2207.04672.
[15] Thomas Wolf et al. "Transformers: State-of-the-art natural language processing". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, 2020, pp. 38–45. doi: 10.18653/v1/2020.emnlp-demos.6.

LLNewsBias: A Multilingual News Dataset for Lifelong Learning

Swati Swati, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, swati.swati@unibw.de
Dunja Mladenić, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, dunja.mladenic@ijs.si
Abstract

The rise of digital media enhances information accessibility but also introduces challenges related to the quality and impartiality of news reporting, particularly regarding biases that influence public perception during key global events. In response, this study introduces LLNewsBias, a dataset designed to detect and analyze political bias in multilingual news headlines, covering four major events from 2019 to 2022 — Brexit, COVID-19, the 2020 U.S. election, and the Ukraine-Russia war. With over 350,000 headlines in 17 languages, annotated with bias labels, this dataset is compiled using Media Bias/Fact Check and Event Registry. Our contributions include a structured framework for data collection and organization, enabling event-wise and year-wise analysis while supporting lifelong learning. We also highlight potential use cases that demonstrate the dataset's utility in advancing bias prediction models, multilingual adaptation, and model robustness. Additionally, we discuss the dataset's limitations, addressing potential biases, sample size constraints, and contextual factors. This work provides a valuable resource for improving bias detection in dynamic, multilingual news environments, contributing to the development of more accurate and adaptable models in natural language processing and media studies. For code and additional insights, visit: https://github.com/Swati17293/LLNewsBias

Keywords

Dataset, News, Bias, Multilingual, Headline, Low-resource, Media Bias, News Bias, Continual Learning, Lifelong Learning

1 Introduction

The rapid growth of digital media has greatly enhanced the accessibility of information, but it has also introduced significant challenges concerning the quality and impartiality of news reporting. Political bias in news content is particularly concerning, as it has the potential to influence public perception and shape societal narratives, especially around key global events. Understanding and predicting such biases, particularly in multilingual contexts where biases can manifest differently across cultural and linguistic boundaries, is essential for promoting fair and balanced journalism. Traditional approaches to bias detection often rely on monolingual datasets and static models that may not effectively capture the evolving nature of news content [6]. These limitations underscore the need for more robust datasets and methodologies that can adapt to the dynamic and multilingual landscape of modern news reporting.

In this study, we address these challenges by introducing a novel dataset, LLNewsBias, specifically designed for the detection and analysis of political bias in multilingual news headlines. Our dataset spans four major global events from 2019 to 2022: Brexit, COVID-19, the 2020 U.S. election, and the Ukraine-Russia war, capturing a wide range of political discourse across 17 languages. To collect this dataset, we use Media Bias/Fact Check for the assignment of bias labels, and Event Registry [2] for the extraction of relevant headlines and metadata. The resulting dataset is not only comprehensive in its linguistic diversity but also structured to support both event-wise and year-wise analyses, with an emphasis on lifelong learning.
Under- support for lifelong learning, our study contributes to the ongoing standing and predicting such biases, particularly in multilingual effort to develop more accurate and adaptable models for bias contexts where biases can manifest differently across cultural and detection in diverse linguistic and cultural contexts. linguistic boundaries, is essential for promoting fair and balanced journalism. Traditional approaches to bias detection often rely 2 Related Work on monolingual datasets and static models that may not effec- tively capture the evolving nature of news content [6]. These Several datasets focus on news articles and political bias [5], limitations underscore the need for more robust datasets and but there is a notable scarcity of multilingual, bias-annotated methodologies that can adapt to the dynamic and multilingual datasets designed for lifelong learning [4]. While resources like landscape of modern news reporting. the media bias chart by Ad Fontes Media and PolitiFact provide insights into bias, they are often limited to English-language Permission to make digital or hard copies of all or part of this work for personal sources or specific fact-checked claims, lacking the continuous, or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and event-centric data necessary for broader analysis. GDELT [3], the full citation on the first page. Copyrights for third-party components of this a large-scale event-oriented news dataset, covers multiple lan-work must be honored. For all other uses, contact the owner/author(s). guages but focuses on location, network, and temporal attributes Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia rather than political bias or the event-outlet relationship. Exist- © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.8 ing multilingual datasets are often domain-specific [1], limiting 97 Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia Swati et al. their utility for general bias analysis. In contrast, LLNewsBias exclude outlets labeled as questionable and assign each remain- dataset fills these gaps by offering a generalized, multilingual, ing outlet 𝑜 ∈ 𝑂 a bias label 𝑏 ∈ 𝐵, where 𝐵 = {𝑏 } 𝑖 𝑖 1, 𝑏2, ..., 𝑏𝑞 and bias-annotated data designed for event-wise and year-wise represents the set of bias labels, with 𝑞 representing the number analyses, particularly suited for lifelong learning models. of distinct bias labels. Next, we define a temporal query 𝑄 to extract article headlines 𝑡 3 Dataset Description (𝐻 = {ℎ1, ℎ2, ..., ℎ }), where 𝑟 represents the total number of 𝑟 headlines retrieved from the Event Registry (ER). The query 𝑄 In this section, we introduce our dataset LLNewsBias and describe 𝑡 is formulated as: the framework used for its collection and organization. We begin by detailing the primary data sources that form the foundation of 𝑄 = {𝑄 , 𝑄 , 𝑄 , 𝑄 } (1) 𝑡 𝑒 𝑜 𝑐𝑎𝑡 𝑑 𝑡 this dataset. Following this, we present a comprehensive overview where 𝑄 , 𝑄 , 𝑄 specify the event, media outlet, and news 𝑒 𝑜 𝑐𝑎𝑡 of the data collection process, with a focus on the methodologies categories (limited to those classified as ’news’ by ER 𝑄 = 𝑐𝑎𝑡 employed to ensure robustness and reliability. 
3 Dataset Description

In this section, we introduce our dataset LLNewsBias and describe the framework used for its collection and organization. We begin by detailing the primary data sources that form the foundation of this dataset. Following this, we present a comprehensive overview of the data collection process, with a focus on the methodologies employed to ensure robustness and reliability. Finally, we provide an in-depth overview of the dataset's structure, including its directory organization, file contents, and the various ordering methods applied to facilitate detailed analysis. Our dataset is documented in accordance with the FAIR Data Principles.

3.1 Primary Data Sources

In this section, we outline the two primary data sources used in our study: Media Bias/Fact Check (MBFC) and Event Registry (ER). MBFC serves as the bias rating portal, providing bias labels for selected media outlets, while ER is used to extract the headlines and corresponding metadata from articles published by these outlets.

3.1.1 Media Bias/Fact Check. For bias labeling in this study, we utilized Media Bias/Fact Check (MBFC), a well-established platform known for its comprehensive coverage and frequent updates. Although other platforms like allsides.com and adfontesmedia.com also provide bias ratings, MBFC was selected for its reliability and particular focus on low-resource languages. MBFC assigns bias labels based on political orientation and evaluates outlets for credibility and factual accuracy. These labels are determined by a team of contractors and volunteers who follow a standardized methodology, ensuring that the ratings are both consistent and dependable for our analysis.

3.1.2 Event Registry. In this study, we use the Event Registry [2] platform as the primary source for collecting multilingual news headlines. It aggregates content from over 150,000 news sources across more than 60 languages, making it an ideal resource for analyzing bias in diverse and low-resource languages. Apart from the headlines, it provides access to rich metadata such as the publication date, news category, and political bias. By leveraging its Python API, we efficiently filtered and extracted headlines relevant to our study. This ensured a comprehensive dataset that supports the analysis of bias in a lifelong learning setup, exploring how emerging events and domain shifts influence the performance of bias prediction models over time.

3.2 Data Collection Framework

Our data collection framework, as depicted in Figure 1, is designed to support both event-wise and year-wise analyses, with the additional capability of facilitating lifelong learning.

Figure 1: Data Collection Framework. The framework uses MBFC for bias labeling and ER for headline retrieval.

For data collection, we begin by defining two sets: a set of significant global events, E = {e_1, e_2, ..., e_n}, and a set of years, Y = {y_1, y_2, ..., y_m}, where n and m represent the total number of events and years, respectively. We then use the Media Bias/Fact Check (MBFC) platform to select media outlets, O = {o_1, o_2, ..., o_p}, and determine their respective political bias, with p as the total number of outlets. To maintain data reliability, we exclude outlets labeled as questionable and assign each remaining outlet o_i ∈ O a bias label b_i ∈ B, where B = {b_1, b_2, ..., b_q} represents the set of bias labels, with q the number of distinct bias labels.

Next, we define a temporal query Q_t to extract article headlines H = {h_1, h_2, ..., h_r}, where r represents the total number of headlines retrieved from the Event Registry (ER). The query Q_t is formulated as:

    Q_t = {Q_e, Q_o, Q_cat, Q_d}    (1)

where Q_e, Q_o, and Q_cat specify the event, the media outlet, and the news categories (limited to those classified as 'news' by ER: Q_cat = {'politics', 'business', 'sports', 'arts and entertainment', 'science', 'technology', 'health', 'environment'}), respectively. The time constraint is represented as Q_d = [Q_sd, Q_ed], where Q_sd and Q_ed denote the start and end dates. To scrape all the article headlines H, we utilize Q_t to query ER.

We then associate the extracted headlines H with the corresponding bias labels in B and structure the dataset according to two classification types: event-wise and year-wise. To organize the data, we define an event-based order O_event and a year-based order O_year as follows:

    O_event = {e_1 → e_2 → ... → e_n}    (2)

    O_year = {y_1 → y_2 → ... → y_m}    (3)

For lifelong learning, we designed the dataset with a flexible framework that allows for the seamless integration of new events and years as they emerge, denoted by E′ ⊆ E and Y′ ⊆ Y, where E′ and Y′ represent the sets of newly added events and years. This structured approach ensures scalability for continuous learning without requiring major restructuring and supports the training of adaptive models capable of integrating new information effectively. Unlike standard multi-year datasets, our dataset includes annotations that facilitate contextual understanding, enabling models to learn from historical data while adapting to evolving trends and patterns in news reporting. This ensures that the models remain relevant as new information becomes available.

Finally, we split the dataset into training and test sets using a stratified sampling approach to ensure the preservation of bias label distributions across both events and years. We perform this step as it is critical for maintaining the integrity of the model training process in a lifelong learning context.
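To make the notation concrete, the sketch below mirrors Eqs. (1)–(3) as plain Python structures. The dictionary layout and the `outlets` list are illustrative assumptions, since the actual retrieval goes through Event Registry's own Python API with its own parameter names.

```python
# Plain-Python mirror of the collection framework in Section 3.2.
NEWS_CATEGORIES = ["politics", "business", "sports", "arts and entertainment",
                   "science", "technology", "health", "environment"]

def build_query(event, outlet, start_date, end_date):
    """Temporal query Q_t = {Q_e, Q_o, Q_cat, Q_d} from Eq. (1)."""
    return {"Q_e": event,                   # event of interest
            "Q_o": outlet,                  # MBFC bias-labelled outlet
            "Q_cat": NEWS_CATEGORIES,       # categories classified as 'news' by ER
            "Q_d": (start_date, end_date)}  # time constraint [Q_sd, Q_ed]

# Event- and year-based orders O_event and O_year, Eqs. (2)-(3); new events
# and years (E', Y') can simply be appended for lifelong learning.
O_EVENT = ["brexit", "covid", "election", "ukr-rus-war"]
O_YEAR = [2019, 2020, 2021, 2022]

# `outlets` is the MBFC-filtered outlet list O (assumed prepared elsewhere).
queries = [build_query(event, outlet, f"{year}-01-01", f"{year}-12-31")
           for event in O_EVENT for year in O_YEAR for outlet in outlets]
```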
3.3 Data Synopsis and Structure

In this section, we present an overview of the data and explain how it is systematically organized, making it easier to understand both the content and format of our dataset.

3.3.1 Data Synopsis. The dataset features 356,060 headlines on four major events from 2019 to 2022: Brexit, COVID-19, the election, and the Ukraine-Russia war. These headlines, sourced from 45 unique news outlets in 17 different languages, are annotated with 3 political bias labels (Left Centre, Least Biased, and Right Centre), covering diverse topics such as politics, business, arts and entertainment, sports, science, technology, health, and environment. The dataset is structured into 7 distinct columns within .csv files. Table 1 presents a comprehensive summary of the dataset statistics.
3.3 Data Synopsis and Structure

In this section, we present an overview of the data and explain how it is systematically organized, making it easier to understand both the content and format of our dataset.

3.3.1 Data Synopsis. The dataset features 356,060 headlines on four major events from 2019 to 2022: Brexit, COVID-19, the election, and the Ukraine-Russia war. These headlines, sourced from 45 unique news outlets in 17 different languages, are annotated with 3 political bias labels (Left Centre, Least Biased, and Right Centre) and cover diverse topics such as politics, business, arts and entertainment, sports, science, technology, health, and environment. The dataset is structured into 7 distinct columns within .csv files. Table 1 presents a comprehensive summary of the dataset statistics.

Figure 1: Data Collection Framework. The framework uses MBFC for bias labeling and ER for headline retrieval.

Table 1: Summary of Dataset Statistics.

Language-wise Distribution:
Catalan 882; Croatian 13,929; Czech 1,876; Danish 4,330; Dutch 10,905; Finnish 1,512; French 85,007; Hungarian 105; Italian 48,450; Romanian 17,038; Russian 10,511; Slovak 5,642; Spanish 83,940; Swedish 6,441; Ukrainian 10,616

Event-wise Distribution:
Brexit 32,286; COVID 309,329; Election 3,829; Ukraine 10,616

Year-wise Distribution:
2019 20,664; 2020 258,871; 2021 4,638; 2022 71,887

3.3.2 Directory Structure. The dataset is organized in a main 'data' directory with subdirectories categorized by events ('brexit', 'covid', 'election', 'ukr-rus-war') and years (2019-2022). Additional subdirectories consolidate data across all events (ordered_events) and all years (ordered_years). Each subdirectory contains .csv files for training and testing, structured across the following columns:
• news outlet: The name of the news outlet.
• article_ID: A unique identifier for the raw news article in the Event Registry platform from which the headlines are extracted.
• language: The source language of the published news article.
• date: The date on which the news was published.
• headline_text: The text of the news headline.
• news_category: The category assigned by Event Registry.
• political_bias: The political bias of the news outlet as provided by the bias rating portal Media Bias/Fact Check.

The dataset is annotated with bias labels: Left Centre (LC), Least Biased (LB), and Right Centre (RC). To ensure model robustness across varying data distributions, we concatenate and shuffle the files for each event and year in four distinct random orders. This prevents overfitting to specific sequences and helps evaluate generalization across diverse configurations. While chronological order is ideal for practical use, this randomized approach tests broader performance, with the original event and year splits provided for user flexibility.

Event-wise Ordering:
(1) brexit → covid → election → ukr-rus-war
(2) election → covid → ukr-rus-war → brexit
(3) brexit → ukr-rus-war → election → covid
(4) covid → brexit → ukr-rus-war → election

Year-wise Ordering:
(1) 2019 → 2020 → 2021 → 2022
(2) 2021 → 2020 → 2022 → 2019
(3) 2019 → 2022 → 2021 → 2020
(4) 2020 → 2019 → 2022 → 2021

The dataset captures the distribution of headlines related to various events over the years, reflecting the temporal dynamics of news coverage and the evolving reporting on these events. The differences in coverage levels reveal important patterns in media attention, which are essential for developing datasets that support lifelong learning models.
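Under the directory layout of Section 3.3.2, a continual-learning task sequence can be assembled directly from the per-event subdirectories. The sketch below assumes hypothetical train/test file-name patterns; only the directory names and the ordering come from the description above.

    import glob
    import pandas as pd

    EVENT_ORDER = ["brexit", "covid", "election", "ukr-rus-war"]  # event-wise ordering (1)

    tasks = []
    for event in EVENT_ORDER:
        # File-name patterns are assumptions; adjust to the released layout.
        train = pd.concat(pd.read_csv(f) for f in sorted(glob.glob(f"data/{event}/*train*.csv")))
        test = pd.concat(pd.read_csv(f) for f in sorted(glob.glob(f"data/{event}/*test*.csv")))
        tasks.append((event, train, test))

    # `tasks` can now be replayed sequentially by a lifelong-learning trainer.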
4 Potential Use-Cases

The dataset introduced in this study has a wide range of potential use-cases, particularly in the fields of natural language processing and media studies. It is especially valuable for research and applications that require understanding and predicting news bias in a continual, multilingual environment. Below we list some potential use cases:

• Lifelong learning for news bias prediction: Our dataset is ideal for developing and testing lifelong learning models. It allows models to adapt to new events and evolving entities. With its year-wise structure from 2019 to 2022, the dataset addresses the challenges of emerging events and domain shifts (e.g., Brexit, COVID-19, Ukraine-Russia War), providing the data needed to develop and evaluate robust models.

• Domain Adaptation in Multilingual Contexts: Our dataset enables researchers to investigate domain adaptation techniques in a multilingual context, featuring headlines in 17 languages. This facilitates the development of models that generalize across languages and adapt to various cultural and political contexts, ensuring accurate bias prediction. It addresses the challenges faced by generic models in the news domain, which often struggle with topic and language diversity.

• Sparse Experience Replay for Continual Learning: Our dataset is particularly well-suited for the news domain, supporting efficient experience replay by allowing the selection of specific topics and categories. With its event-wise and year-wise classifications, our dataset enhances memory utilization, improves generalization, reduces catastrophic forgetting, and ensures that models remain accurate and up-to-date in real-time applications.

In a nutshell, our dataset serves as a valuable resource for advancing news bias prediction, particularly in the context of lifelong learning, by providing a flexible framework for integrating new events and years. Unlike many news-based datasets with timestamps, it offers structured annotations and contextual information that enhance the understanding of evolving trends in news coverage, making it particularly suitable for lifelong learning applications. It supports a range of research activities, from model development and evaluation to the exploration of new techniques for handling dynamic and multilingual news environments.

5 Limitations

Several limitations are associated with the dataset presented in this article and should be carefully considered in any further research or analysis:

• Data Collection Issues: The dataset was gathered using Media Bias/Fact Check (MBFC) and the paid version of Event Registry (ER). MBFC is publicly accessible, while ER provided comprehensive but limited coverage, potentially missing relevant articles. The use of ER's paid version also restricted the extent of data collection.

• Sample Size: The dataset is constrained by its focus on four major events over a span of four years. This limited number of events and time frame may not fully capture the broader spectrum of news and media biases, affecting the diversity of the samples.

• Biases: Selection bias is a significant factor, as only news outlets labelled by Media Bias/Fact Check were included. This restriction may limit the number of languages and perspectives represented in the dataset, thereby influencing the overall analysis.

• Contextual Factors: The dataset is limited by its temporal scope, covering only four specific events over four years. While it reflects the dynamic nature of news media, it does not account for all future events and years to come.
6 Conclusions

In this study, we present LLNewsBias, a comprehensive dataset designed to tackle the challenges of detecting and analyzing political bias in multilingual news headlines. By spanning four major global events from 2019 to 2022 across 17 languages, this dataset provides a valuable resource for research in natural language processing and media studies. Our framework supports both event-wise and year-wise analysis, emphasizing lifelong learning and enabling models to adapt continuously to new data. The dataset's potential use cases include enhancing bias prediction models, facilitating domain adaptation in multilingual contexts, and improving model robustness. While LLNewsBias offers significant contributions, we also acknowledge limitations such as potential biases in data collection, sample size constraints, and contextual factors. Addressing these challenges in future work will be crucial for maximizing the dataset's impact, ultimately contributing to fairer and more balanced journalism.

7 Acknowledgments

This work was supported by the Slovenian Research Agency and national grants (CRP V2-2272; V5-2264; CRP V2-2146), and by the European Union through the enrichMyData EU HORIZON-IA project under grant agreement No 101070284 and the ELIAS HORIZON-RIA project under grant agreement No 101120237.

References

[1] Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, and Jens Lehmann. 2020. MLM: A benchmark dataset for multitask learning with multiple languages and modalities. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2967–2974.
[2] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: Learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107–110.
[3] Kalev Leetaru and Philip A. Schrodt. 2013. GDELT: Global data on events, location, and tone, 1979–2012. In ISA Annual Convention, Vol. 2, 1–49.
[4] Swati Swati, Adrian Mladenić Grobelnik, Dunja Mladenić, and Marko Grobelnik. 2023. A commonsense-infused language-agnostic learning framework for enhancing prediction of political bias in multilingual news headlines. Knowledge-Based Systems, 277, 110838.
[5] Swati Swati, Dunja Mladenić, and Tomaž Erjavec. 2021. EveOut: An event-centric news dataset to analyze an outlet's event selection patterns. Informatica, 45, 7.
[6] Swati Swati, Dunja Mladenić, and Marko Grobelnik. 2023. An inferential commonsense-driven framework for predicting political bias in news headlines. IEEE Access.

Creating Local World Models using LLMs

Mark David Longar, Erik Novak, Marko Grobelnik
Jožef Stefan Institute, Ljubljana, Slovenia
https://doi.org/10.70314/is.2024.sikdd.22

Abstract

A key limitation of state-of-the-art large language models is their lack of a consistent world model, which hinders their ability to perform unseen multi-hop reasoning tasks. This paper addresses this by extracting local world models from text into a systematic first-order logic framework, enabling structured reasoning. Focusing on the educational domain, we present a multi-step approach using Prolog to represent and reason with these models. Our method involves segmenting educational texts, generating Prolog definitions, and merging them into a comprehensive knowledge graph. We successfully extracted several small models and manually verified their accuracy, demonstrating the potential of this approach. While promising, our results are currently limited to small-scale models.

Keywords

Large language models, local world models, knowledge representation, educational technology, structured reasoning, knowledge graphs

1 Introduction

In recent years, Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP), offering unprecedented capabilities in understanding, reasoning over, and generating human-like text.
Despite their impressive performance across various language tasks, a significant limitation persists – the absence of a consistent and coherent world model within these systems [8]. This limitation hampers their ability to perform advanced reasoning tasks that require not only textual understanding but also logical consistency and structured knowledge representation.

While current LLMs are powerful, they are inherently constrained by their reliance on statistical correlations within vast datasets, often resulting in shallow and contextually inconsistent reasoning. To address this limitation, we propose an approach for extracting local world models, i.e., small, context-specific representations of knowledge that capture the relationships and rules governing a particular domain or scenario. The approach is multi-step. First, the input text is segmented into manageable parts. Each segment is analyzed to extract key concepts and their interrelationships, which are then represented as Prolog definitions. Then, the definitions are merged into a comprehensive knowledge graph that reflects the structure and content of the input text.

We focus specifically on the educational domain, where the ability to generate and utilize local world models could significantly enhance the effectiveness of AI-driven educational tools, e.g., by providing LLMs a framework for responding with logically consistent and pedagogically sound explanations. Moreover, by modifying some of the components, the approach can also be applied to other domains, such as industry, finance, and law.

The remainder of the paper is as follows: Section 2 presents the related work on LLMs and creating world models. Next, the proposed approach is described in Section 3. The experiment setting is presented in Section 4, followed by the experiment results in Section 5. We discuss the results in Section 6 and conclude the paper in Section 7.

2 Related Work

The recent surge in large language models, such as GPT-3 [3] and GPT-4 [1], has significantly advanced natural language processing, showing emergent reasoning abilities across various tasks. However, despite their impressive performance, LLMs are often criticized for lacking factual consistency, interpretability, and logical coherence, especially in complex, multi-hop reasoning tasks [8]. To address these shortcomings, efforts have been made to integrate LLMs with structured knowledge frameworks, like knowledge graphs (KGs) and ontologies, to enhance reasoning and knowledge flow between structured data and language models [9].

In the field of ontology and KG development, early initiatives like Cyc [6] laid the groundwork for large-scale structured knowledge representation. More recent efforts [8, 5] have explored using LLMs to assist in ontology generation and KG construction. While LLMs can automate parts of the ontology development process, they struggle with ensuring logical consistency and managing complex domain-specific knowledge [5, 2]. Complementary approaches, like using LLMs for ontology learning [2] and structured knowledge extraction [10], highlight the need for human validation and formal methods to ensure accuracy.

Our work builds on these insights by focusing on using LLMs to extract structured local world models in the form of Prolog-based representations. This approach addresses the limitations of LLMs in handling complex reasoning and provides a more robust, logically consistent framework for educational applications.

3 Methodology

This section introduces the approach for creating local world models by generating and utilizing structured data in Prolog. The methodology is designed to systematically identify and map the concepts and their interrelationships within a given educational document, such as a textbook, facilitating the generation of a knowledge graph.

3.1 Document segmentation

To manage the document's complexity and ensure accurate concept extraction, the source material was divided into several shorter parts, each up to 10 pages long. This segmentation was crucial in allowing us to focus on smaller, more manageable sections of the content, enabling a thorough analysis and avoiding problems that come with long-context LLM outputs. The length of each part was determined based on the natural divisions within the text, such as chapters or major sections, to maintain the coherence of concepts within each segment.
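As an illustration of this step, the following sketch splits a Markdown chapter on its top-level headings and greedily packs consecutive divisions into segments under a size cap. The character limit is an assumption standing in for the ~10-page bound used in the paper.

    import re

    MAX_CHARS = 30_000  # assumed proxy for the ~10-page segment limit

    def segment_document(markdown_text: str) -> list[str]:
        """Split on natural divisions (# / ## headings), then pack
        consecutive divisions into segments below the size cap."""
        parts = re.split(r"\n(?=#{1,2} )", markdown_text)
        segments, current = [], ""
        for part in parts:
            if current and len(current) + len(part) > MAX_CHARS:
                segments.append(current)
                current = part
            else:
                current = f"{current}\n{part}" if current else part
        if current:
            segments.append(current)
        return segments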
3.2 Generating Prolog definitions

For each segmented part, we created a prompt to generate Prolog definitions of the concepts and their relationships. The prompt was carefully crafted to guide the extraction of educational content in a structured format. It consisted of three main components: the context, the predicates, and the structured output.

Context. A description of the educational context and a brief narrative to position the content within a learning scenario. This helped to align the LLM-extracted concepts and relationships with our downstream tasks. The following is an example of the prompt used:

"You are a teacher and an expert in natural language processing (NLP). You wrote a chapter in an NLP textbook and would like to convert the content of the chapter into a classroom lesson. You would like to step into the shoes of a student in order to understand their learning process of this material. You need to understand which concepts are being taught and their relationships."

Predicates. A list of predicates and their descriptions, which were essential for identifying concepts (isConcept(A)), prerequisites (isPrerequisiteOf(A, B)), and sections (isSection(S)). These predicates were used to simulate the learning process, where concepts are linked to sections. A concept may have prerequisite concepts or sections that must be understood before a student can advance to learning the concept.

Structured output. Clear instructions to output the extracted predicates in the form of a Prolog program. The LLM responding in a structured format is a crucial part of our approach, as it has been shown that structured responses can improve LLM reasoning and generation quality [13].

In summary, this prompt allowed us to extract detailed summaries of the concepts taught and their relationships, which were then represented in Prolog. Each segment was processed independently to generate a corresponding Prolog program.

3.3 Merging Prolog definitions

After generating the Prolog definitions for each segment, the next step was to merge them into a single cohesive program. To achieve this, we created a prompt which was nearly identical to the first, but with instructions to combine the disjoint parts into one integrated Prolog program added to the end of the prompt:

"Now you need to combine the parts into a single Prolog program. Make sure to include all the concepts and relationships, but also properly connect them. Merge concepts from different sections where necessary and make sure to include all the sections and their relationships."

3.4 Use of the knowledge graph

The generated knowledge graph, represented by the Prolog program, was then used to recommend the next steps in the learning process. Using the structured output, we created a detailed concept map that helped identify key learning paths and prerequisites. Prolog (specifically SWI-Prolog [11]) was chosen for this task because it can handle structured data, is widely used (increasing the likelihood that LLMs have encountered it during training), and can be executed and analyzed immediately.
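While our merging is performed by the LLM itself, the generated programs can also be processed mechanically. The sketch below is a simplified stand-in rather than the paper's procedure: it parses the three predicates from Section 3.2 out of several generated programs, takes their set union, and derives the "next step" recommendation of Section 3.4 as the set of concepts whose prerequisites are already known.

    import re
    from collections import defaultdict

    FACT = re.compile(r"(isConcept|isSection|isPrerequisiteOf)\(([^)]*)\)\s*\.")

    def parse_program(prolog_src: str):
        """Collect concept, section, and prerequisite facts from one program."""
        concepts, sections, prereqs = set(), set(), set()
        for pred, args in FACT.findall(prolog_src):
            parts = tuple(a.strip() for a in args.split(","))
            if pred == "isConcept":
                concepts.add(parts[0])
            elif pred == "isSection":
                sections.add(parts[0])
            else:  # isPrerequisiteOf(A, B): A must be understood before B
                prereqs.add(parts)
        return concepts, sections, prereqs

    def merge_programs(programs):
        """Set-union merge of per-segment extractions into one graph."""
        concepts, sections, prereqs = set(), set(), set()
        for src in programs:
            c, s, p = parse_program(src)
            concepts |= c; sections |= s; prereqs |= p
        return concepts, sections, prereqs

    def learnable_next(concepts, prereqs, known):
        """Concepts whose prerequisites are all already known."""
        needs = defaultdict(set)
        for a, b in prereqs:
            needs[b].add(a)
        return {c for c in concepts - known if needs[c] <= known}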
4 Experiment Setting

This section outlines the experiment setting for evaluating our approach to extracting local world models from educational texts and generating structured Prolog representations. We describe the data sources, the large language model used, and the evaluation framework.

4.1 Data sources

We evaluated our approach on two widely used textbooks in deep learning and natural language processing. These texts were chosen because they are relevant to both structured reasoning tasks and the representation of complex, multi-step concepts. The following chapters were selected for analysis:

Deep Learning Preliminaries from the book Dive into Deep Learning [12]. This chapter provides foundational knowledge of deep learning, covering key concepts such as linear algebra, calculus, and probability, which are essential for understanding the field. The textbook's teaching approach is highly hands-on, with a significant portion devoted to code. It is open-sourced, and we used the Markdown files provided on their GitHub page (https://github.com/d2l-ai/d2l-en).

Chapter 2: Regular Expressions, Tokenization, and Edit Distance from Speech and Language Processing [4]. This chapter introduces basic NLP techniques, focusing on regular expressions and tokenization, which are pivotal in text preprocessing tasks.

4.2 Used large language model

We employed GPT-4o via the ChatGPT interface to extract concepts and their interrelationships. We leveraged the model's multimodal capabilities, allowing it to process text and PDF documents.

4.3 Evaluation Framework

We developed an evaluation framework to assess the performance of our approach based on three primary aspects: accuracy, completeness, and consistency. To validate the results, we manually reviewed the extracted knowledge graphs and compared them with the source texts. We ensured that the extracted concepts were accurate, complete, and logically consistent.

Assessment Criteria. The following criteria were used to evaluate the effectiveness of our approach:
• Accuracy. This aspect examines how accurately the approach extracted the concepts and their relationships from the text. We evaluated the correctness of each Prolog definition against the source material.
• Completeness. This evaluates whether the system captured all the key concepts from the educational material. The assessment ensured that no significant concepts or relationships were omitted during extraction.
• Consistency. This aspect assesses the extent to which the extracted models maintained logical coherence across different segments of the text. This was crucial in determining whether the segmented Prolog definitions could be merged into a cohesive KG (one such automated check is sketched below).
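Although our consistency review was manual, part of it lends itself to automation. One hypothetical check, building on the `prereqs` set from the earlier sketch, verifies that the merged prerequisite relation is acyclic, since a cyclic prerequisite chain would make the simulated learning process impossible to order:

    from collections import defaultdict

    def find_prerequisite_cycle(prereqs):
        """Return one cycle in the prerequisite relation, or None if acyclic."""
        graph = defaultdict(set)
        for a, b in prereqs:
            graph[a].add(b)
        visiting, done = set(), set()

        def dfs(node, path):
            visiting.add(node)
            path.append(node)
            for nxt in graph[node]:
                if nxt in visiting:          # back edge: cycle found
                    return path[path.index(nxt):] + [nxt]
                if nxt not in done:
                    cycle = dfs(nxt, path)
                    if cycle:
                        return cycle
            visiting.discard(node)
            done.add(node)
            path.pop()
            return None

        for start in list(graph):
            if start not in done:
                cycle = dfs(start, [])
                if cycle:
                    return cycle
        return None  # acyclic: the KG admits a valid learning order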
5 Results

In this section, we review the knowledge graphs of the two tested texts generated by our model.

5.1 Dive into Deep Learning

The selected chapter covered six sub-chapters in the following order: Data Manipulation, Data Preprocessing, Linear Algebra, Calculus, Automatic Differentiation, and Probability and Statistics. The results are represented by the graph in Figure 1.

Figure 1: Knowledge graph of the Preliminaries section from Dive into Deep Learning. Extracted nodes include Deep Learning Prerequisites, Linear Algebra Basics, Calculus Basics, Probability Basics, Tensor Operations, Matrix Multiplication, Gradient Descent, Chain Rule, Statistics Basics, Data Preprocessing, Broadcasting Techniques, Optimization Techniques, Backpropagation, Stochastic Models, Automatic Differentiation, and Loss Function Optimization.

The system accurately identified three major independent branches of the chapter – Linear Algebra, Calculus, and Probability and Statistics – which reflects the structure of the source material. The extracted knowledge graph also logically restructured the content in ways that differed from the original organization but made sense pedagogically. This restructuring highlights the logical flow of how data handling techniques naturally feed into more abstract mathematical concepts, despite differing from the original structure.

However, some omissions and reassignments were noted, particularly within the Linear Algebra section. Concepts such as vectors and matrices were omitted, likely due to the high-level nature of the extraction process. Additionally, matrix multiplication, though identified, was separated from Linear Algebra basics and Tensor operations. This disjunction represents a slight deviation from the expected conceptual hierarchy.

Similarly, in the Calculus section, the extracted model restructured the sequence of topics. This restructuring captured the relationship between fundamental calculus concepts and their practical applications in machine learning. Furthermore, the system included concepts like Gradient Descent and Backpropagation, which were only briefly mentioned in the source material.

5.2 Speech and Language Processing

The Regular Expressions section, seen in Figure 2, was extracted accurately, capturing the core concepts effectively.

Figure 2: Knowledge graph of the Regular Expressions section from Speech and Language Processing. Extracted nodes include Regular Expressions, Concatenation, Square Brackets, Kleene Star, Period, Anchors, Disjunction, Precedence, Word Boundary, Substitution, Question Mark, Kleene Plus, Parenthesis, Greedy Matching, Capture Group, Non-Greedy Matching, and Lookahead Assertion.

However, a noticeable limitation was the loss of the original sequencing of the concepts presented in the textbook. While the key ideas were identified, the pedagogical flow, which is essential for gradual learning, was somewhat disrupted in the extraction process.

For the other sections, including Tokenization and Edit Distance, the model extracted only the most prominent concepts, omitting many important details. As a result, these sections are less comprehensive than they need to be for in-depth understanding. Despite this, the overall connections between sections in the knowledge graph were logically structured, showing that the system was still able to create a coherent representation of the material at a high level.

It is important to note that this textbook is significantly more information-dense and longer compared to the Dive into Deep Learning book. This added complexity exposed some limitations in the current approach, mainly when dealing with texts that require detailed extraction of concepts and their interrelationships. The model's ability to handle such dense material is limited by its tendency to focus on top-level ideas while losing much of the depth and sequencing provided in the source text. Additionally, there were rare occasions where the output required manual interventions to fix inconsistent formatting of the Prolog variable names.
6 Discussion

Our approach to extracting local world models from educational texts demonstrated strong performance in generating logically coherent knowledge graphs from high-level concepts, but certain limitations were identified. The synthetic data generation effectively captured core concepts from both textbooks, particularly in structuring major branches such as Linear Algebra, Calculus, and Probability from Dive into Deep Learning. However, some restructured sections, while logical, differed significantly from the source material's flow.

In the Speech and Language Processing textbook, the Regular Expressions subsection was extracted with sufficient accuracy. Other sections, such as Tokenization and Edit Distance, suffered from detail omissions, where only top-level concepts were extracted. This issue was more prominent due to the higher information density of the NLP textbook, exposing limitations in handling detailed, densely packed content.

Regarding the evaluation framework, the model generally performed well on metrics like accuracy and consistency but struggled with completeness in more detailed sections. The model's tendency to restructure content logically, though sometimes deviating from the original, suggests that while it captures core relationships, further refinements are needed to preserve pedagogical flow and details.

6.1 Potential improvements

To address the limitations, improving the prompt engineering could lead to more detailed extractions while maintaining the structure of the source material. Additionally, enhancing the model's ability to handle complex, dense information would mitigate the loss of key concepts. Future iterations may benefit from automated post-processing checks to ensure logical consistency and reduce manual interventions. Overall, while the approach shows promise, refining it to handle finer details and complex sequences more effectively will be essential for broader applications.

7 Conclusion and Future work

In this paper, we proposed a novel approach to extracting local world models from educational texts by generating structured Prolog representations. Our methodology demonstrated the ability to capture core concepts and their interrelationships in a logical and coherent manner, especially in the Dive into Deep Learning textbook. However, the results from the more information-dense Speech and Language Processing text revealed limitations, particularly in handling detailed content and large knowledge graphs, as well as in preserving pedagogical flow.

The use of Prolog proved effective in organizing educational material, allowing for structured reasoning and enabling applications in AI-driven educational tools. Despite these successes, certain challenges remain, such as the omission of detailed concepts and the system's occasional tendency to deviate from the original sequence of topics.

Future work will address these limitations by improving the prompt engineering and enhancing the system's ability to handle complex, information-dense material. Additionally, we plan to explore automating the segmentation process and scaling up the model to generate larger, more intricate knowledge graphs. Other potential directions include integrating retrieval-augmented generation [7] to enrich knowledge extraction and comparing generated world models across different texts to evaluate their pedagogical alignment. Self-evaluation and correction mechanisms could also be introduced to improve accuracy and completeness.
Acknowledgments

This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 project Humane AI Net (Grant No. 952026).

References

[1] Josh Achiam et al. "GPT-4 Technical Report". In: arXiv preprint arXiv:2303.08774 (2023).
[2] Hamed Babaei Giglou, Jennifer D'Souza, and Sören Auer. "LLMs4OL: Large language models for ontology learning". In: International Semantic Web Conference. Springer, 2023, pp. 408–427.
[3] Tom Brown et al. "Language Models are Few-Shot Learners". In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901.
[4] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. Online manuscript released August 20, 2024. URL: https://web.stanford.edu/~jurafsky/slp3/.
[5] Vamsi Krishna Kommineni, Birgitta König-Ries, and Sheeba Samuel. "From human experts to machines: An LLM supported approach to ontology and knowledge graph construction". In: arXiv preprint arXiv:2403.08345 (2024).
[6] Douglas B. Lenat. "CYC: A large-scale investment in knowledge infrastructure". In: Communications of the ACM 38.11 (1995), pp. 33–38.
[7] Patrick Lewis et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks". In: Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474.
[8] Fabian Neuhaus. "Ontologies in the era of large language models – a perspective". In: Applied Ontology 18.4 (2023), pp. 399–407.
[9] Shirui Pan et al. "Unifying large language models and knowledge graphs: A roadmap". In: IEEE Transactions on Knowledge and Data Engineering (2024).
[10] Mohammad Javad Saeedizade and Eva Blomqvist. "Navigating Ontology Development with Large Language Models". In: European Semantic Web Conference. Springer, 2024, pp. 143–161.
[11] Jan Wielemaker et al. "SWI-Prolog". In: Theory and Practice of Logic Programming 12.1-2 (2012), pp. 67–96.
[12] Aston Zhang et al. Dive into Deep Learning. https://D2L.ai. Cambridge University Press, 2023.
[13] Pei Zhou et al. "How FaR Are Large Language Models From Agents with Theory-of-Mind?" In: arXiv preprint arXiv:2310.03051 (2023).

Semantic video content search and recommendation

Mark David Longar (Jožef Stefan Institute, Ljubljana, Slovenia), Jakob Fir (University of Ljubljana, Ljubljana, Slovenia), Bor Pangeršič (University of Ljubljana, Ljubljana, Slovenia)
All authors have contributed equally.
https://doi.org/10.70314/is.2024.sikdd.10

Abstract

The rapid growth of video streaming platforms has intensified the demand for personalized content recommendations. However, current solutions often rely on historical user data, leading to challenges like the cold start problem and overlooking users' immediate preferences. We present a conversational recommendation system that leverages large language models (LLMs) to generate keyword-based content and query descriptions. By integrating Retrieval-Augmented Generation (RAG), our system efficiently retrieves relevant content, independent of prior user interactions, and ensures consistent performance across languages. Preliminary testing shows our system outperforms the RAG baseline by up to 24% on less descriptive queries and demonstrates consistent performance across three languages. While the results are promising, further evaluation focusing on user interaction and satisfaction is necessary. Our approach can potentially be extended to other recommendation systems, offering broader applicability and enhanced content personalization.

Keywords

large language models, recommendation system, search system, retrieval augmented generation
1 Introduction

The surge in video streaming platforms has accelerated the demand for personalized content recommendations. As these platforms expand their libraries and user bases, the challenge of delivering precise, user-specific recommendations intensifies. In this dynamic environment, streaming services must quickly adapt to provide accurate recommendations, which are crucial for maintaining user engagement and ensuring satisfaction.

Existing recommendation systems primarily rely on historical user interaction data, such as viewing history and ratings. This dependence leads to significant challenges, such as the cold start problem, where new users or newly added content lack sufficient data for accurate recommendations. Additionally, these systems often fail to account for users' immediate preferences, which can change dynamically due to various factors such as mood, viewing context (e.g., watching alone or with a group), or recent events in the user's life. This gap highlights the need for more adaptive and responsive recommendation mechanisms.

Recent advancements in Large Language Models (LLMs) present an opportunity to address these limitations. LLMs offer significant potential due to their emergent reasoning abilities, their capacity to extract high-quality representations of textual features, and their ability to leverage the vast external knowledge encoded within them [10], [7]. By harnessing LLMs, it is possible to create a recommendation system that interacts with users to capture their immediate preferences, thereby overcoming the cold start problem and enhancing the relevance of recommendations. Additionally, ensuring consistency in the quality of recommendations across different languages is increasingly important as many streaming services operate globally.

Our approach utilizes LLMs to generate keyword descriptions for both content and user queries. These keywords serve as the basis for recommendations, with a Retrieval-Augmented Generation (RAG) [6] model efficiently retrieving relevant content. By crafting query keywords using LLMs, the system adapts to user preferences in real time, providing relevant and language-consistent recommendations.

This paper makes the following contributions: (1) Development of a Keyword-Based Recommendation System: we introduce a novel approach that utilizes LLMs to generate keyword-based descriptions for content and user queries, enabling more personalized and adaptive recommendations. (2) Exploration of Two User Interaction Models: we propose and evaluate two distinct interfaces for user interaction, a conversational chat-based model and a structured question-answering model, where the system refines recommendations through a series of targeted yes/no questions generated by the LLM. (3) Comprehensive Evaluation Strategy: we outline a detailed plan for evaluating the system's performance in a production environment, focusing on its ability to deliver consistent, high-quality recommendations across different languages and user contexts.
2 Related Work

Recommender systems have progressed from techniques such as collaborative filtering and matrix factorization to more complex models that incorporate deep learning. The advent of large language models (LLMs) has enabled innovative methods for interacting with these systems [11], particularly when combined with retrieval techniques [9]. One of the most promising advancements in this area is the use of Retrieval-Augmented Generation (RAG) models, which integrate the powerful text generation capabilities of LLMs with retrieval-based methods to improve recommendation accuracy and relevance [6].

Recent advancements in conversational recommender systems have focused primarily on integrating LLMs with traditional recommender systems or fine-tuning LLMs using user-item interaction data [9], [10], e.g., [8], [4], and [5]. These approaches, while effective, often rely heavily on historical user data, leading to challenges such as the cold start problem. This reliance underscores the need for novel methods that reduce dependency on past interactions and leverage real-time retrieval mechanisms to enhance content recommendations [2].

To address these challenges, recent work by Di Palma et al. (2023) [2] introduced a Retrieval-Augmented Recommender System, which combines the strengths of LLMs and retrieval-based methods. Their approach employs LLMs both at the conversational layer and in the backend retrieval process, thereby improving recommendation relevance, particularly in scenarios with sparse data or new users. Their experimental results demonstrated that this RAG-based framework performs comparably to state-of-the-art systems, even in zero-shot scenarios, underscoring the potential of such an approach to mitigate the cold start and hallucination problems inherent in LLMs.

Our approach builds on the strengths of RAG-based models by introducing a keyword-based recommendation system that operates within a RAG framework. This system ensures consistent performance across multiple languages and adapts to real-time user preferences without relying on historical user data.

3 Data

The data used in this study was provided by our partner United Cloud, who operate a multinational streaming service in the Balkan region, EON TV (no EON user data was used). The EON platform encompasses a variety of content, such as video-on-demand (VOD) movies and TV shows, as well as live TV channels. We focused exclusively on VOD movie data, although our approach is capable of accommodating multiple content types.

The VOD movies data set comprises nearly 5000 movies in various languages. Each movie is accompanied by a brief description averaging around 460 characters (5-6 sentences) in multiple languages. In cases where multiple translations were available, we opted for the original language of the movie; otherwise, we chose the first available translation.

4 Methodology

4.1 Recommendation Mechanism

The core of our recommendation system is the generation of textual representations of content. Instead of using movie descriptions directly, we employ the LLM to generate a set of English keywords and related movies. This approach prevents the model from overemphasizing less relevant details, such as specific plot points, that may not be central to the user's query. User queries follow a similar approach, where the LLM generates a set of relevant keywords, as well as any possibly relevant movies.
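A minimal sketch of this keyword-generation step is shown below, using the OpenAI Python client with the GPT-4o model named in Table 1. The prompt wording is an assumption for illustration; the paper does not publish its exact prompt.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = (
        "Summarize the following text as 10-15 English keywords capturing "
        "genre, themes, mood, and setting, plus up to 3 related movie titles. "
        "Return a single comma-separated list."
    )  # illustrative wording, not the paper's actual prompt

    def describe(text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": PROMPT},
                      {"role": "user", "content": text}],
        )
        return resp.choices[0].message.content

    movie_keywords = describe("A young footballer rises from ...")      # per movie, offline
    query_keywords = describe("soccer movies that will inspire me")     # per query, online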
One of the key advantages of this method is its ability to abstract core concepts from user queries using the LLM, aligning better with the keywords generated from movie descriptions. The LLM-generated keywords from both the movie descriptions and the user queries are designed to encapsulate the essential topics and themes. By aligning the keywords generated from movie descriptions with those derived from user queries, our system enhances the relevance of the recommendations. This alignment is crucial in ensuring that the retrieved movies resonate with the user's expressed interests, even when these interests are not articulated well. Furthermore, the use of in-context learning allows the system to maintain its performance without extensive fine-tuning [3], making it both efficient and effective.

The rest of the recommendation system follows the Retrieval-Augmented Generation (RAG) [6] pipeline (see Figure 1). The RAG pipeline operates by first generating textual representations of movies, which are then embedded into a vector space. These embeddings are stored in a vector database, allowing for efficient similarity searches. When a user submits a query, the system generates a corresponding representation, embeds it into the same vector space, and retrieves the top k most similar movie embeddings from the database. This process ensures that the recommendations are both contextually relevant and semantically aligned with the user's input.

Figure 1: Overview of the Recommendation Pipeline.
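The retrieval half of the pipeline can be sketched as follows, using the text-embedding-3-large model named in Table 1 and plain cosine similarity in place of a dedicated vector database. The keyword strings are placeholders standing in for the output of the previous sketch.

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    # Keyword strings produced per movie by the previous step (placeholders).
    movie_keyword_strings = [
        "inspirational, sports, documentary, football, biography, Lionel Messi",
        "space, survival, drama, astronaut, isolation, rescue mission",
    ]

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
        return np.array([d.embedding for d in resp.data])

    movie_vecs = embed(movie_keyword_strings)
    movie_vecs /= np.linalg.norm(movie_vecs, axis=1, keepdims=True)

    def recommend(query_keywords, k=10):
        q = embed([query_keywords])[0]
        q /= np.linalg.norm(q)
        scores = movie_vecs @ q           # cosine similarity
        return np.argsort(-scores)[:k]    # indices of the top-k movies

In production, the precomputed movie vectors would live in a vector database rather than in memory, but the ranking logic is the same.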
4.2 User Interface

Our proposed user interface designs (see Figure 2) offer two main ways for users to interact with our recommendation core. Besides a direct search, where the user submits a query and receives recommendations in a single step, we propose: (a) a chatbot, which assists users in narrowing down their options through a conversational interface. The chatbot provides recommendations at each response, allowing for a multi-step interaction that refines the search results progressively. (b) An inquisitive method, where an agent asks the user a series of Yes/No questions to narrow down the search. Keywords are generated based on the user's responses, making it particularly useful for users who are uncertain about what they want to watch. This approach shifts the burden of knowing what to query from the user to the system, streamlining the recommendation process.

Each of these designs aims to enhance user engagement and satisfaction by providing tailored interactions that cater to different user preferences and needs.

5 Evaluation

We have developed a twofold approach for evaluating our model.

First, to gauge the effectiveness of our keyword-based approach for recommendation, we curated a small multilingual evaluation dataset to test our core recommendation mechanism. This dataset includes queries in various languages along with their expected recommendations. We compared the performance of our mechanism with a baseline RAG system that directly embedded user queries and movie descriptions.

Second, to assess the efficiency and user satisfaction of our system in real-world situations, we have devised an evaluation plan to test our system in production. This strategy utilizes a structured A/B testing framework to conduct precise comparisons between our semantic recommendation system and conventional search, addressing distinct aspects of user experience and system performance.

5.1 Evaluation dataset

To create our evaluation dataset, we carefully selected 25 movies across multiple languages, including both well-known and lesser-known titles. For each movie, we formulated two types of queries to assess the system's retrieval accuracy: Descriptive and General queries.

The Descriptive queries were designed to simulate scenarios where the user knows exactly what they are looking for. For instance, a query for the movie Messi (2014) might be, "I am looking for inspirational documentaries about famous athletes, such as Lionel Messi and his rise through football." In contrast, the General queries were intended to test situations where the user has only a rough idea of what they want to watch, which is likely more common in real-world environments. An example of a general query for the same movie might be, "soccer movies that will inspire me."

To evaluate the system's performance across different linguistic contexts, we manually translated these queries into English, Serbian, and Slovenian. We then compared the performance of our keyword-based retrieval mechanism against a baseline RAG model that directly used user queries and movie descriptions without generating keywords.
5.2 Experiment Design

We have divided our user base into four distinct groups to facilitate a detailed comparative analysis, aligned with our proposed user interface designs:

Baseline Group: This control group doesn't use our system; instead, users find movies and receive recommendations through traditional recommendation methods, a common practice in the industry.

Direct Semantic Search Group: This control group interacts with a straightforward search interface. Users submit a query and receive recommendations in a single step. This approach provides immediate suggestions based on the user's input, mimicking traditional full-text search practices.

Chatbot Group: Participants in this treatment group use a conversational interface (interface a), where a chatbot assists in narrowing down options. The chatbot provides recommendations at each response, enabling a multi-step interaction that progressively refines the search results. This design enhances engagement by simulating a natural conversation.

Inquisitive Method Group: Users in this group engage with an agent that asks a series of Yes/No questions to narrow down the search (interface b). Keywords are generated based on the user's responses.

The evaluation will be conducted continuously, starting with a focused initial phase over the first month post-implementation to address immediate usability and performance issues, followed by ongoing monitoring to capture long-term user engagement and satisfaction. By implementing this structured evaluation framework, we aim to comprehensively understand the impact and effectiveness of our semantic recommendation system, guiding further refinements and ensuring that the system meets user needs and expectations.

5.2.1 Metrics

We would like to measure how users interact with our system in two main ways. First, we would like to know how engaged and satisfied they are with our recommendations, i.e., whether users find our system frustrating to navigate and whether they watch movies recommended by our system. The second set of metrics aims to capture how different demographics interact with our system, as a major goal is to remove any biases such as language or age.

Engagement and Satisfaction Metrics: These include Click-Through Rate (CTR), which measures the percentage of clicked recommendation links, and Watch Time, which gauges the duration users engage with recommended content. Additionally, immediate user reactions are captured through Like/Dislike Ratios, while more detailed user feedback is collected via surveys administered after interactions.

Behavioral Metrics: We analyze User Interaction Patterns, such as search frequency and refinement actions, and System Usage Frequency to determine how different demographics utilize the system and to identify any potential biases in system engagement. We also record the search time and the number of queries needed for a decision.

6 Results

The outcomes presented in Table 1 showcase the performance of both models across query types and languages, as measured by accuracy at the top 5 and top 10 recommendations.

The results reveal that the baseline model surpasses (or matches) the performance of the keyword mechanism on Descriptive queries, particularly in terms of Accuracy@5. In terms of Accuracy@10, however, the two models demonstrate relatively similar performance. Conversely, the keyword model shows significant performance enhancements for General queries, particularly in Accuracy@10, indicating its capacity to adapt to non-specific content descriptions. Additionally, the keywords model consistently performs well across different languages, whereas the baseline model shows fluctuations of up to 28% across languages.

In summary, the keywords model allows for more general and multilingual queries, while the baseline model excels at retrieving very specific content.

Table 1: Evaluation results on the descriptive and general queries data sets. LLM embeddings were generated using OpenAI's text-embedding-3-large model. The Keywords model used GPT-4o.

                          Accuracy@5              Accuracy@10
                      Keywords  Baseline      Keywords  Baseline
  Descriptive Queries
    English              60%       64%           68%       68%
    Serbian              56%       80%           72%       84%
    Slovenian            56%       80%           72%       84%
  General Queries
    English              44%       28%           68%       44%
    Serbian              44%       52%           68%       52%
    Slovenian            44%       56%           72%       56%
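Under one natural reading of this metric, where each evaluation query has a single expected title, Accuracy@k is the fraction of queries whose expected movie appears among the top-k results. A minimal sketch, assuming a `recommend` function that returns ranked movie IDs (e.g., as in the retrieval sketch above):

    def accuracy_at_k(queries, recommend, k):
        """queries: list of (query_text, expected_movie_id) pairs.
        recommend: callable returning a ranked list of movie IDs."""
        hits = sum(expected in recommend(query)[:k] for query, expected in queries)
        return hits / len(queries)

    # Example: evaluate the general queries in one language.
    # score_at_5 = accuracy_at_k(general_queries_sl, recommend, k=5)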
6.1 User Interface Implementation

We implemented our proposed interface design using Flutter, which guarantees functionality across a variety of devices, including iOS, Android, Windows, and web browsers. This cross-device compatibility is crucial, as it ensures that all users, regardless of their preferred platform, have access to our application. The support for mobile devices is particularly useful in our interrogation design, where users can easily navigate through options by swiping cards left or right.

Additionally, we integrated Tipko [1], a Slovenian transcription service, to facilitate voice-to-text capabilities. This feature enhances user convenience by enabling voice communication with our chatbot, removing the necessity for typing.

Figure 2: Implementations of our (a) Chatbot (left) and (b) Inquisitive (right) user interface designs.

7 Discussion

This report introduces a new content recommendation mechanism and three ways to interact with it. Table 1 demonstrates the success of our keyword retrieval model in understanding general user preferences while still performing well when searching for specific content. Moreover, its consistency across languages and its ability to retrieve content using specific descriptions as well as general themes make it well-suited for a diverse user base.

Additionally, the keyword model allows seamless integration with both the Chatbot and Inquisitive methods. Moreover, our system could be extended to dynamically adjust keyword generation based on user-specific factors such as viewing history, local time, weather, and current mood indicators. This personalization ensures that the recommendations are not only relevant to the content but also tailored to the user's immediate context and preferences.

Our approach has some limitations, including the cost per query, which is higher than for traditional search, although not exorbitant. Furthermore, our model's performance is commendable given our limited knowledge about the movie content, but it relies on the assumption that the language model may have more information about a movie than our dataset. It is worth noting that, in the short term, models appear to be continually improving, becoming faster, more knowledgeable, and more cost-effective. Lastly, as with any chat application that involves user inputs, security is a crucial consideration. While improvements can be made through better prompting and fine-tuning, ongoing monitoring is essential when the system is in production.

8 Future work

In future work, we plan to further explore methods for improving user experience and personalization. Our initial experiments have involved incorporating the user's time, location, and weather to enhance results. Moving forward, we aim to explore additional integrations, such as the user's calendar. We also intend to expand our user interface by introducing new forms of interaction, such as movie trailers and multiple-choice questions.

To overcome the limitations of our movie information, we are interested in delving deeper into the content by analyzing subtitles using a local language model. Additionally, we aim to broaden our database to include other types of content, such as live channel content and special time-limited events like Eurovision, Eurobasket, and the FIFA World Cup.

Finally, we are interested in integrating traditional recommendation models that utilize historical watch data or ratings to re-rank our recommendations.

Acknowledgments

This project was made in collaboration with United.Cloud and In516ht for the 2024 Data Science Competition, organized by the Faculty of Computer and Information Science at the University of Ljubljana. We thank our advisors Slavko Žitnik, Aljaž Košmerlj, Klementina Pirc, and Rebeka Merhar for their contributions.

References

[1] Primož Bratanič. Transkript app | Samodejna transkripcija slovenskega govora [Automatic transcription of Slovenian speech]. May 2024. URL: https://transkript.si/.
[2] Dario Di Palma. "Retrieval-augmented recommender system: Enhancing recommender systems with large language models". In: Proceedings of the 17th ACM Conference on Recommender Systems. 2023, pp. 1369–1373.
[3] Elnara Galimzhanova et al. "Rewriting Conversational Utterances with Instructed Large Language Models". Oct. 2023. DOI: 10.1109/wi-iat59888.2023.00014.
[4] Yunfan Gao et al. "Chat-REC: Towards interactive and explainable LLMs-augmented recommender system". In: arXiv preprint arXiv:2303.14524 (2023).
[5] Xu Huang et al. "Recommender AI agent: Integrating large language models for interactive recommendations". In: arXiv preprint arXiv:2308.16505 (2023).
[6] Patrick Lewis et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks". In: Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474.
[7] Peng Liu, Lemei Zhang, and Jon Atle Gulla. "Pre-train, Prompt, and Recommendation: A Comprehensive Survey of Language Modeling Paradigm Adaptations in Recommender Systems". In: Transactions of the Association for Computational Linguistics 11 (2023), pp. 1553–1571.
[8] Zihan Liu et al. "ChatQA: Building GPT-4 Level Conversational QA Models". In: arXiv preprint arXiv:2401.10225 (2024).
[9] Arpita Vats et al. "Exploring the Impact of Large Language Models on Recommender Systems: An Extensive Review". In: arXiv preprint arXiv:2402.18590 (2024).
[10] Likang Wu et al. "A survey on large language models for recommendation". In: World Wide Web 27.5 (2024), p. 60.
[11] Bowen Zheng et al. "Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation". In: 2024 IEEE 40th International Conference on Data Engineering (ICDE). 2024, pp. 1435–1448. DOI: 10.1109/ICDE60146.2024.00118.

Continuous Planning of a Fleet of Shuttle Vans as Support for Dynamic Pricing

Filip Stavrov (stavrovf@gmail.com) and Luka Stopar (luka.stopar@ijs.si)
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
Both authors contributed equally to this research.
https://doi.org/10.70314/is.2024.sikdd.27

ABSTRACT

This paper addresses the problem of estimating the number and type of resources required for the pickup and delivery of passengers at some time in the future. By combining optimization and sampling methods, and by making plans based on several statistical samples, we estimate the real values for the required resources and show how the sample values converge towards the real values. Our approach combines machine-learning based demand predictions for the number of passengers with a route optimization engine that assigns the passengers to shared shuttle vehicles. To validate our method, we create baseline data that is representative of the real values. We test our approach using this baseline data and obtain statistically significant results.
KEYWORDS

statistical samples, demand predictions, route optimization engine, sampling techniques, optimization technique

1 INTRODUCTION

The effective allocation of resources is a critical topic in the mobility industry. Anticipating the number and type of resources required can significantly enhance a company's ability to plan accurately for the future. Our work addresses this challenge by focusing on how to estimate the number and type of vehicles needed for passenger pickup and delivery at a future time. The input to our problem consists of machine learning-based demand predictions, which provide estimates of the number of passengers across the various routes offered by the company. These predictions are provided daily and further broken down into hourly estimates for each day.

Once we receive these predictions, our goal is to simulate reservations based on this data. For instance, if the predictions indicate that 12 passengers will travel from Ljubljana to Koper on October 20, 2024, we would simulate reservations using sampling techniques; one particular example is creating four separate bookings: one for five passengers, one for three, and two for two passengers each. We introduce the sampling techniques used in this process in greater detail later on.

After generating these reservations, the next step is to input them into the Route Optimization Engine to generate a plan for that day. This plan specifies the number of vehicles required and the specific reservations each vehicle will serve.

The main hypotheses that our approach explores and experimentally tests are the following:
• H1: We can accurately estimate the number of required resources using optimization methods based on predicted passenger numbers.
• H2: Monte Carlo sampling of historical distributions can effectively model uncertainty in demand predictions, leading to stable resource estimations.
• H3: Creating plans based on several sample values will converge towards the actual number of required resources.

On the other hand, the key assumptions and limitations that underline our research are:
• Prediction Accuracy: We assume that the predictions effectively estimate the number of future passengers.
• Passenger Distribution: We assume that the number of passengers follows a Poisson distribution and that the distributions on different routes are independent.
• Independence: We assume that the passenger distribution and the window type distributions are independent of each other.
• Concept Drift: We assume there is no concept drift in the data, meaning the underlying data patterns do not change over time.
2 RELATED WORK

The problem of resource allocation in the mobility industry, particularly in the context of vehicle routing and passenger demand prediction, has been studied extensively. Traditional methods for vehicle routing often rely on static models that assume known and deterministic demand. However, recent advances in machine learning and optimization have enabled more dynamic approaches that can account for uncertainty and variability in demand [3][4]. For instance, predictive analytics has been employed to forecast passenger demand from historical data, and the forecasts can then be fed into optimization algorithms to determine the optimal allocation of vehicles. Monte Carlo simulation is another technique commonly used to model uncertainty in demand predictions, providing a probabilistic framework for decision-making under uncertainty [2]. Moreover, dynamic vehicle routing approaches have demonstrated the benefits of real-time adjustments to routing plans based on updated demand information [1]. The integration of these methodologies into a continuous planning framework is relatively novel and addresses the limitations of static planning approaches, particularly in highly variable and uncertain environments [1][5].

3 METHODOLOGY

Our methodology begins with demand predictions for the number of passengers, and the ultimate goal is to determine the number and type of vehicles required, as well as the reservations each vehicle will serve. Figure 1 provides a detailed overview of this process.

[Figure 1. Methodology]

Starting with the demand predictions, we apply sampling techniques to simulate reservation data. Specifically, we take the predicted number of passengers for different routes at various times and generate reservations through sampling. This reservation data follows a specific format, including fields such as ID, start location, end location, pickup time, and more. Key attributes include the number of passengers per reservation and the window type, which reflects travel preferences: some passengers may prefer a private vehicle (VIP), while others are open to sharing the ride. Additionally, the window interval is crucial; it can be a specific time or a more flexible period, affecting both the service pricing and the overall experience. These factors will be incorporated into the dynamic pricing model later on.

The process thus begins with demand predictions and culminates in the generation of reservation data. The critical steps are sampling the number of passengers per reservation, the window type, and the window length. Sampling is done from probabilistic distributions derived from historical data; the distributions are illustrated in Figures 2–4.

[Figure 2. Window type distribution]
[Figure 3. Window length distribution]
[Figure 4. Number of passengers distribution]

Please note that from a single demand prediction input file we generate 100 independent samples of reservation data; this is how the approach propagates the uncertainty introduced by probabilistic sampling. Each independent sample is then submitted as a separate job to the Route Optimization Engine, which solves a vehicle routing problem with time constraints. The output of each job is a plan corresponding to the reservation data. Our final objective is to aggregate these results and analyze the insights they provide.
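To make the sampling step concrete, the sketch below splits a predicted passenger total into individual reservations and draws the window attributes from empirical distributions. The group-size and window distributions are invented placeholders for those in Figures 2–4, and the reservation fields are simplified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical empirical distributions estimated from historical bookings
# (placeholders for the actual distributions in Figures 2-4).
GROUP_SIZES, GROUP_P = [1, 2, 3, 4, 5], [0.35, 0.30, 0.15, 0.12, 0.08]
WINDOW_TYPES, WINDOW_P = ["exact", "flexible", "vip"], [0.5, 0.4, 0.1]
WINDOW_LENGTHS_H = [1, 2, 4]  # hours, sampled uniformly here for simplicity

def sample_reservations(total_passengers, origin, dest, date):
    """Split a predicted passenger total into individual reservations."""
    reservations, remaining = [], total_passengers
    while remaining > 0:
        size = min(int(rng.choice(GROUP_SIZES, p=GROUP_P)), remaining)
        reservations.append({
            "from": origin, "to": dest, "date": date,
            "passengers": size,
            "window_type": str(rng.choice(WINDOW_TYPES, p=WINDOW_P)),
            "window_hours": int(rng.choice(WINDOW_LENGTHS_H)),
        })
        remaining -= size
    return reservations

# One of the 100 independent samples for the example route.
plan_input = sample_reservations(12, "Ljubljana", "Koper", "2024-10-20")
print(plan_input)
```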
4 RESULTS

After solving all 100 jobs, we obtained 100 independent plans and began analyzing the results. As shown in Figure 5, the distribution of the number of passengers yielded a mean value of 325.87 with a standard deviation of 16.85. For the number of vehicles, the mean was 38.01 with a standard deviation of 3.06. Notably, the passenger data exhibit significantly more variance than the vehicle data. This is expected: passengers are grouped into visits, and visits are then allocated to vehicles, resulting in less variation in the vehicle count.

[Figure 5. Sampled data: visits, vehicles and passengers distributions]

To further validate our approach, we created a baseline using the same data from which the demand predictions were generated. We generated 100 samples from this baseline and submitted them as independent jobs. Upon completion, we compared the baseline results with those of our sampled data. The mean number of vehicles in the baseline was 37.81 with a standard deviation of 3.01, which closely aligns with the values from our sampled data. The comparison is shown in Figure 6.

[Figure 6. Comparison of required vehicles between sampled and baseline data]

We also analyzed the error distribution of the number of vehicles between the baseline and sampled data, finding a mean absolute error of 3.16. This suggests that the difference between the two sets is minor, considering that the values are sampled, and indicates a good alignment. While the mean absolute error reflects some variability in the sampled values, this is acceptable given the overall similarity of the means. Thus, despite the variance, the sampled values converge towards the actual values. The error distribution is displayed in Figure 7.

[Figure 7. Required vehicles - error distribution]

To test statistically whether the sampled and baseline data have the same mean number of vehicles, we conducted Welch's t-test. The results showed a test statistic of 0.59, a p-value of 0.55, and a 95% confidence interval for the mean difference ranging from -0.64 to 1.23 vehicles. Given the p-value, we fail to reject the null hypothesis: there is no statistically significant difference between the sampled and baseline vehicle counts. Additionally, the confidence interval falls within our practical significance threshold of up to 2 vehicles, further supporting the similarity between the two datasets. This indicates that we can effectively estimate the number of required resources by applying optimization techniques on top of the demand prediction values.
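The test can be reproduced along the following lines with SciPy. The two arrays are synthetic stand-ins generated with the reported means and standard deviations, so the resulting numbers only approximate those above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Stand-ins for the 100 sampled and 100 baseline vehicle counts
# (in the paper these come from the optimization jobs).
sampled = rng.normal(38.01, 3.06, size=100)
baseline = rng.normal(37.81, 3.01, size=100)

# Welch's t-test: equal_var=False allows unequal variances.
t_stat, p_value = stats.ttest_ind(sampled, baseline, equal_var=False)

# 95% confidence interval for the mean difference (Welch-Satterthwaite df).
v1 = sampled.var(ddof=1) / sampled.size
v2 = baseline.var(ddof=1) / baseline.size
df = (v1 + v2) ** 2 / (v1 ** 2 / (sampled.size - 1) + v2 ** 2 / (baseline.size - 1))
half_width = stats.t.ppf(0.975, df) * np.sqrt(v1 + v2)
diff = sampled.mean() - baseline.mean()
print(t_stat, p_value, (diff - half_width, diff + half_width))
```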
We also analyzed the mean number of sampled vehicles and observed that it converges toward the actual value as the number of samples increases, as shown in Figure 8.

[Figure 8. Convergence of means of sampled vehicles]

Finally, after obtaining both the number of passengers and the number of vehicles, we fitted a linear regression to explore whether we could simplify the process and avoid the detailed approach described above. As illustrated in Figure 9, the regression line serves as a reasonable estimator of the number of vehicles given the number of passengers. However, this model struggles to capture the non-linear relationships induced by the various optimization types, window lengths, and travel modes, resulting in considerable variance around the regression line. While it is generally true that a higher number of passengers correlates with an increased number of vehicles, this relationship can be misleading: different travel types can accommodate different numbers of passengers per vehicle, which disrupts the linear relationship, especially where such travel types dominate. Consequently, although the linear regression provides a solid approximation, it overlooks essential non-linear factors that are critical to our analysis. Our approach, which integrates these factors, proves more robust and effective. The linear regression line and the data correlation are presented in Figure 9.

[Figure 9. Regression Analysis]

5 CONCLUSION

In conclusion, our findings demonstrate that we can effectively estimate the number of required resources by employing optimization methods based on predicted passenger numbers. As the number of samples increases, the sampled values consistently converge toward the actual resource requirements, reinforcing the reliability of our approach. Alternative methods, such as linear regression, fail to adequately address the non-linear complexities inherent in resource allocation, such as varying optimization types and window lengths. Our method, which incorporates these factors, proves to be a far more accurate and effective solution for resource estimation in the mobility industry.

ACKNOWLEDGMENTS

Our research is part of a broader, multi-partner initiative called CONDUCTOR. The primary objective of this project is to design, integrate, and demonstrate advanced, high-level traffic and fleet management systems. These systems aim to optimize the transport of passengers and goods efficiently on a global scale, ensuring seamless multimodality and interoperability. The CONDUCTOR project is co-funded by the European Union's Horizon Europe research and innovation programme under Grant Agreement No 101077049.

REFERENCES
[1] Berbeglia, G., Cordeau, J. F., & Laporte, G. (2010). Dynamic pickup and delivery problems. Transportation Research Part B: Methodological, 44(5), 667-684. https://doi.org/10.1016/j.trb.2009.10.004
[2] Ulmer, M. W., Thomas, B. W., & Mattfeld, D. C. (2018). Preemptive depot returns for same-day delivery under uncertain customer availability. European Journal of Operational Research, 269(2), 356-371. https://doi.org/10.1016/j.ejor.2017.08.008
[3] Bertsimas, D., & Sim, M. (2004). The Price of Robustness. Operations Research, 52(1), 35-53. https://doi.org/10.1287/opre.1030.0065
[4] Ghiani, G., Guerriero, F., Laporte, G., & Musmanno, R. (2003). Real-time vehicle routing: Solution concepts, algorithms and parallel computing strategies. European Journal of Operational Research, 151(1), 1-11. https://www.sciencedirect.com/science/article/abs/pii/S0377221702009153
[5] Psaraftis, H. N., Wen, M., & Kontovas, C. A. (2016). Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1), 3-31. https://doi.org/10.1002/net.21628

Knowledge graph Extraction from Textual data using LLM

Khasa Gillani (khasagillani22@gmail.com), Jožef Stefan Postgraduate School, Ljubljana, Slovenia
Erik Novak (erik.novak@ijs.si), Jožef Stefan Institute, Ljubljana, Slovenia
Klemen Kenda (klemen.kenda@ijs.si), Jožef Stefan Institute and Qlector, Ljubljana, Slovenia
Dunja Mladenić (dunja.mladenic@ijs.si), Jožef Stefan Institute and Jožef Stefan Postgraduate School, Ljubljana, Slovenia

ABSTRACT

The advent of Large Language Models (LLMs), such as ChatGPT and GPT-4, has revolutionized natural language processing, opening avenues for advanced textual understanding. This study explores the application of LLMs in developing Knowledge graphs from textual data. Knowledge graphs offer a structured representation of information, facilitating enhanced comprehension and utilization of unstructured text. We construct Knowledge graphs that capture relationships and entities within diverse textual datasets by harnessing LLMs' contextual understanding and language generation capabilities. The primary goal is to explore and understand how well LLMs can identify and extract relevant entities and relationships from textual data using prompt engineering, while contributing to structured knowledge representation.

[Figure 1: Overview of the proposed approach, where input text is processed through Termboard to generate a structured prompt for the LLM, creating an entity-relation table used to build a Knowledge graph (KG).]
KEYWORDS: Knowledge graph, Large Language Models, prompt engineering, information extraction, textual data

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.15

1 INTRODUCTION

In an era where data is ubiquitous, the efficient organization, retrieval, and interpretation of textual information are crucial. Knowledge graphs, representing facts and relationships in structured form, play a pivotal role in various AI applications, from enhancing search engines to powering recommendation systems. However, the construction of these graphs is often hindered by the complexity and variability of human language. This paper explores the potential of Large Language Models, like GPT-4, to revolutionize this process. By leveraging their advanced natural language understanding capabilities, we aim to automate and refine the extraction of knowledge from textual datasets. The fundamental purpose of this research is to understand the extent to which LLMs can identify and extract relevant entities and relationships from textual data and then build a Knowledge graph from the extracted information.

The motivation behind this study stems from the growing need to effectively manage and utilize the vast amounts of textual data generated daily. Knowledge graphs offer a structured and intuitive way to represent information, but their construction is often labor-intensive and requires expert knowledge. Constructing Knowledge graphs from unstructured text is intricate and depends on sophisticated natural language processing (NLP) methods, including named entity recognition (NER) and relation extraction. The advancement of LLMs like GPT-4 presents an opportunity to automate and improve this process, as illustrated in Figure 1. Utilizing LLMs can lead to more efficient, scalable, and accurate Knowledge graph construction, thereby unlocking new possibilities in information management and AI applications.

2 BACKGROUND

An overview of recent research in Large Language Models and Knowledge graphs is provided in this section, which also emphasizes the potential for their integration.

2.1 Large Language Model (LLM)

Large Language Models are advanced AI systems pre-trained on extensive data, enabling them to comprehend and produce human language. Their recent surge in popularity is due to their proficiency in various language-processing tasks, including text completion, translation, summarization, and question answering. These models, primarily based on the transformer architecture, utilize self-attention mechanisms through encoder and decoder modules. Encoders transform input text into numerical embeddings that reflect context and meaning, while decoders use these embeddings to generate coherent and pertinent textual output. Many large language models feature a decoder-only architecture and thus predict the target output text using only the decoder module. The training paradigm for these models is to predict the next word in the sentence. Generally, large-scale decoder-only LLMs such as ChatGPT [7] and GPT-4 [2] focus on human-like language output, predicting subsequent words based on the preceding text for tasks like text generation.
Table 1: Simplified comparison between Large Language Models (LLMs) and Knowledge graphs (KGs)

Feature | LLM | KG
Knowledge type | Broad, general knowledge | Structured, domain-specific knowledge
Data handling | Flexible, can process varied inputs | Requires structured data
Accuracy | May lack precision in understanding | Highly accurate with structured data
Understanding | Can interpret and generate language | Designed for specific queries and relationships
Adaptability | Adapts to new information by retraining | Adaptable when updated with new data
Transparency | Often seen as "black boxes" with unclear reasoning | Clear decision-making pathways
Error rate | Can make mistakes due to broad generalizations | Can be prone to errors if data is incorrect or missing
Complexity | Handles complex language tasks | Manages complex relationships and attributes
Usage | Broad applications in text generation, translation, etc. | Used for specific tasks like recommendations, search optimization
Scalability | Scales with computational power | Scales with the amount of structured data available

2.2 Knowledge graph (KG)

Knowledge graphs are structured representations of information that depict the relationships between entities in a specific domain. They are used extensively in various applications, such as search engines, recommendation systems, and question-answering systems. These graphs use detailed connections between data items to support reasoning, to make specific information easy to find, and to run knowledge-driven applications; hence, they allow us to better understand and use information across multiple fields. Knowledge graphs provide a structured way of representing interconnected knowledge. They are precise and consistent, aiding decisive and informed decision-making. KGs are particularly valuable for their interpretability and explainability due to the explicit representation of entities and relationships. They can capture domain-specific information accurately and evolve to incorporate new data. However, KGs may suffer from incompleteness and may not always reflect the most recent or unseen facts. They also typically cannot understand natural language in an unstructured format [3][6]. Moreover, KGs are preferred in scenarios where explainability and interpretability are crucial, as they provide structured knowledge representation.
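To make the structured-triple representation concrete, here is a minimal sketch of a Knowledge graph held as (head, relation, tail) triples; the facts below are illustrative toys loosely echoing Figure 1, not data from the paper.

```python
import networkx as nx

# A Knowledge graph as a set of (head, relation, tail) triples,
# stored in a directed multigraph.
triples = [
    ("JSI", "located_in", "Slovenia"),
    ("APRIORI", "hosted_by", "JSI"),
    ("APRIORI", "research_area", "Explainable AI"),
]

kg = nx.MultiDiGraph()
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)

# Structured querying, e.g. every stored fact about APRIORI:
for _, tail, data in kg.out_edges("APRIORI", data=True):
    print("APRIORI", data["relation"], tail)
```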
2.3 Combining LLM and KG

The comparison between Large Language Models and Knowledge graphs (Table 1) can be supported by various references that highlight their respective strengths and weaknesses [4]. Large Language Models like ChatGPT [7] are celebrated for their generalizability and ability to process diverse text data, allowing them to perform various language-related tasks without extensive task-specific training. They can act as reservoirs of general knowledge, aiding information synthesis and research. Their proficiency in language processing is useful in tasks like natural language understanding and sentiment analysis. However, they can suffer from hallucinations, where they generate plausible but factually incorrect information. Their "black-box" nature makes it difficult to understand their internal decision-making processes, and they can be indecisive, producing uncertain responses to ambiguous inputs. Additionally, while they have vast general knowledge, they may not be up to date with domain-specific or the latest information. Critics of LLMs argue that these models lack transparency and interpretability. Recent research efforts [3][4] are, however, improving LLMs' interpretability through techniques like attention mechanisms and model introspection. KGs also present advantages over LLMs by providing knowledge about long-tail entities, thus improving recall for knowledge computing tasks. However, both LLMs and KGs can perpetuate biases present in their training data or construction methodologies. In conclusion, both LLMs and KGs have their unique strengths and challenges. While LLMs excel in general language processing and knowledge extraction from vast corpora, KGs provide a structured and interpretable way to organize explicit knowledge. These differences underscore the potential benefits of integrating LLMs and KGs to create more robust AI systems that leverage the strengths of both approaches.

3 PROOF OF CONCEPT: ANALYSIS AND KNOWLEDGE GRAPH GENERATION

This section demonstrates how we process and analyze textual data to build a Knowledge graph using an LLM. It is important to mention that prompt engineering [5] is of great importance for the results generated by ChatGPT: since it is a generative model, small variations in the input sequence can create large differences in the produced output, as demonstrated below. We use two textual files containing contextual data: (i) the APRIORI proposal (containing project details, job descriptions, potential candidate skills, hosting organizations, etc.) and (ii) the ADRIA Motorhome instruction manual (containing textual as well as tabular data). Moreover, building a KG out of the ADRIA instruction manual has potential applications in the manufacturing industry.

3.1 Using ChatGPT Prompts

We compare the entities and relations extracted by ChatGPT-3.5 and GPT-4 using the same prompts. We use Termboard (https://termboard.com/), which offers customized ChatGPT prompts to create terms, entities, and relations and to visualize larger graphs from the provided text.

Prompt: "Extract an ontology and create a table of relations with 3 columns in this order: source, target, and relation name. Also create a table with 2 columns: put in the first column the name of the term and in the second column an elaborate definition of the term. Use this text as a basis: 'APRIORI'" (the APRIORI text contains data about the job description, candidate skills, project description, hosting organization, etc.).

Observing the Knowledge graphs generated by ChatGPT-3.5 (Figure 2) and GPT-4 (Figure 3), we notice that not all entities and relations were extracted and that some terms/concepts are missing. For this reason, we ran a second, more detailed prompt asking GPT-4 to explicitly generate a comprehensive ontology including all entities and relations from the provided text, to categorize entities into types such as Persons, Organizations, Concepts, and Geographic Locations, and then to identify the relations between these entities. Providing this additional information to GPT-4 resulted in an improved Knowledge graph (Figure 4). However, ChatGPT-3.5 did not produce a quality graph (Figure 5) compared to Figure 2.
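Extraction like the above can also be scripted rather than run through the ChatGPT interface or Termboard. The following is a minimal sketch assuming the `openai` Python client and an API key in the environment; the model name and the idea of parsing the reply downstream are assumptions, not the exact setup used in this study.

```python
from openai import OpenAI  # assumes the `openai` package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract an ontology and create a table of relations with 3 columns "
    "in this order: source, target, and relation name. Also create a table "
    "with 2 columns: the name of the term and an elaborate definition of "
    "the term. Use this text as a basis:\n\n{document}"
)

def extract_relations(document: str, model: str = "gpt-4") -> str:
    """Send the extraction prompt to the model; the reply contains the
    relation and term tables as plain text, to be parsed downstream."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(document=document)}],
    )
    return response.choices[0].message.content
```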
[Figure 2: The KG generated using ChatGPT-3.5 contains 20 entities. It was able to extract entities and link them to relations, but it failed at abstracting concepts and specifying entities (e.g. partner organizations, location, etc.).]

[Figure 3: The KG generated by GPT-4 contains 16 entities. It was able to identify abstract concepts and geographic entities that ChatGPT-3.5 did not, and extracted more elaborate entities with relations.]

[Figure 4: The KG generated by GPT-4 with the refined prompt contains 22 entities. It identified more key entities and relevant concepts and found suitable relations to connect them (e.g. participant – Katholieke Universiteit Leuven). However, it did not cover all relations and classes (e.g. skills). We also notice a few duplicated entities (e.g. data mining, CO2 emission) and some isolated entities (e.g. sustainable manufacturing).]

[Figure 5: ChatGPT-3.5 with the refined prompt was able to extract a larger number of entities but was not successful at abstracting concepts, and relations are missing. The extracted entities and relations frequently represent complete sentences rather than concepts. This occurs because ChatGPT is a conversational model trained to create responses to a given prompt and is not specifically trained to recognize entities and relations.]

3.2 Python Implementation

We use the free, open-source library spaCy (https://spacy.io/models) for advanced NLP in Python. We employ named entity recognition to identify named entities in a given text using the spaCy model en_core_web_sm. We used a chunk of textual data from the ADRIA Motorhome manual for experimental purposes. Table 2 compares the entities, relations, and triplets extracted from the raw texts. The table shows that the numbers of triplets extracted by the algorithms are similar (Figure 6 and Figure 7). However, the number of entities that spaCy extracts is larger, but not every pair of entities is connected by a meaningful relation, leading to fewer triplets and thus defeating the purpose of creating a Knowledge Base. When using spaCy for entity extraction, entities are typically recognized based on the named entities present in the text: named entities are often specific nouns, such as names of people, organizations, locations, dates, or product names, and spaCy might not identify a domain-specific term as an entity by default. To extract such specific entities, one might need to customize spaCy's NER model or provide additional context for better recognition. Hence, results can be improved by pre-processing the data into a structured format.
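A minimal sketch of this pipeline follows. The text snippet is a stand-in for the ADRIA manual chunk, and the relation heuristic (linking consecutive entities through the sentence's root verb) is our illustrative simplification, not the exact logic used in the experiment.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Stand-in snippet for the chunk of the ADRIA Motorhome manual.
text = ("ADRIA recommends that the vehicle is checked by an authorised "
        "ADRIA dealer in Slovenia before driving on public roads.")

doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Naive relation heuristic: link consecutive entities within a sentence
# through the sentence's root verb (a rough stand-in for real relation
# extraction, which would need dependency patterns or a trained model).
triplets = []
for sent in doc.sents:
    ents = [e for e in doc.ents if sent.start <= e.start and e.end <= sent.end]
    for a, b in zip(ents, ents[1:]):
        triplets.append((a.text, sent.root.lemma_, b.text))

print(entities)
print(triplets)
```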
Table 2: Knowledge extraction comparison (ADRIA motorhome manual dataset)

Algorithm | Entities | Relations | Triplets
GPT-4 | 18 | 20 | 20
ChatGPT-3.5 | 24 | 18 | 18
spaCy | 22 | 14 | 17

[Figure 6: The KG generated by GPT-4 contains 18 entities using the ADRIA motorhome instruction manual. It extracted concepts relevant to ADRIA users and vehicle instructions, their functions, and how they are connected.]

[Figure 7: The KG generated by ChatGPT-3.5 contains 24 entities. It extracted more entities relevant to ADRIA vehicles, but the relations between entities are more generic and entities are duplicated.]

4 EVALUATION

When there is no ground truth data available, creating an automated evaluation metric for a Knowledge graph becomes challenging. In such cases, the evaluation relies on qualitative principles to assess the results. Based on the practical framework defined in the study [1], the following principles were identified:
• Triplets should be concise.
• The contextual information of entities should be captured.
• The Knowledge graph should not contain redundant triples.
• Entities should be densely connected.
• Relations among different types of entities should be included.
• Knowledge graphs should be organized in structured triples for easy processing by machines.
• For tasks specific to a particular domain, it is essential that the Knowledge graph is tailored and relevant to that specific field.

According to these principles, we manually inspected the Knowledge graphs generated above for our use case, and we can conclude that the ChatGPT-3.5 approach provides a more detailed Knowledge graph, without abstract concepts, compared to GPT-4. However, to create these Knowledge graphs, a few rounds of refining the answers from ChatGPT are needed: sometimes the produced output is incorrect and needs to be corrected before proceeding. When we redefined the prompt, GPT-4 identified more specific entities and concepts compared to ChatGPT-3.5. Even though ChatGPT extracted a larger number of entities, it failed to provide abstract concepts and entity relations.

In the second part of the experiment, we employed the NER method to extract relations and entities from the given text (the ADRIA manual). We found that the extracted entities are duplicated and that the relations contain noise and incomplete information. If there are specific patterns or structures according to which entities and relations should be extracted, the relation extraction logic may need to be customized. Alternatively, more advanced natural language processing techniques or pre-trained models designed for relation extraction might provide better results. We also found that about half of the relation-entity pairs extracted by spaCy and ChatGPT overlap.

5 CONCLUSION

The proposed exploration of using LLMs for Knowledge graph extraction holds promise for advancing our understanding of how advanced language models can contribute to structured knowledge representation. This paper explores using LLMs to generate Knowledge graphs from source documents. We utilized the ChatGPT-3.5 and GPT-4 models to generate Knowledge graphs for two different textual datasets and compared the structure of the resulting KGs. GPT-4 performed better, as it successfully identified more abstract concepts and key entities than ChatGPT-3.5. The paper thus provides insights into the practical application of LLMs for developing structured knowledge from unstructured textual data, with potential uses in knowledge-based AI applications, paving the way for more effective information processing and utilization. In future studies, we intend to use a more formal framework to evaluate the quality of the created Knowledge graphs. Such a framework will allow us to efficiently analyze the quality of a KG and provide a standardized method to forecast missing linkages between concepts and relationships within a given domain.
ACKNOWLEDGEMENTS

This research is supported by EU funding, HE MSCA Project APRIORI (GA: 101073551). The authors acknowledge the usage of ChatGPT and Grammarly for content paraphrasing, grammar, and error checking.

REFERENCES
[1] Haihua Chen, Gaohui Cao, Jiangping Chen, and Junhua Ding. 2019. A practical framework for evaluating the quality of knowledge graph. In Knowledge Graph and Semantic Computing: Knowledge Computing and Language Understanding: 4th China Conference, CCKS 2019, Hangzhou, China, August 24–27, 2019, Revised Selected Papers 4. Springer, 111–122.
[2] OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[3] Jeff Z. Pan et al. 2023. Large language models and knowledge graphs: opportunities and challenges. arXiv preprint arXiv:2308.06374.
[4] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: a roadmap. IEEE Transactions on Knowledge and Data Engineering.
[5] Elvis Saravia. 2022. Prompt Engineering Guide.
[6] Milena Trajanoska, Riste Stojanov, and Dimitar Trajanov. 2023. Enhancing knowledge graph construction using large language models. arXiv preprint arXiv:2305.04676.
[7] Ce Zhou et al. 2023. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT. arXiv preprint arXiv:2302.09419.

Solving hard optimization problems of packing, covering, and tiling via clique search

Sándor Szabó (sszabo7@hotmail.com), University of Pécs, Pécs, Hungary
Bogdán Zaválnij (bogdan@renyi.hu), HUN-REN Alfréd Rényi Institute of Mathematics, Budapest, Hungary

Abstract

In this paper we propose to convert NP-hard combinatorial optimization problems of packing, covering, and tiling types into maximum or k-clique problems. The key step is to come up with a tactically constructed auxiliary graph whose maximum or k-cliques correspond to the sought combinatorial structure. As an example, we consider the problem of packing a given cube with copies of a brick. The aim of the paper is twofold: to illustrate (i) the modeling power and (ii) the feasibility of the clique approach. Since theoretical tools are not readily available to study the effectiveness of the solution of the resulting clique problems, we carry out carefully conducted numerical experiments.

Keywords: mathematical programming, k-clique problems, combinatorial optimization

Information Society 2024, 7–11 October 2024, Ljubljana, Slovenia. © 2024 Copyright held by the owner/author(s). https://doi.org/10.70314/is.2024.sikdd.9
First, problem of packing a given cube by copies of a brick. The we describe the basic problem, then we present theoretical aim of the paper is two fold to illustrate (i) the modeling discussion of different reformulations, and finally we de- power and (ii) the feasibility of the clique approach. Since scribe the results of numerical experiments. The emphasis theoretical tools are not readily available to study the effec- is on the modeling aspect of the computation and not on tiveness of the solution of the resulting clique problems we reaching new records, as the proposed problem was solved will carry out carefully conducted numerical experiments. in theoretical manner within months of its formulation. Here we use it as a prototype of similar problems, and our Keywords aim to show the versatility of our approach, that is model a problem by a graph. mathematical programming, 𝑘-clique problems, combina- Graphs in this paper will be finite simple graphs. Further torial optimization all graphs we use will not have loops or double edges. A finite simple graph 𝐺 can be described with its set of nodes 1 Introduction 𝑉 and a subset 𝐸 of the Cartesian product 𝑉 × 𝑉 . The subset 𝐸 can be identified by the set of edges of 𝐺. One can see graphs as a mathematical models that can Let 𝐺 = (𝑉, 𝐸) be a finite simple graph. A non-empty describe various fields of interest. Like numbers, functions, subset 𝐶 of 𝑉 is called a 𝑘-clique if each two distinct nodes or Linear Programming graph based approach can model of 𝐶 are adjacent in 𝐺 and in addition 𝐶 has exactly 𝑘 interesting problems and aid us in solving them. Some elements. If 𝐶 has only one element, then we consider it a of these approaches are quite straightforward like cliques 1-clique. The 2-cliques of 𝐺 are the edges of 𝐺. A 𝑘-clique of people in a social interaction graphs or shortest path 𝐶 of 𝐺 is called a maximum clique if 𝐺 does not have problem in a road map. Other approaches are less obvious any (𝑘 + 1)-clique. For each finite simple graph 𝐺 there is but still easily constructed, like conflict graphs in a set of an integer 𝑘 such that 𝐺 contains a 𝑘-clique but 𝐺 does codewords where a maximum independent set represents a not contain any (𝑘 + 1)-clique. This well defined integer maximum set of suitable error correcting codes [9]. 𝑘 is called the clique number of 𝐺. We state two clique But the approach of modeling and solving various prob- problems formally. lems by graphs are more versatile. Namely, we can see graphs as a language for mathematical programming – if Problem 1. Given a finite simple graph 𝐺 and an inte- certain combinatorial problems can be solved by construct- ger 𝑘. Decide if 𝐺 has a 𝑘-clique. ing a suitable auxiliary graph and finding a maximum or 𝑘-clique of this graph gives the solution. The authors have Problem 2. Compute the clique number of a given finite already used this approach in connection with mathemat- simple graph. ical conjectures [1], hyper graph coloring [11], subgraph isomorphism [2], scheduling problems [12], graph coloring Problem 1 is a decision problem, it is referred as the 𝑘- problems [13] and protein docking problems in chemistry clique problem, and it is an NP-complete problem included [8]. in the original list of 21 NP-complete problems by Karp Here we would like to give an example, where a hard [7]. Problem 2 is an optimization problem and referred as combinatorial optimization problem can be solved by this the maximum clique problem, and as the decision problem approach. 
We color the nodes of a finite simple graph G with the colors 1, 2, ..., k such that each node receives exactly one color and adjacent nodes never receive the same color. Such a coloring of the nodes of G is called a well coloring, a proper coloring, or a legal coloring (the terminology is not unified). The set of nodes of G receiving the color i is called the i-th color class. Clearly, a color class is an independent set of G; that is, two nodes from a fixed color class are never adjacent.

If the nodes of a finite simple graph can be legally colored using k colors, then we say that G is a k-partite graph. The reason is that in this situation the nodes of G form a union of k independent sets, and these sets are pair-wise disjoint.

In this paper we will focus on the following clique problem.

Problem 3. Given a finite simple graph G whose nodes are legally colored using k colors, decide if G has a k-clique.

Problem 3 is the k-clique problem particularized to the case of k-partite graphs. This problem is still NP-complete, as the graph coloring problem can be reduced to such a question, as shown in [13]; it should not be confused with the problem on complete graphs.

The problem class we focus on in the present paper consists of packing, covering, and tiling problems. Obviously, many real-world and mathematical problems fall into this class, and here we show some ideas for how such problems can be modeled by a suitably constructed auxiliary graph in which a k-clique search solves the original problem.

2 Packing, covering, and tiling

First, we describe the problem class in question. Second, we draw up some basic concepts of how these problems can be modeled by graphs.

Let U be a finite ground set and let

A1, ..., Am    (1)

be subsets of U. A family of subsets

B1, ..., Bn    (2)

with {B1, ..., Bn} ⊆ {A1, ..., Am} is called a packing of U if the members of the family (2) are pair-wise disjoint. A family of subsets (2) is called a covering of U if the union of (2) is equal to U. Phrasing it differently, a family of subsets (2) is a covering of U if each element of U belongs to at least one member of the family (2). If a family of subsets (2) is a packing and a covering of U at the same time, then it is called a tiling of U. A tiling of U is sometimes referred to as an exact covering of U.

A packing of U is called a k-packing if it consists of k subsets of U. Similarly, a covering of U is called a k-covering if it consists of k subsets of U. Finally, a tiling of U is called a k-tiling if it consists of k subsets of U. For a given ground set U and for its given subsets (1) there is an integer k such that U has a k-packing using subsets of the family (1) but there is no (k+1)-packing of U using members of the family (1). This well-defined integer k is the packing number of U with respect to the family (1). If the packing number of U is equal to k, then each k-packing of U is called a maximum packing of U.

For a given ground set U and for its given subsets (1) there is an integer k such that U has a k-covering using subsets of the family (1) but there is no (k−1)-covering of U using members of the family (1). This well-defined integer k is the covering number of U with respect to the family (1). If the covering number of U is equal to k, then each k-covering of U is called a minimum covering of U.
Two nodes receiving the same color we draw up some basic concepts how these problems can will be non-adjacent in 𝐺. Therefore the first type nodes be modeled by graphs. of 𝐺 are legally colored with 𝑘 colors. Let 𝑈 be a finite ground set and let We are adding second type nodes to 𝐺. Namely, we are 𝐴1, . . . , 𝐴𝑚 (1) adding the ordered pairs (𝐴, 𝑢), where 𝐴 ∈ {𝐴1, . . . , 𝐴𝑚}, be subsets of 𝑈 . A family of subsets 𝑢 ∈ 𝑈 and in addition 𝑢 ∈ 𝐴 holds. The intuitive meaning of the pair (𝐴, 𝑢) is that the element 𝑢 is covered by set 𝐵1, . . . , 𝐵𝑛 (2) 𝐴. To the node (𝐴, 𝑢) we assign 𝑢 as a color. Two nodes with {𝐵 receiving the same color will not be adjacent in 𝐺. Thus 1, . . . , 𝐵𝑛} ⊆ {𝐴1, . . . , 𝐴𝑚} is called a packing of 𝑈 if the members of the family (2) are pair-wise disjoint. A the second type nodes of 𝐺 are legally colored using 𝑡 = |𝑈 | family of subsets (2) is called a covering of 𝑈 if the union of colors. Now if we are locating a (𝑘 + 𝑡)-clique in 𝐺, then (2) is equal to 𝑈 . Phrasing it differently, a family of subsets we select exactly 𝑘 subsets from (1) and each element of (2) is a covering of 𝑈 if each element of 𝑈 belongs to at 𝑈 will belong to at least one of these subsets. The missing least one member of the family (2). If a family of subsets part of the construction, what we left for the reader, is how (2) is a packing and a covering of 𝑈 in the same time, then the first and second types of nodes are connected by edges. it is called a tiling of 𝑈 . A tiling of 𝑈 some times referred Problem 6 can be reduced to Problem 3. As a tiling is as exact covering of 𝑈 . a packing and covering at the same time, we can add the A packing of 𝑈 is called a 𝑘-packing if it consists of packing restrictions, namely not connecting two sets if they 𝑘 subsets of 𝑈 . Similarly, a covering of 𝑈 is called a 𝑘- intersect, to the second type of nodes. On the other hand covering if it consists of 𝑘 subsets of 𝑈 . Finally, a tiling – in case of equal size sets –, we do not need to count the of 𝑈 is called a 𝑘-tiling if it consist of 𝑘 subsets of 𝑈 . For used sets, so we won’t need the first type of nodes, they a given ground set 𝑈 and for its given subsets (1) there can be omitted. is an integer 𝑘 such that 𝑈 has a 𝑘-packing using subsets The computational difficulties of the 𝑘-packing, 𝑘-covering, of the family (1) but there is no any (𝑘 + 1)-packing of 𝑈 and 𝑘-tiling problems are different. It seems that the cov- using members of the family (1). This well defined integer ering problems are the computationally most demanding 𝑘 is the packing number of 𝑈 with respect to the family and the tiling problems are the most manageable. (1). If the packing number of 𝑈 is equal to 𝑘, then each 𝑘-packing of 𝑈 is called maximum packing of 𝑈 . 3 Gardner’s bricks problem For a given ground set 𝑈 and for its given subsets (1) We picked Gardner’s problem because it is intuitive and there is an integer 𝑘 such that 𝑈 has a 𝑘-covering using easy to comprehend among such problems that can be subsets of the family (1) but there is no any (𝑘 −1)-covering reduced to Problem 3 and so it serves as a good illustration of 𝑈 using members of the family (1). This well defined of the kind of clique modeling we are dealing with. We do integer 𝑘 is the covering number of 𝑈 with respect to the not claim any originality in connection with the problem. family (1). If the covering number of 𝑈 is equal to 𝑘, then We do not prove any new results. Each of the facts we each 𝑘-covering of 𝑈 is called minimum covering of 𝑈 . 
3 Gardner's bricks problem

We picked Gardner's problem because it is intuitive and easy to comprehend among the problems that can be reduced to Problem 3, and so it serves as a good illustration of the kind of clique modeling we are dealing with. We do not claim any originality in connection with the problem, and we do not prove any new results. Each of the facts we use is known from the folklore, and we present them only for the reader's convenience. The problem was raised by Foregger in March 1975 [10], popularized by Gardner in February 1976 [5], and solved by Foregger and Mather in November 1976 [3].

Let us consider a brick B of dimensions 1 × 2 × 4. The brick B is a union of 8 unit cubes whose edges are parallel to the coordinate axes. For some reason unknown to us, the brick B is referred to as the canonical brick. Suppose we have a large supply of congruent copies of B and we want to pack as many as possible into a 7 × 7 × 7 cube C. The cube C is a union of 343 unit cubes. Let us divide 343 by 8 with remainder: as 343 = (42)(8) + 7, 43 copies of B cannot be packed into C. M. Gardner advanced the question whether 42 copies of B can be placed into C. One can place a copy of B into C in any rotated position as long as the edges of B are parallel to the coordinate axes. (The answer to this question is actually: no, one cannot place 42 bricks into a cube of size 7 × 7 × 7.)

Gardner's problem can be expressed in terms of computing the clique number of a suitably constructed graph G. In other words, Gardner's problem can be reduced to an instance of the maximum clique problem. Let us denote the set of the 343 unit cubes forming C by U. An 8-element subset v of U is a vertex of G if the union of the elements of v is a congruent copy of B. As it turns out, G has 1008 nodes. Two distinct nodes v and v′ of G are adjacent in G if v and v′ are disjoint. If G contains a (42)-clique, then 42 congruent copies of B can be packed into C. During our numerical experiments a greedy coloring procedure provided a legal coloring of the nodes of G using 42 colors; note that this is just a coincidence, and it could have happened otherwise. Thus we are facing a particular case of the k-clique problem stated in Problem 3: the nodes of G are legally colored with 42 colors, and we are looking for a (42)-clique in G. Phrasing it differently, we are looking for a k-clique in a k-partite graph, where k = 42.

We introduce a coordinate system whose origin coincides with a corner of the cube C.

Observation 1. If 42 congruent copies of the brick B can be packed into C, then there is such a packing which contains a congruent copy of B with one corner at the origin, whose edges of lengths 1, 2, 4 are parallel to the first, second, and third coordinate axes, respectively.

Proof. As 343 = (42)(8) + 7 holds, 7 unit cubes of C are not contained in any brick of the packing. The cube C has 8 corners, so at least one of the corners must be contained in a brick. At this point we introduce a coordinate system whose origin is this corner of C. Then we choose the first, second, and third coordinate axes so as to satisfy our requirement. □

The cube C can be sliced into 7 slabs using planes perpendicular to the first coordinate axis. Each slab is a 1 × 7 × 7 slice of the big cube, that is, a union of 49 unit cubes whose centers lie in a plane perpendicular to the first coordinate axis. The 7 unit cubes of C that are not contained in any brick of the packing are referred to as unpacked unit cubes.

Observation 2. Two distinct unpacked unit cubes of C cannot be in the same slab.

Proof. Note that a fixed slab can contain only 0, 2, 4, or 8 unit cubes of any brick of the packing. The point is that these numbers are all even, while each slab consists of an odd number of unit cubes. Therefore, each slab must contain an odd number of unpacked unit cubes. The number of slabs is 7, and so each slab must contain exactly one unpacked unit cube. □

We can also form slabs by slicing C with planes perpendicular to the second coordinate axis; each of these slabs contains exactly one unpacked unit cube. Finally, slicing C by planes perpendicular to the third axis, we get that each of these slabs also contains exactly one unpacked unit cube. These constraints on the unpacked unit cubes are independent, but can also be checked independently during an extended search, and as such can reduce the search space well. (A programmatic sketch of the graph construction described above follows.)
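The following is a minimal sketch of the construction of the graph G, assuming networkx; enumerating all axis-aligned placements of the 1 × 2 × 4 brick in the 7 × 7 × 7 cube indeed yields the 1008 nodes mentioned above.

```python
from itertools import permutations, product
import networkx as nx

N, BRICK = 7, (1, 2, 4)

def placements():
    """Every axis-aligned position of a 1 x 2 x 4 brick inside the
    7 x 7 x 7 cube, as a frozenset of the 8 unit cubes it occupies."""
    cells_sets = set()
    for dims in set(permutations(BRICK)):          # 6 orientations
        for corner in product(*(range(N - d + 1) for d in dims)):
            cells_sets.add(frozenset(
                product(*(range(c, c + d) for c, d in zip(corner, dims)))))
    return list(cells_sets)

bricks = placements()
assert len(bricks) == 1008  # the node count reported in the paper

# Auxiliary graph G1: placements are nodes, disjoint placements are
# adjacent, so a 42-clique would be a packing of 42 bricks into the cube.
G1 = nx.Graph()
G1.add_nodes_from(bricks)
for i, u in enumerate(bricks):
    for v in bricks[i + 1:]:
        if u.isdisjoint(v):
            G1.add_edge(u, v)
```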
4 Numerical experiments

Gardner's brick packing problem can be turned into various clique search problems, and we carried out numerical experiments with them. We will observe that the same geometric problem leads to very different clique search problems. When we try to pack 42 congruent copies of the canonical brick B into the big cube C, we get a k-clique problem. When we notice that the nodes of the auxiliary graph can be legally colored using 42 colors, we get a k-clique problem in a k-partite graph, which is a more tractable search problem. When we try to pack 42 congruent copies of the brick into the cube C together with 7 unit cubes, we get a tiling problem. When we try to pack 42 congruent copies of the brick into the cube C together with 7 unit cubes and in addition distinguish the unit cubes from each other, we get yet another version of the tiling problem.

In the first approach the auxiliary graph G1 had 1008 vertices. The nodes of G1 were legally colored using 42 colors, and we tried to locate a (42)-clique in G1. Note that although this graph can be colored with 42 colors, this was just a coincidence; there is no theoretical background to this fact. Of course the expectation was that G1 does not have any (42)-clique.

Let us assume that it is possible to pack 42 congruent copies of the 1 × 2 × 4 canonical brick B into the 7 × 7 × 7 cube C. By Observation 1, we may assume that a brick appears in the packing such that one of its corners coincides with the origin of the coordinate system and its edges of lengths 1, 2, 4 lie along the 1st, 2nd, and 3rd coordinate axes. This information can be interpreted as saying that there is a (42)-clique in G1 which contains a specific node, namely the vertex v1 of G1 that corresponds to the special corner brick. This suggests restricting the graph G1 to the neighbors of the vertex v1 to get a new graph G2. Then we are looking for a (41)-clique in G2. Plainly, the nodes of G2 are legally colored using 41 colors; this coloring is inherited from the coloring of the nodes of G1. The graph G2 has fewer vertices than G1 (actually 960), and we are looking for a smaller clique in G2 than in G1. The new clique problem probably requires less computational effort because the graph is smaller and because we introduced symmetry breaking into it.
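Continuing the earlier sketch, the symmetry-breaking restriction that produces G2 amounts to taking the subgraph induced by the neighbors of the corner-brick vertex. The greedy coloring call illustrates how a legal coloring can be obtained; the exact color count depends on the strategy, so it may differ from the authors' run.

```python
from itertools import product
import networkx as nx  # continues the sketch after Section 3; G1 as built there

# The corner brick of Observation 1: edges of lengths 1, 2, 4 along the
# first, second and third coordinate axes, one corner at the origin.
corner_brick = frozenset(product(range(1), range(2), range(4)))
assert corner_brick in G1

# Symmetry breaking: keep only placements compatible with the corner brick.
G2 = G1.subgraph(G1.neighbors(corner_brick)).copy()
print(G2.number_of_nodes())  # 960 in the paper; a (41)-clique is now sought

# A greedy legal coloring; the number of colors upper-bounds the clique
# number (the authors' greedy run happened to use 42 colors on G1).
colors = nx.greedy_color(G1, strategy="largest_first")
print(max(colors.values()) + 1)
```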
The problem of packing 42 bricks into the bigger cube can also be viewed as a tiling problem. Namely, we try to tile the 7 × 7 × 7 cube C by 42 copies of the canonical brick and 7 additional copies of a unit cube. Thus we are facing a tiling problem using two different types of tiles, where the number of tiles is given. To ensure that we use exactly 42 bricks, we number the small cubes as {1, ..., 7} and ensure in the graph that each small unit cube is used once; that is, we do not connect nodes where a unit cube is covered by the same small cube. This tiling problem can also be reduced to a clique search problem; we denote the corresponding graph G3. Tiling problems are more manageable than packing problems, as backtracking can be anticipated earlier during the search. However, the graph associated with the tiling in our case has more vertices than the graph associated with the packing, namely 10,465 nodes. Therefore only computations can reveal which approach is preferable.

Obviously, in this case we can also fix a brick in the corner. This version will be the G4 graph.

In the last clique search equivalent of Gardner's problem we construct a graph G5. In this construction we handle a mixed tiling problem, but we utilize the extra information that no two distinct unit cubes can appear in the same slab; by Observation 2, this may be assumed. This is done by not connecting two nodes associated with unit cubes if those unit cubes lie in the same slab. This graph has the same size as G3, as we only delete some edges from it. Also, we can fix a brick in the corner in this case as well; that shall be the G6 graph.

Once again, only numerical experiments can guide us in judging the merits of the possible clique search equivalents of the problems. Further, the preconditioning methods perform differently on the graphs G1, G2, G3, G4, G5, G6, and this adds an extra layer of difficulty to the numerical work. We used a computer with AMD EPYC 7643 processors, C++, and gcc v12.1 with the settings -O3 -march=znver3.

We made all six graphs and performed k-clique search on them after preconditioning as described in [12, 13]. The preconditioning ran for 1–2 hours for the bigger graphs and reduced them by half, namely to around 6,000 nodes for G3 and G4, and to around 4,000 for G5 and G6, that is, the graphs where we allow only one small cube per slab. For the smaller graphs (G1, G2) the preconditioner runs for a couple of seconds but cannot significantly reduce the graph. Three of the six graphs could be solved after preconditioning: G2, G5, and G6. The solution time of G2 (the original graph with a fixed brick in the corner) was 50 days. The solution of G5 was a bit faster, 29 days. Finally, the graph G6 could be solved more effectively: the running time was 123,484 seconds, that is, 34 hours. This clearly shows the importance of the extra information about the slabs.

5 Conclusions

We detailed several k-clique search reformulations of a certain combinatorial problem in terms of constructing suitable auxiliary graphs. We do not claim that these methods result in more efficient practical computations than other approaches. The point we are trying to make is that the clique reformulations open up the possibility to use well-tuned clique solvers, including preconditioning, to handle different combinatorial problems in a unified manner, as a general solver.

The results presented here have interesting consequences and suggest further research problems. First, and as anticipated, different auxiliary graphs lead to very different search space sizes. And although the usual experience in our research is that bigger graphs tend to be harder, that is not always the case: remarkably, the numerical results indicate that the size of the auxiliary graph alone is not as important as the type of the reformulation. Namely, the tiling-type auxiliary graphs required less computational effort for clique search even when they were not the smallest graphs. Second, there are additional constraints that can be added to some reformulations while they seemingly cannot be incorporated into others. An example of such a constraint is the fact described after the proof of Observation 1, namely that no two distinct unpacked unit cubes can appear in the same slab in Gardner's brick packing problem. That kind of restriction could be incorporated into the tiling version of the reformulation, while possibly not being applicable to the packing reformulation. Taking advantage of the extra constraint made it possible to solve the brick packing problem in reasonable time.

There are other problems that can be solved using similar approaches, as detailed in the paper. The authors could solve smaller instances of the Golomb ruler problem and the Salem–Spencer set problem. The results obtained with those instances, which lie outside the scope of the present paper, open up even more interesting considerations.

Acknowledgements

The present research was funded by the National Research, Development and Innovation Office – NKFIH, Fund No. SNN-135643.
References

[1] K. Corrádi and S. Szabó. A combinatorial approach for Keller's conjecture. Period. Math. Hungar., Vol. 21, 91–100, 1990.
[2] M. Depolli, S. Szabó and B. Zaválnij. An Improved Maximum Common Induced Subgraph Solver. MATCH Commun. Math. Comput. Chem., 84, pp. 7–28, 2020.
[3] T. H. Foregger and M. Mather. E2524. The American Mathematical Monthly, Vol. 83, No. 9 (Nov. 1976), pp. 741–742.
[4] D. Hespe, Ch. Schulz and D. Strash. Scalable Kernelization for Maximum Independent Sets. ACM Journal of Experimental Algorithmics, Vol. 24, Article 1.16, pp. 1–22, 2019.
[5] M. Gardner. Mathematical Games – Some elegant brick-packing problems, and a new order-7 perfect magic cube. Scientific American, Vol. 234, No. 2 (February 1976), pp. 122–127.
[6] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. Freeman, New York, 2003.
[7] R. M. Karp. Reducibility Among Combinatorial Problems. In: Complexity of Computer Computations. New York: Plenum, pp. 85–103, 1972.
[8] K. Rozman, A. Ghysels, B. Zavalnij, T. Kunej, U. Bren, D. Janežič and J. Konc. Enhanced Molecular Docking: Novel Algorithm for Identifying Highest Weight k-Cliques in Weighted General and Protein–Ligand Graphs. Journal of Molecular Structure, 1304, Paper 137639, 2024.
[9] N. J. A. Sloane. Challenge Problems: Independent Sets in Graphs. https://oeis.org/A265032/a265032.html
[10] T. H. Foregger. Elementary Problem E2524. The American Mathematical Monthly, Vol. 82, No. 3 (Mar. 1975), p. 300.
[11] S. Szabó and B. Zaválnij. Reducing hyper graph coloring to clique search. Discrete Applied Mathematics, 264, pp. 196–207, 2019.
[12] S. Szabó and B. Zaválnij. Clique search in graphs of special class and job shop scheduling. Mathematics, 10(5), 697, 2022.
[13] S. Szabó and B. Zaválnij. Graph Coloring via Clique Search with Symmetry Breaking. Symmetry, 14(8), Paper 1574, 16 p., 2022.

Indeks avtorjev / Author index

Abkari M. Wahib: 39
Amiel Tel: 35
Andrenšek Luka .......... 55
Batagelj Vladimir .......... 27
Calcina Erik .......... 93
Candia Vieira Joao Paulo .......... 77
Cherakaoui Manal .......... 39
Čibej Jaka .......... 23
Costa Luiz .......... 77
Dolinar Lenart .......... 93
Dupuis Aymeric .......... 31
Džeroski Sašo .......... 31
Evkoski Bojan .......... 19
Fijavž Zoran .......... 51
Fir Jakob .......... 105
Gilliani Khasa .......... 113
Godoy Oliveira Cristina .......... 77
Golob Luka .......... 47
Gourari Kamal .......... 39
Grigor Patricia-Carla .......... 19
Grobelnik Marko .......... 43, 81, 101
Guček Alenka .......... 81, 85
Hachimi Hanaa .......... 39
Hočevar Domen .......... 7
Hrib Ivo .......... 67
Jermol Mitja .......... 35
Kenda Klemen .......... 7, 11, 73, 113
Kholmska Ganna .......... 73
Klančič Rok .......... 11
Koloski Boshko .......... 31
Kralj Novak Petra .......... 19
Lachheb Hatim .......... 39
Leban Gregor .......... 63, 89
Longar Mark David .......... 101, 105
Martinc Matej .......... 31
Massri M. Besher .......... 81
Meira Silva Rafael .......... 77
Mladenić Dunja .......... 63, 81, 85, 89, 97, 113
Mores Neto Antonio J. .......... 35
Motamedi Elham .......... 59
Novak Erik .......... 93, 101, 113
Novalija Inna .......... 59
Pangeršič Bor .......... 105
Pisanski Jan .......... 27
Pisanski Tomaž .......... 27
Pita Costa Joao .......... 35, 39, 43, 77
Polajnar Anja .......... 35
Pollak Senja .......... 55
Purver Matthew .......... 55
Rei Luis .......... 59
Rožanec Jože M. .......... 63, 73, 89
Šinik Bogdan .......... 15
Sitar Šuštar Katarina .......... 55
Sittar Abdul .......... 47, 85
Šker Tesia .......... 89
Škrjanc Maja .......... 67
Stavrov Filip .......... 109
Stegnar Jernej .......... 63
Stopar Luka .......... 109
Šturm Jan .......... 67
Swati .......... 97
Szabo Sandor .......... 117
Topal Oleksandra .......... 67
Tošić Aleksander .......... 15
Tounsi El Azzoiani Jad .......... 39
Urbanč Luka .......... 43
Vake Domen .......... 15
Vičić Jernej .......... 15
Zaouini Mustafa .......... 39
Zavalnij Bogdan .......... 117