Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA Zvezek C Proceedings of the 23rd International Multiconference INFORMATION SOCIETY Volume C http://is.ijs.si Odkrivanje znanja in podatkovna skladišča • SiKDD 2020 Data Mining and Data Warehouses • SiKDD 2020 Uredili / Edited by Dunja Mladenić, Marko Grobelnik 5. oktober 2020 / 5 October 2020 Ljubljana, Slovenia Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2020 Zvezek C Proceedings of the 23rd International Multiconference INFORMATION SOCIETY – IS 2020 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 5. oktober 2020 / 5 October 2020 Ljubljana, Slovenia Urednika: Dunja Mladenić Department for Artificial Intelligence Jožef Stefan Institute, Ljubljana Marko Grobelnik Department for Artificial Intelligence Jožef Stefan Institute, Ljubljana Založnik: Institut »Jožef Stefan«, Ljubljana Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak Oblikovanje naslovnice: Vesna Lasič Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety Ljubljana, oktober 2020 Informacijska družba ISSN 2630-371X Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani COBISS.SI-ID=33077251 ISBN 978-961-264-192-4 (epub) ISBN 978-961-264-193-1 (pdf) PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2020 Triindvajseta multikonferenca Informacijska družba (http://is.ijs.si) je doživela polovično zmanjšanje zaradi korone. Zahvala za preživetje gre tistim predsednikom konferenc, ki so se kljub prvi pandemiji modernega sveta pogumno odločili, da bodo izpeljali konferenco na svojem področju. Korona pa skoraj v ničemer ni omejila neverjetne rasti IKT-ja, informacijske družbe, umetne inteligence in znanosti nasploh, ampak nasprotno – kar naenkrat je bilo večino aktivnosti potrebno opraviti elektronsko in IKT so dokazale, da je elektronsko marsikdaj celo bolje kot fizično. Po drugi strani pa se je pospešil razpad družbenih vrednot, zaupanje v znanost in razvoj. Celo Flynnov učinek – merjenje IQ na svetovni populaciji – kaže, da ljudje ne postajajo čedalje bolj pametni. Nasprotno - čedalje več ljudi verjame, da je Zemlja ploščata, da bo cepivo za korono škodljivo, ali da je korona škodljiva kot navadna gripa (v resnici je desetkrat bolj). Razkorak med rastočim znanjem in vraževerjem se povečuje. Letos smo v multikonferenco povezali osem odličnih neodvisnih konferenc. Zajema okoli 160 večinoma spletnih predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic in 300 obiskovalcev. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad – seveda večinoma preko spleta. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 44-letno tradicijo odlične znanstvene revije.
Multikonferenco Informacijska družba 2020 sestavljajo naslednje samostojne konference: • Etika in stroka • Interakcija človek računalnik v informacijski družbi • Izkopavanje znanja in podatkovna skladišča • Kognitivna znanost • Ljudje in okolje • Mednarodna konferenca o prenosu tehnologij • Slovenska konferenca o umetni inteligenci • Vzgoja in izobraževanje v informacijski družbi Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju. V 2020 bomo petnajstič podelili nagrado za življenjske dosežke v čast Donalda Michieja in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejela prof. dr. Lidija Zadnik Stirn. Priznanje za dosežek leta pripada Programskemu svetu tekmovanja ACM Bober. Podeljujemo tudi nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono je prejela »Neodzivnost pri razvoju elektronskega zdravstvenega kartona«, jagodo pa Laboratorij za bioinformatiko, Fakulteta za računalništvo in informatiko, Univerza v Ljubljani. Čestitke nagrajencem! Mojca Ciglarič, predsednik programskega odbora Matjaž Gams, predsednik organizacijskega odbora FOREWORD INFORMATION SOCIETY 2020 The 23rd Information Society Multiconference (http://is.ijs.si) was halved due to COVID-19. The multiconference survived thanks to the conference presidents who bravely decided to continue with their conferences despite the first pandemic of the modern era. The COVID-19 pandemic did not slow the remarkable growth of ICT, the information society, artificial intelligence and science overall; quite the contrary – suddenly most activities had to be performed through ICT, and this often proved to be even more efficient than the old physical way. But COVID-19 did accelerate the decline of societal values and of trust in science and progress. Even the Flynn effect – measuring IQ all over the world – indicates that the average Earthling is becoming less smart and knowledgeable. Contrary to the general belief of scientists, the number of people believing that the Earth is flat is growing. A large number of people are wary of the COVID-19 vaccine and consider the consequences of COVID-19 to be similar to those of a common flu, despite it being empirically observed to be ten times worse. The Multiconference is running parallel sessions with around 160 presentations of scientific papers at eight conferences, many round tables, workshops and award ceremonies, and 300 attendees. Selected papers will be published in the Informatica journal with its 44-year tradition of excellent research publishing. The Information Society 2020 Multiconference consists of the following conferences: • Cognitive Science • Data Mining and Data Warehouses • Education in Information Society • Human-Computer Interaction in Information Society • International Technology Transfer Conference • People and Environment • Professional Ethics • Slovenian Conference on Artificial Intelligence The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e.
the Slovenian chapter of the ACM, SLAIS, DKZ and the second national engineering academy, the Slovenian Engineering Academy. In the name of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contribution and their interest in this event, and the reviewers for their thorough reviews. For the fifteenth year, the award for life-long outstanding contributions will be presented in memory of Donald Michie and Alan Turing. The Michie-Turing award was given to Prof. Dr. Lidija Zadnik Stirn for her life-long outstanding contribution to the development and promotion of information society in our country. In addition, a recognition for current achievements was awarded to the Program Council of the competition ACM Bober. The information lemon goes to the “Unresponsiveness in the development of the electronic health record”, and the information strawberry to the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana. Congratulations! Mojca Ciglarič, Programme Committee Chair Matjaž Gams, Organizing Committee Chair ii KONFERENČNI ODBORI CONFERENCE COMMITTEES International Programme Committee Organizing Committee Vladimir Bajic, South Africa Matjaž Gams, chair Heiner Benking, Germany Mitja Luštrek Se Woo Cheon, South Korea Lana Zemljak Howie Firth, UK Vesna Koricki Olga Fomichova, Russia Marjetka Šprah Vladimir Fomichov, Russia Mitja Lasič Vesna Hljuz Dobric, Croatia Blaž Mahnič Alfred Inselberg, Israel Jani Bizjak Jay Liebowitz, USA Tine Kolenik Huan Liu, Singapore Henz Martin, Germany Marcin Paprzycki, USA Claude Sammut, Australia Jiri Wiedermann, Czech Republic Xindong Wu, USA Yiming Ye, USA Ning Zhong, USA Wray Buntine, Australia Bezalel Gavish, USA Gal A. Kaminka, Israel Mike Bain, Australia Michela Milano, Italy Derong Liu, Chicago, USA prof. Toby Walsh, Australia Programme Committee Mojca Ciglarič, chair Andrej Gams Vladislav Rajkovič Bojan Orel, co-chair Matjaž Gams Grega Repovš Franc Solina, Mitja Luštrek Ivan Rozman Viljan Mahnič, Marko Grobelnik Niko Schlamberger Cene Bavec, Nikola Guid Špela Stres Tomaž Kalin, Marjan Heričko Stanko Strmčnik Jozsef Györkös, Borka Jerman Blažič Džonova Jurij Šilc Tadej Bajd Gorazd Kandus Jurij Tasič Jaroslav Berce Urban Kordeš Denis Trček Mojca Bernik Marjan Krisper Andrej Ule Marko Bohanec Andrej Kuščer Tanja Urbančič Ivan Bratko Jadran Lenarčič Boštjan Vilfan Andrej Brodnik Borut Likar Baldomir Zajc Dušan Caf Janez Malačič Blaž Zupan Saša Divjak Olga Markič Boris Žemva Tomaž Erjavec Dunja Mladenič Leon Žlajpah Bogdan Filipič Franc Novak iii iv KAZALO / TABLE OF CONTENTS Odkrivanje znanja in podatkovna skladišča (SiKDD) / Data Mining and Data Warehouses (SiKDD) ................ 1 PREDGOVOR / FOREWORD ................................................................................................................................. 3 PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ..................................................................................... 4 A Dataset for Information Spreading over the News / Sittar Abdul, Mladenić Dunja, Erjavec Tomaž ................... 5 Learning to fill the slots from multiple perspectives / Zajec Patrik, Mladenić Dunja ............................................... 9 Knowledge graph aware text classification / Petrželková Nela, Škrlj Blaž, Lavrač Nada .................................... 
13 EveOut: Reproducible Event Dataset for Studying and Analyzing the Complex Event-Outlet Relationship / Swati, Erjavec Tomaž, Mladenić Dunja ..... 17 Ontology alignment using Named-Entity Recognition methods in the domain of food / Popovski Gorjan, Eftimov Tome, Mladenić Dunja, Koroušič Seljak Barbara ..... 21 Extracting structured metadata from multilingual textual descriptions in the domain of silk heritage / Massri M.Besher, Mladenić Dunja ..... 25 Hierarchical classification of educational resources / Žunič Gregor, Novak Erik ..... 29 Are You Following the Right News-Outlet? A Machine Learning based approach to outlet prediction / Swati, Mladenić Dunja ..... 33 MultiCOMET – Multilingual Commonsense Description / Mladenić Grobelnik Adrian, Mladenić Dunja, Grobelnik Marko ..... 37 A Slovenian Retweet Network 2018-2020 / Evkoski Bojan, Mozetič Igor, Ljubešić Nikola, Kralj Novak Petra ..... 41 Toward improved semantic annotation of food and nutrition data / Jovanovska Lidija, Panov Panče ..... 45 Absenteeism prediction from timesheet data: A case study / Zupančič Peter, Mileva Boshkoska Biljana, Panov Panče ..... 49 Monitoring COVID-19 through text mining and visualization / Massri M.Besher, Pita Costa Joao, Andrej Bauer, Grobelnik Marko, Brank Janez, Luka Stopar ..... 53 Usage of Incremental Learning in Land-Cover Classification / Peternelj Jože, Šircelj Beno, Kenda Klemen ..... 57 Predicting bitcoin trend change using tweets / Jelenčič Jakob ..... 61 Large-Scale Cargo Distribution / Stopar Luka, Bradeško Luka, Jacobs Tobias, Kurbašić Azur, Cimperman Miha ..... 65 Amazon forest fire detection with an active learning approach / Čerin Matej, Kenda Klemen ..... 69 Indeks avtorjev / Author index ..... 73 Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2020 Zvezek C Proceedings of the 23rd International Multiconference INFORMATION SOCIETY – IS 2020 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 5. oktober 2020 / 5 October 2020 Ljubljana, Slovenia PREDGOVOR Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale.
Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskem procesiranju ampak tudi analitskim vpogledom v podatke – pojavilo se je t.i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line-Analytical-Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore in na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz. z drugimi besedami to, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve. FOREWORD Data driven technologies have significantly progressed after mid 90’s. The first phases were mainly focused on storing and efficiently accessing the data, resulted in the development of industry tools for managing large databases, related standards, supporting querying languages, etc. After the initial period, when the data storage was not a primary problem anymore, the development progressed towards analytical functionalities on how to extract added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line-Analytical-Processing entered as a usual part of a company’s information system portfolio, requiring from the user to set well defined questions about the aggregated views to the data. Data Mining is a technology developed after year 2000, offering automatic data analysis trying to obtain new discoveries from the existing data and enabling a user new insights in the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis, Data, Text and Multimedia Mining, Semantic Technologies, Link Detection and Link Analysis, Social Network Analysis, Data Warehouses. 3 PROGRAMSKI ODBOR / PROGRAMME COMMITTEE Janez Brank, Department of Artificial Intelligence, Jožef Stefan Institute, Ljubljana Marko Grobelnik, , Department of Artificial Intelligence, Jožef Stefan Institute, Ljubljana Branko Kavšek, University of Primorska, Koper Aljaž Košmerlj, Qlector, Ljubljana Dunja Mladenić, Department of Artificial Intelligence, Jožef Stefan Institute, Ljubljana Inna Novalija, Department of Artificial Intelligence, Jožef Stefan Institute, Ljubljana Luka Stopar, Sportradar, Ljubljana 4 A Dataset for Information Spreading over the News Abdul Sittar Dunja Mladenić Tomaž Erjavec Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia abdul.sittar@ijs.si dunja.mladenic@ijs.si tomaz.erjavec@ijs.si ABSTRACT Table 1: List of events Analysing the spread of information related to a specific event in Selected events Other events (ordered by popularity) the news has many potential applications. 
Consequently, various Football Basketball, Baseball, Boxing, Tennis, Cycling systems have been developed to facilitate the analysis of infor- Earthquake Floods, Tsunamis, Landslides, Hurricane, Volcanic eruptions mation spreading, such as detection of disease propagation and Global warming CO2 emissions, Chemical consumption identification of the spreading of fake news through social media. The paper proposes a method for tracking information spread over news articles. It works by comparing subsequent articles via limited availability of datasets containing news text and metadata cosine similarity and applying a threshold to classify into three including time, place, source and other relevant information. classes: “Information-Propagated”, “Unsure” and “Information- When a piece of information starts spreading, it implicitly not-Propagated”. There are several open challenges in the process raises questions such as: of discerning information propagation, among them the lack of (1) How far does the information in the form of news reach resources for training and evaluation. This paper describes the out to the public? process of compiling corpus from the Event Registry global me- (2) Does the content of news remain the same or changes to dia monitoring system. We focus on information spreading in a certain extent? three domains: sports (i.e. the FIFA World Cup), natural disas- (3) Do the cultural values impact the information especially ters (i.e. earthquakes), and climate change (i.e. global warming). when the same news will get translated in other languages? This corpus is a valuable addition to currently available dataset This paper presents a corpus that focuses on information to examine the spreading of information about various kind of spreading over news and that hopes to answer some of the above events. questions (This corpus is published as an online resource at ). We present the use of a news repository to produce a corpus KEYWORDS and then analyze information propagation. We present a novel Datasets, Information propagation, News articles methodology for automatically assembling the corpus for this problem and validate it in three different domains. We focused 1 INTRODUCTION on a combination of rich- and low resource European languages, Information spreading has received significant attention due to in particular English, Portuguese, German, Spanish, and Slovene. its various market applications such as advertisement. did the in- Three different types of events are targeted in the data collection formation about a specific product reach to the public of a specific procedure to potentially involve different information spreading region? This could be one of the significant research questions. behaviors in our society. These events are sports (FIFA World Research in this area considers influential factors in the process Cup, 2,695 articles), natural disasters (earthquakes, 3,194 articles), of information spreading such as the economic condition of a and climate change (global warming, 1,945 articles). The three specific area related to how textual or visual content is helping to types of events were chosen based on their popularity and diver- advertise a product. Information spreading analytics can also be sity. A list of sub-events was observed from top websites related used in shaping policies, e.g., in media companies to understand to the three events and we selected those which were the most if there is a need to improve the content before publishing it. 
popular in the countries with the selected national languages. For Health organizations may be interested to know the patterns of sports, a list of countries with their national sports was fetched spreading of a cure for a certain disease. Environmental scien- and then filtered for national language1, 2. Based on popularity, tists are perhaps attentive to see whether spread of news about we selected the FIFA world cup. Similarly, for natural disasters, climate changes inside the country is similar to what is being lists of natural disasters were collected by country taking the na- reported internationally. tional language into account, for instance, for Slovenia we looked Domain-specific gaps in information spreading are ubiquitous, for this country in the natural disaster category on Wikipedia3. and may exist due to economic conditions, political factors, or Earthquakes4 and global warming5 were found to be the most linguistic, geographical, time-zone, cultural and other barriers. prevalent, thus a dataset for each was collected. Table 1 shows the These factors potentially contribute to obstructing the flow of selected events and other related events ordered by prevalence. local as well as international news. We believe that there is a lack The paper makes the following contributions to science: of research studies which examine, identify and uncover the rea- (1) a novel methodology to collect a domain-specific corpus sons for barriers in information spreading. Additionally, there is from news repository; (2) semantic similarity between news articles; Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and 1http://www.quickgs.com/countries-and-their-national-sports/ the full citation on the first page. Copyrights for third-party components of this 2https://www.topendsports.com/ work must be honored. For all other uses, contact the owner/author(s). 3https://en.wikipedia.org/wiki/Category:Natural_disasters_in_Slovenia Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia 4https://en.wikipedia.org/wiki/List_of_earthquakes_in_2020 © 2020 Copyright held by the owner/author(s). 5https://www.theguardian.com/environment/2011/apr/21/countries-responsible- climate-change, 6 5 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Abdul Sittar, Dunja Mladenić, and Tomaž Erjavec (3) an annotated dataset encoding the level of information spreading from an article. The rest of the paper is organized as follows: in Section 2 we discuss prior work about information spreading; in Section 3 we describe the data collection methodology; Section 4 describes semantic similarity and dataset annotation; and Section 5 gives the conclusions. 2 RELATED WORK Information spreading is prevalent in our society. It plays a vi- tal part in tasks that encompass the spreading of innovations [9], effects in marketing [6], and opinion spreading [4]. News spreading provides information to consumers that can be used for decision making and potentially contribute to shaping na- tional and international policies. There are several types of media Figure 1: Data collection methodology involved, such as print media, broadcast, and internet media. In- ternet is considered as a building block for connecting individuals worldwide, while news reflects current significant events for peo- ple [7]. 
Apart from news, online social media proved to be a remarkable alternative to support information spreading in an emergency [8, 5]. Social connection plays a vital role in news spreading. Especially the structure of network reflecting who is connected to whom, crucially increases the proportion of in- formation spreading. Network structure analysis comes with a hypothesis related to the strength of the connections, namely that information will spread further in a situation where there exist many weak connections rather than clusters of strong [2]. While, in general, there are not many dataset that would help in modelling information spreading, there are some corpora for detecting the spreading of information about diseases [3] and fake news in social media [10]. There is currently no multilingual dataset of news articles for analysis of information propagation composed from a variety of event-centric information such as Figure 2: Articles with metadata sports, natural disasters, and climate changes. This provides ad- ditional motivation for our work. Table 2: Statistics about dataset 3 DATA COLLECTION METHODOLOGY Dataset Domain Event type Articles per Language Total Articles Eng Spa Ger Slv Por In order to collect news originating from different sources, in 1 Sports FIFA World Cup 983 762 711 10 216 2682 2 Natural Disaster Earthquake 941 999 937 19 251 3147 different languages, and targeting diverse events, we used Event 3 Climate Changes Global Warming 996 298 545 8 97 1944 Registry, a platform that identifies events by collecting related articles written in different languages from tens of thousands of news sources [9]. Using Event Registry APIs 7, we fetched a list This service uses a page-rank based method to identify a coherent of articles about each event in the following languages: English, set of relevant concepts from Wikipedia [1]. We retrieved a list Spanish, German, Portuguese, and Slovenian. Figure 1 shows the of Wikipedia concepts for each article. After representing each data collection process. article with a list of Wikipedia concepts, the tf-idf score was com- Each article was parsed from the JSON response and stored in puted using the popular machine learning library Scikit-Learn9. CSV files. Each article was connected with the available list of Using the same library, cosine similarity was calculated between relevant information such as the language of the article, event tf-idf representation of news articles across all five languages. type, publisher, title, date, and time. Figure 2 shows the metadata In the process of computing similarity between the articles, for of articles. each article we calculated its cosine similarity to all other articles The number of collected articles in each domain varies consid- and stored the results in a CSV file. The results were then sorted erably, and also varies across the languages within each domain. based on the publishing time of articles and we kept only the cal- Table 2 shows statistics about each dataset. culations of similarity to articles that are published later that the article in hands. Since we are interested in information propaga- 4 SEMANTIC SIMILARITY BETWEEN NEWS tion, we do not need to compare an article to those articles which ARTICLES have been published before it. As a result, we had a multiple similarity score for each article where each score show the simi- We have represented the cross-lingual news articles by monolin- larity with other articles. 
Cosine similarity varies between zero gual (English) Wikipedia concepts using the Wikifier service8. and one, zero meaning no similarity and one meaning maximum 7https://github.com/EventRegistry/event-registry-python/blob/master/ similarity, i.e., a duplicate article. eventregistry/examples/QueryArticlesExamples.py 8http://wikifier.org/info.html 9https://scikit-learn.org/stable/ 6 A Dataset for Information Spreading over the News Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Table 3: Selected articles for evaluation Domains Percentage of correctly labelled pairs Global Warming 100% Earthquake 93% FIFA World Cup 100 % for Portuguese, German, Slovene and Spanish to translate them into English. Evaluation results shown that the annotation was significantly related to information spreading. Articles in the "Information- Propagated" class show that most articles were an exact or para- phrased copy of each other, with some articles published within few hours after each other. Articles in the "Unsure" class were Figure 3: Class distribution for all domains typically also relevant to the event but involved extra and dif- ferent discussions. Lastly, in the third class "Information-Not- Propagated", articles involved only keywords related to event but discussion was about other topics. Moreover, here the gap in the 4.1 Dataset annotations publishing time was quite large. The results of the semantic similarity calculation were in the form of a table where rows shown the list of articles and columns shown the corresponding similarity score in the range 0..1 with 5 CONCLUSIONS all the other articles. This similarity score was calculated using This paper proposed a methodology and explained the process cosine between TF-IDF representation of news articles (See Sec- of data collection from a news repository to provide a corpus tion ??). First, we excluded those articles which had scored 1.0, for event-centric information propagation between news articles. as they were considered as a copy of the article. We then, for This corpus covers three domains and each dataset corresponds each article, chose an article which had the highest similarity to one event type (FIFA World Cup, Earthquake, and Global score to it from the list of all articles. After performing this step, Warming). The corpus is available to others for the evaluation we had one similarity score for each article which shows either of techniques for information spreading as it allows the analysis that the information spread to a certain extent (if >0) or not (if of cross-lingual news articles published by different publishers 0). To decide about the class label whether the information is located geographically in different places. spreading or not, we divided the scores into three intervals. The In the future, we plan to add more attributes to each dataset. first is Similarity ≥ 0.7, the second is 0.7 > Similarity ≥ 0.4, For instance, for now, we only know the publisher of a news and the third is Similarity < 0.4. Articles that have scores in article but in the future, we would like to include the publisher the first interval were labeled as "Information-Propagated". The profile and the economic condition of a country from where the second interval was considered as unclear whether the informa- information is published. Also, we plan to apply and evaluate tion from the article propagated or not such articles were labeled different techniques to analysis information propagation barriers. as "Unsure". 
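For illustration, the similarity-and-labelling step described in Section 4.1 can be sketched as follows. The sketch assumes each article has already been mapped to a list of English Wikipedia concept labels by the Wikifier service; the article records, timestamps and concept lists below are invented for the example, and the helper function is illustrative rather than part of the authors' released resource.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative input: one record per article, already wikified.
# "concepts" stands for the English Wikipedia concept labels returned by
# the Wikifier service, "time" for the publishing timestamp.
articles = [
    {"id": "a1", "time": "2018-06-14T16:00", "concepts": ["FIFA World Cup", "Russia", "Football"]},
    {"id": "a2", "time": "2018-06-14T18:30", "concepts": ["FIFA World Cup", "Football", "Moscow"]},
    {"id": "a3", "time": "2018-06-15T09:00", "concepts": ["Earthquake", "Indonesia"]},
]

# Represent every article as a tf-idf vector over its concept labels.
docs = [" ".join(a["concepts"]) for a in articles]
tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

def label_article(i):
    """Assign one of the three spreading labels to article i."""
    # Compare only with articles published *after* article i and drop
    # exact copies (similarity of 1.0), as described in Section 4.1.
    later = [j for j in range(len(articles))
             if articles[j]["time"] > articles[i]["time"] and sims[i, j] < 1.0]
    if not later:
        return None
    best = max(sims[i, j] for j in later)
    if best >= 0.7:
        return "Information-Propagated"
    if best >= 0.4:
        return "Unsure"
    return "Information-not-Propagated"

for i, article in enumerate(articles):
    print(article["id"], label_article(i))
```

Restricting the comparison to later-published articles keeps the maximum similarity score aligned with the direction in which information can actually propagate.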
The lowest interval was considered as a signal for no propagation and labeled "Information-not-Propagated". For 6 ACKNOWLEDGEMENTS instance, low similarity can be of an article about a sports ground which mentions the population of the city and another article This work was supported by the Slovenian Research Agency and that discusses the population itself. We have manually examined the project leading to this publication has received funding from concepts of articles in each class. Figure 3 shows the distribu-the European Union’s Horizon 2020 research and innovation tion of class labels in FIFA World Cup, Earthquake, and Global programme under the Marie Skłodowska-Curie grant agreement Warming dataset respectively. No 812997. REFERENCES 4.2 Evaluation of dataset [1] Janez Brank, Gregor Leban, and Marko Grobelnik. 2017. Each article was annotated with a label based upon the similarity Annotating documents with relevant wikipedia concepts. score threshold of each article with other articles (See Section In Proceedings of Slovenian KDD Conference on Data Mining 4.1). For evaluation of the dataset we have checked the content of and Data Warehouses (SiKDD). the corresponding articles which were responsible for a specific [2] Damon Centola. 2010. The spread of behavior in an online class label. We performed the evaluation of labelling by manually social network experiment. science, 329, 5996, 1194–1197. inspecting a subset of pairs of articles. If a pair, for instance, were [3] Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. labelled as "Information-Propagated" then two articles should Covid-19: the first public coronavirus twitter dataset. arXiv have text discussing more or less the same event, both in mono- preprint arXiv:2003.07372. and cross-lingual settings. [4] David Liben-Nowell and Jon Kleinberg. 2008. Tracing in- We have randomly chosen 10 articles with their corresponding formation flow on a global scale using internet chain-letter articles considering all languages in each class and in each dataset. data. Proceedings of the national academy of sciences, 105, In this way, we have manually checked 180 articles. Table 3 shows 12, 4633–4638. these pairs of articles for evaluation in each dataset. We scanned each article manually for all languages, using Google Translator 7 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Abdul Sittar, Dunja Mladenić, and Tomaž Erjavec [5] Kees Nieuwenhuis. 2007. Information systems for crisis crisis informatics: study of 2013 oklahoma tornado. Trans- response and management. In International Workshop on portation Research Record, 2459, 1, 110–118. Mobile Information Technology for Emergency Response. [9] Duncan J Watts and Peter Sheridan Dodds. 2007. Influen- Springer, 1–8. tials, networks, and public opinion formation. Journal of [6] Everett M Rogers. 2010. Diffusion of innovations. Simon consumer research, 34, 4, 441–458. and Schuster. [10] Zilong Zhao, Jichang Zhao, Yukie Sano, Orr Levy, Hideki [7] Sandeep Suntwal, Susan Brown, and Mark Patton. 2020. Takayasu, Misako Takayasu, Daqing Li, Junjie Wu, and How does information spread? an exploratory study of Shlomo Havlin. 2020. Fake news propagates differently true and fake news. In Proceedings of the 53rd Hawaii In- from real news even at early stages of spreading. EPJ Data ternational Conference on System Sciences. Science, 9, 1, 7. [8] Satish V Ukkusuri, Xianyuan Zhan, Arif Mohaimin Sadri, and Qing Ye. 2014. 
Use of social media data to explore 8 Learning to fill the slots from multiple perspectives Patrik Zajec Dunja Mladenič patrik.zajec@ijs.si dunja.mladenic@ijs.si Jožef Stefan Institute and Jožef Stefan International Jožef Stefan Institute and Jožef Stefan International Postgraduate School Postgraduate School Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT Furthermore, since the set of topics is not fixed and could expand We present an approach to train the slot-filling system in a fully over time, such a slot filling system should be able to adapt quickly automatic, semi-supervised setting on a limited domain of events to fill new slots and ideally should not be limited to the English from Wikipedia using the summaries in different languages. We language. use the multiple languages and the different topics of the events We believe that annotation work can be greatly minimized to provide several alternative views on the data. Our experiments if we rely on our limited domain to identify and annotate only show how such an approach can be used to train the multilingual informative examples and use the additional assumptions to prop- slot-filling system and increase the performance of a monolingual agate these labels. We also believe that simultaneous training of system. the system on multiple topics can be advantageous, as we can introduce additional supervision on the common slots and use KEYWORDS distinct slots as a source of negative examples. In this work we use Wikipedia and Wikidata [9] as the source information extraction, slot filling, machine learning, probabilis-of data. We treat the Wikidata entities that have the point-in-time tic soft logic property specified as events and summary sections of Wikipedia articles about the entity in different languages as news articles. 1 INTRODUCTION Each entity belongs to a single topic and we adopt the subset of This paper is addressing the slot filling task that aims to extract topic-specific properties as slot keys. An automatic exact match- the structured knowledge from a given set of documents using a ing of such values from Wikidata with named entities from model trained for a specific domain and the associated slots. For Wikipedia articles is rarely successful. We use the successful example, within a news article reporting on an earthquake, the and unambiguous matches as a set of labeled seed examples. task is to detect the earthquake’s magnitude, the number of peo- We formulate the task as a semi-supervised learning problem ple injured, the location of the epicentre and other information. [8] where the set of base learners is trained iteratively, starting We refer to those as a set of slot keys or slots, to their exact values with a small seed set of labeled examples and a larger set of unla-as a slot values and to the named entities from the documents beled examples. In each iteration, the most confident predictions corresponding to those values as target entities. on the examples from unlabeled set are used to increase the train- Slot filling is closely related to the task of relation extraction [1] ing set by assigning pseudo-labels. We introduce an additional and can be seen as a kind of unary relation extraction. Both tasks component which combines the confidences of multiple base can be formulated as classification and are usually approached learners for each example. 
by first training a classifier with a sentence and tagged entities at To the best of our knowledge, we are the first to use the limited the input and the prediction of relation or slot key as the output. domain of news events, which allows the additional assumptions, As there is a large number of relations between entities that such as the connection between slots of different topics and the we might be interested in detecting, there is also a large num- redundancy of reporting in multiple languages, to first train and ber of slot keys we seek the slot value for. In order to avoid the later boost the performance of a slot-filling system. resource-intensive process of annotating a large number of exam- The contributions of this paper are the following: ples for each possible slot/relation and to increase the flexibility • we combine the data from Wikidata and Wikipedia to of training procedures beyond the straight-forward supervised setup a learning and evaluation scenario that mimics the learning, many alternative approaches have been proposed, such learning on news events and articles, as bootstrapping [4], distant supervision [6] and self supervision • we demonstrate how simultaneous learning on multiple [5]. topics and languages can be used not only to train the As stated both tasks can be performed for different types of multilingual slot-filling system, but to also improve the documents. We limit our focus to news events on multiple topics performance of a monolingual system, (such as natural disasters and terrorist attacks), taking the articles • we show how an inference component can be used to com- reporting about events as the documents. Since the number of bine predictions from multiple base learners to improve news topics is large, and consequently so is the number of slots, the pseudo-labeling step of the semi-supervised learning we would like to minimize the need for manual annotations. process. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or 2 METHODOLOGY distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this 2.1 Problem Definition work must be honored. For all other uses, contact the owner/author(s). Given a collection of topics T (such as earthquakes, terrorist Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia attacks, etc.), where each topic 𝑡 has its own set of slot keys S , © 2020 Copyright held by the owner/author(s). 𝑡 the goal is to automatically extract values from the relevant texts 9 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Patrik Zajec and Dunja Mladenič to fill in the slots. For example, the members of S of the XLM Roberta model [3] using the implementation from 𝑒𝑎𝑟 𝑡 ℎ𝑞𝑢𝑎𝑘𝑒𝑠 are number of injured, magnitude and location. For each topic the Transformers 2 library. Note that the representation of each 𝑡 there is a set of events E , each of which took place at some entity remains fixed throughout the learning process because we 𝑡 point in time and was reported by several documents in different have found that the representation is expressive enough for our languages. purposes and it speeds up the training between iterations. 
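A minimal sketch of the masked-entity representation from Section 2.3 is given below, using the Hugging Face Transformers implementation of XLM-RoBERTa. The "base" checkpoint, the example sentence and the mean pooling over mask positions are assumptions for the sketch; the exact variant and pooling used by the authors are not specified in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Pre-trained multilingual encoder (the paper uses XLM-RoBERTa via
# Transformers; the "base" checkpoint here is an assumption).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def entity_representation(sentence: str, entity: str) -> torch.Tensor:
    """Replace the entity mention with the mask token and return the
    encoder's hidden state at the mask position(s)."""
    masked = sentence.replace(entity, tokenizer.mask_token, 1)
    enc = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    mask_positions = enc["input_ids"][0] == tokenizer.mask_token_id
    # Average over mask positions (the pooling choice is an assumption).
    return hidden[mask_positions].mean(dim=0)               # (dim,)

# Hypothetical sentence; the entity "370" is hidden from the encoder,
# so only its context is captured in the vector.
vec = entity_representation(
    "The earthquake injured 370 people in Mexico City.", "370")
print(vec.shape)   # e.g. torch.Size([768]) for the base model
```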
Also The values of all or at least most slot keys (or slots) from S are note that since the entity is masked, it is not directly captured in 𝑑 represented in each of the documents as named entities, which the representation. we also refer to as target entities. We say most of the slots, since it is possible that an earthquake caused no casualties. It is also 2.4 Selecting the topics possible that some of the documents do not report about the Our assumption is that training the system to detect the slots on number of casualties as it may be too early to know if there were multiple topics simultaneously can provide additional benefits. any. In addition, the documents might contain different values for For two topics ′ 𝑡 and 𝑡 there is potentially a set of common slots the same slot key, as for example, the reported number of people and a set of topic-specific slots. injured by an earthquake can increase over time. There may also For slot ′ 𝑠 which appears in both topics the base learner trained be several different mentions of the same slot in a particular on ′ 𝑡 can be used to make predictions for examples from 𝑡 . By document, as for example one magnitude might refer to an actual combining predictions from learners trained on ′ 𝑡 and 𝑡 , we could earthquake that the event is about, while the other magnitude get a better estimate of the true labels of the examples. might refer to an earthquake that struck the same region years For the slot 𝑠, which is specific to the topic 𝑡 , all examples from ago. the topic ′ 𝑡 can be used as negative examples. Selecting reliable Our task is actually a two step process. In the first step, the negative examples from the same topic is not easy, as we may goal is to train a system capable of identifying the target entities inadvertently mislabel some of the positive examples. for a set of slot keys from the context, which in our case is limited to a single sentence. Such a system is not yet able to recognise 2.5 Using multiple languages the true value for a given slot if there are multiple different candidates, such as selecting the actual magnitude from several Articles from different languages offer in some ways different reported magnitude values. The goal of the second step is to views on the same event. The slot values we are trying to detect assign a single correct value to each of the slot keys. We assume should appear in all the articles, as they are highly relevant to that inferring the correctness of a value is a document-level task, the event. since it requires a broader context. Solving the first step is a kind The values for slots such as location and time should be con- of prerequisite for the second step, so we focus on it in this paper. sistent across all articles, whereas this does not necessarily apply to other slots such as the number of injured or the number of 2.2 Overview of the proposed method casualties. Matching such values across the articles is therefore not a trivial task, and although a variant of soft matching can be The system is trained iteratively and starts with a noisy seed set, performed, we leave it for the future work and limit our focus which grows larger with pseudo-labeled positive and negative only on the values that can be matched unambiguously. examples. Each of the base learners is trained on the set of la- We can combine the predictions of several language-specific beled examples from the topic (or multiple topics) and language base learners into a single pseudo-label for entities that can be assigned to it. 
The prediction probabilities for each of the unla- matched across the articles. beled examples are determined by combining the probabilities of all base learners. This is done either by averaging or by feeding 2.6 Assigning pseudo labels the probabilities as approximations of the true labels into the component, which attempts to derive the true value for each ex- Each iteration starts with a set of labeled examples 𝑋 , a set of 𝑙 ample and the error rates for each learner [7]. The examples with unlabeled examples 𝑋 and a set of base learners trained on 𝑋 . 𝑢 𝑙 probabilities above or below the specific thresholds are given a Base learners are simple logistic regression classifiers that use pseudo-label and added to the training set. vector representations of entities as features and classify each The seed set is constructed by matching the slot values ob- example 𝑥 as a target entity for the slot key 𝑠 or not. 𝑠 tained from Wikidata with named entities found in Wikipedia Each base learner ¯ 𝑓 is a binary classifier trained on the la- 𝑡 ,𝑙 articles for each event. There are only a handful of unambigu- beled data for the slot key 𝑠 from the topic 𝑡 and the language ous matches for each slot key, which are labeled as a positive 𝑙 . Such base learners are topic-specific as they are trained on a examples, while the negative examples are all other named en- single topic 𝑠 𝑡 . Base learners ¯ 𝑓 are trained on the labeled data 𝑙 tities from the articles in which they appeared. Figure 1 shows for the slot key 𝑠 from the language 𝑙 and all the topics with the a high-level overview of the proposed methodology. The entire slot key 𝑠. Such base learners are shared across topics, as they workflow is repeated in each iteration until no new examples are consider the examples from all the topics as a single training set. selected for pseudo-labelling. We use the classification probability of the positive class instead of hard labels, ¯𝑠 ¯𝑠 𝑓 (𝑥 ), 𝑓 (𝑥 ) ∈ [0, 1]. 𝑡 ,𝑙 𝑙 2.3 Representing the entities For each entity 𝑥 from a news article with the language 𝑙 Each named entity together with its context forms a single ex- reporting on the event 𝑒 from the topic 𝑡 we obtain the following ample. We annotate each article and extract the named entities predictions: with Spacy 1. To capture the context, we compute the vector • ¯𝑠 ′ 𝑓 ( and all such that ′ , that ′ 𝑥 ) for each 𝑠 ∈ S 𝑡 𝑠 ∈ S 𝑡 𝑡 𝑡 ,𝑙 representation of each entity by replacing it with a mask token is the probability that 𝑥 is a target entity for the slot key and feeding the entire sentence through a pre-trained version 1https://spacy.io/ 2https://huggingface.co/transformers/ 10 Learning to fill the slots from multiple perspectives Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Figure 1: High-level overview of the proposed methodology. 𝑠 , where 𝑠 is a slot key from the topic 𝑡 , using the topic- We have collected the Wikipedia articles and Wikidata in- specific base learner trained on examples from the same formation of 913 earthquakes from 2000 to 2020 in 6 different language on the topic ′ 𝑡 that also has the slot key 𝑠, languages, namely English, Spanish, German, French, Italian and • ¯𝑠 𝑠 𝑓 ( ( and for each Dutch. 
We have manually annotated the entities of 85 English ′ 𝑥 ) which equals ¯ 𝑓 ′ 𝑦) for each 𝑠 ∈ S𝑡 𝑡 ,𝑙 𝑡 ,𝑙 language ′ 𝑙 such that there is an article reporting about articles using the slot keys number of deaths, (number of injured the same event 𝑒 in that language and contains an entity and magnitude, which serve as a labeled test set and are not in- 𝑦 which is matched to 𝑥 , cluded in the training process. In addition, we have collected the • ¯𝑠 𝑓 (𝑥 ) for each 𝑠 ∈ S , using the shared base learner, which data of 315 terrorist attacks from 2000 to 2020 with the articles 𝑡 𝑙 is on examples from all topics ′ from the same 6 languages. 𝑡 that have the slot key 𝑠. Predictions from multiple base learners for each 𝑥 and 𝑠 are 3.2 Evaluation Settings combined as a weighted average to obtain a single prediction 𝑠 The evaluation for each approach is performed on the labeled 𝑓 (𝑥 ). The weight of each base learner ¯ 𝑓 is determined by its error rate English dataset, where 76 entities are labeled as number of deaths, 𝑒 ( ¯ 𝑓 ) which is estimated using an approach from [7] using both unlabeled and labeled examples. This is done by introducing 45 as number of injured and 125 as magnitude. The threshold the following logical rules (referred to as ensemble rules in [7]) values for the pseudo-labeling are set to 𝑇 = 0.6 and 𝑇 = 0.05. 𝑝 𝑛 for each of the base learners ¯𝑠 The approaches differ by the subset of base learners used to form 𝑓 predicting for 𝑥: ¯ the combined prediction and by the weighting of the predictions. 𝑠 𝑠 𝑠 ¯𝑠 𝑠 𝑠 𝑓 (𝑥 ) ∧ ¬𝑒 ( ¯ 𝑓 ) → 𝑓 (𝑥 ), 𝑎𝑛𝑑 , 𝑓 (𝑥 ) ∧ 𝑒 ( ¯ 𝑓 ) → ¬𝑓 (𝑥 ), Single or multiple languages. In single language setting, only ¬ ¯𝑠 𝑠 𝑠 𝑠 𝑠 𝑠 𝑓 (𝑥 ) ∧ ¬𝑒 ( ¯ 𝑓 ) → ¬𝑓 (𝑥 ), 𝑎𝑛𝑑 , ¬ ¯ 𝑓 (𝑥 ) ∧ 𝑒 ( ¯ 𝑓 ) → 𝑓 (𝑥 ). English articles are used to extract the entities and train the base The truth values are not limited to Boolean values, but instead learners. In the multi-language setting, all available articles are represent the probability that the corresponding ground predicate used and the entities are matched across the articles from the or rule is true. For a detailed explanation of the method we refer same event. the reader to [7]. We introduce a prior belief that the predictions of base learners are correct via the following two rules: Single or multiple topics. In the single topic setting only the examples from the earthquake topic are used. In the multi-topic ¯𝑠 𝑠 𝑠 𝑠 𝑓 (𝑥 ) → 𝑓 (𝑥 ), 𝑎𝑛𝑑 , ¬ ¯ 𝑓 (𝑥 ) → ¬𝑓 (𝑥 ). setting, the examples from terrorist attacks are used as negative Since each examples for the slot key magnitude, the base learners for the 𝑥 can be target entity for at most one slot key, we introduce a mutual exclusion rule: slot keys number of deaths and number of injured are combined as described in the section 2.6. ¯ ′ 𝑠 𝑠 𝑠 𝑓 (𝑥 ) ∧ 𝑓 (𝑥 ) → 𝑒 ( ¯ 𝑓 ). Uniform or estimated weights. In the uniform setting all pre- The rules are written in the syntax of a Probabilistic soft logic dictions of the base learners contribute equally, while in the [2] program, where each rule is assigned a weight. We assign estimated setting the weights of the base learners are estimated a weight of 1 to all ensemble rules, a weight of 0.1 to all prior using the approach described in the section 2.6. belief rules and a weight of 1 to all mutual exclusion rules. The inference is performed using the PSL framework 3. 
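The full ensemble inference above is carried out with the PSL framework; the sketch below shows only the simpler variant in which the base learners' positive-class probabilities for one (entity, slot key) pair are combined as a uniformly or error-rate weighted average and then thresholded with T_p = 0.6 and T_n = 0.05. All learner outputs, error rates and the "1 - error rate" weighting are invented for the example and are not the authors' exact implementation.

```python
import numpy as np

T_P, T_N = 0.6, 0.05   # pseudo-labelling thresholds reported in the paper

def combine(probs, error_rates=None):
    """Combine P(target entity) from several base learners into one score."""
    probs = np.asarray(probs, dtype=float)
    if error_rates is None:                      # uniform weighting
        weights = np.ones_like(probs)
    else:                                        # down-weight unreliable learners
        weights = 1.0 - np.asarray(error_rates, dtype=float)
    return float(np.average(probs, weights=weights))

def pseudo_label(probs, error_rates=None):
    """Return +1 / -1 pseudo-label, or None when the combined score falls
    between the two thresholds and the example stays unlabeled."""
    score = combine(probs, error_rates)
    if score >= T_P:
        return 1
    if score <= T_N:
        return -1
    return None

# Three hypothetical base learners (e.g. topic-specific, cross-lingual and
# shared) voting on whether an entity fills the slot "number of injured".
print(pseudo_label([0.82, 0.71, 0.64], error_rates=[0.1, 0.2, 0.3]))  # -> 1
```

Examples receiving +1 or -1 are added to the labeled set for the next iteration, while the remaining examples stay in the unlabeled pool.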
As we obtain 3.3 Results and discussion the approximations for all 𝑥 ∈ 𝑋 , we extend the set of positive 𝑢 examples for each slot 𝑠 The results of all experiments are summarized in the table 1. Since 𝑠 with all 𝑥 such that 𝑓 (𝑥 ) >= 𝑇 and 𝑝 the set of negative examples with all 𝑠 the test set is limited to the topic earthquake and English, only a 𝑥 such that 𝑓 (𝑥 ) <= 𝑇 , 𝑛 for predefined thresholds subset of base learners was used to make the final predictions. We 𝑇 and 𝑇 . 𝑝 𝑛 report the average value of precision, recall and F1 across all slot 3 EXPERIMENTS keys. The threshold of 0.5 was used to round the classification probabilities. 3.1 Dataset Single iteration. Approaches in which base learners are trained To evaluate the proposed methodology, we have conducted ex- on the initial seed set for a single iteration achieve higher preci- periments on two topics: earthquakes and terrorist attacks. sion with the cost of a lower recall. We observe that they distin- 3https://psl.linqs.org/ guish almost perfectly between the slots from the seed set and 11 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Patrik Zajec and Dunja Mladenič Table 1: Results of all experiments. The column Single iteration reports the results of approaches where base learners were trained on the seed set only. Results where base learners were trained in the semi-supervised setting with different weightings of the predictions are reported in the columns Uniform weights and Estimated weights. The values of precision, recall and F1 are averaged over all slot keys. Single iteration Uniform weights Estimated weights Model P R F1 P R F1 P R F1 Single language, single topic 0.94 0.64 0.76 0.83 0.75 0.77 0.84 0.76 0.79 Multiple languages, single topic 0.94 0.64 0.76 0.82 0.74 0.76 0.83 0.75 0.77 Single language, multiple topics 0.91 0.76 0.83 0.83 0.83 0.83 0.86 0.83 0.84 Multiple languages, multiple topics 0.93 0.76 0.83 0.82 0.83 0.82 0.84 0.84 0.84 produce almost no false positives. Using one or more languages REFERENCES has almost no effect on the averaged scores when the number [1] Nguyen Bach and Sameer Badaskar. 2007. A Survey on Re- of topics is fixed. When using multiple topics, a higher recall is lation Extraction. Technical report. Language Technologies achieved without a significant decrease in precision. All incorrect Institute, Carnegie Mellon University. classifications of the slot number on injured are actually examples [2] Stephen H Bach, Matthias Broecheler, Bert Huang, and of the number of missing slot that is not included in our set and Lise Getoor. 2017. Hinge-loss markov random fields and likewise almost all incorrect classifications for the slot magnitude probabilistic soft logic. The Journal of Machine Learning are examples of the slot intensity on the Mercalli scale. This could Research, 18, 1, 3846–3912. easily be solved by expanding the set of slot keys and shows how [3] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav important it is to learn to classify multiple slots simultaneously. Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Semi-supervised. Approaches in which base learners are trained Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. iteratively trade precision in order to significantly improve recall. 2019. Unsupervised cross-lingual representation learning Most of the loss of precision is due to misclassification between at scale. arXiv preprint arXiv:1911.02116. 
slots number of deaths and number of injured, similar as the exam- [4] Tianyu Gao, Xu Han, Ruobing Xie, Zhiyuan Liu, Fen Lin, ple "370 people were killed by the earthquake and related building Leyu Lin, and Maosong Sun. 2020. Neural snowball for collapses, including 228 in Mexico City, and more than 6,000 were few-shot relation learning. In Proceedings of AAAI. injured." where 228 was incorrectly classified as number of injured [5] Xu ming Hu, Lijie Wen, Y. Xu, Chenwei Zhang, and Philip S. and not the number of deaths. The use of multiple topics reduces Yu. 2020. Selfore: self-supervised relational feature learning misclassification between these slots and further improves the for open relation extraction. ArXiv, abs/2004.02438. recall as new contexts are discovered by the base learners trained [6] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. on terrorist attacks. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Uniform and estimated weights. Using the estimated error rates Annual Meeting of the ACL and the 4th International Joint as weights for the predictions of base learners shows a slight Conference on Natural Language Processing of the AFNLP, improvement in performance. It may be advantageous to estimate 1003–1011. multiple error rates for topic-specific base learners, as they tend to [7] Emmanouil Platanios, Hoifung Poon, Tom M Mitchell, and be more reliable in predicting examples from the same topic. We Eric J Horvitz. 2017. Estimating accuracy from unlabeled believe that more data and experimentation is needed to properly data: a probabilistic logic approach. In Advances in Neural evaluate this component. A major advantage is its flexibility, Information Processing Systems, 4361–4370. since we can easily incorporate prior knowledge of the slots or [8] Jesper E Van Engelen and Holger H Hoos. 2020. A survey additional constraints on the predictions in the form of logical on semi-supervised learning. Machine Learning, 109, 2, 373– rules. 440. [9] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a 4 CONCLUSION AND FUTURE WORK free collaborative knowledgebase. Communications of the We presented an approach for training the slot-filling system ACM, 57, 10, 78–85. which can benefit from large amounts of data from Wikipedia. The experiments were performed on a relatively small dataset and show that the proposed direction seems promising. However, the right test of our approach would be to apply it to a much larger number of topics and events, which will be done in the immediate next step. Furthermore, the current approach needs to be evaluated in more detail. ACKNOWLEDGMENTS This work was supported by the Slovenian Research Agency and NAIADES European Unions project under grant agreement H2020-SC5-820985. 12 Knowledge graph aware text classification Nela Petrželková∗ Blaž Škrlj Nada Lavrač Jožef Stefan Institute Jožef Stefan Institute and Jožef Stefan Institute Ljubljana, Slovenia Jožef Stefan Int. Postgraduate School Ljubljana, Slovenia nela.petrzelkova@seznam.cz Ljubljana, Slovenia nada.lavrac@ijs.si blaz.skrlj@ijs.si ABSTRACT (2) The proposed method is extensively empirically evaluated, Knowledge graphs are becoming ubiquitous in many scientific indicating that the proposed semantic feature construc- and industrial domains, ranging from biology, industrial engi- tion aids the classification performance on many real-life neering to natural language processing. In this work we explore datasets. 
how one of the largest currently available knowledge graphs, the (3) The implemented method is freely available3 with a simple-Microsoft Concept Graph, can be used to construct interpretable to-use, scikit-learn API. features that are of potential use for the task of text classification. The paper is structured as follows. Section 2 presents the By exploiting graph-theoretic feature ranking, introduced as part background and related work. Section 3 presents the proposed of the existing tax2vec algorithm, we show that massive, real-life approach to semantic feature construction using the information knowledge graphs can be used for the construction of features, from a given knowledge graph. Section 4 describes the experi-derived from the relational structure of the knowledge graph mental setting and the results, followed by a summary and further itself. To our knowledge, this is one of the first approaches that work in Section 5. explores how interpretable features can be constructed from the Microsoft Concept graph with more than five million concepts 2 BACKGROUND AND RELATED WORK and more than 80 million IsA relations for the task of text classi- In text classification tasks, characterized by short documents fication. The proposed solution was evaluated on eight real-life or small amounts of documents, deep learning methods are fre- text classification data sets. quently outperformed by more standard approaches, including SVMs [4]. In such settings, it was shown that approaches capa-KEYWORDS ble of using semantic context may outperform the naïve learn- knowledge graphs, text classification, feature construction, se- ing approaches, the examples are among other based on Latent mantic enrichment Dirichlet Allocation [5], Latent Semantic Analysis [6] or word embeddings [7], which is referred to as first-level context. 1 INTRODUCTION Second-level context can be introduced by adding background Text classification is the process of assigning labels to text accord- knowledge into a learning process, which may help to increase ing to its content. It is one of the fundamental tasks in Natural performance and improve interpretability. Usage of knowledge Language Processing (NLP) with various applications such as graphs also helped in classification with extending neural net- spam detection, topic labeling, sentiment analysis, news catego- work based lexical word embedding objective function [8]. El-rization and many more [1]. In recent years, knowledge graphs— hadad et al. [9] present an ontology-based web document, while real-life graph-structured sources of knowledge—are becoming Kaur et al. [10] propose a clustering-based algorithm for docu-an interesting source of background knowledge, potentially use- ment classification that also benefits from knowledge stored in ful in contemporary machine learning [2]. Knowledge graphs, the underlying ontologies. Use of hypernym-based features was such as DBPedia1 or the Microsoft Concept Graph2 span tens of performed already in e.g., the Ripper rule learning algorithm [11]. millions of triplets of the form subject-predicate-object, and in- Wang and Domeniconi [12] used the derived background knowl-clude many potentially interesting relations, from which a given edge from Wikipedia for text enriching. In short document clas- machine learning algorithm can potentially benefit. 
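As a tiny, self-contained illustration of the subject-predicate-object structure mentioned above (the triplets and the helper function below are made-up examples, not taken from DBpedia or the Microsoft Concept Graph):

```python
# Knowledge-graph triplets of the form (subject, predicate, object); values are illustrative.
triplets = [
    ("strawberry", "IsA", "fruit"),
    ("fruit", "IsA", "food"),
    ("tennis", "IsA", "sport"),
]

def hypernyms_of(term, triples):
    """Direct hypernyms of a term, i.e. objects reached through IsA edges."""
    return [obj for subj, pred, obj in triples if subj == term and pred == "IsA"]

print(hypernyms_of("strawberry", triplets))   # ['fruit']
```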
sification, it was shown that the tax2vec algorithm (described In this work we propose an approach to scalable feature con- below) can help those classifiers gain better results by adding struction from one of the largest freely available knowledge extra semantic knowledge to the feature vectors. graphs, and demonstrate its utility on multiple real life data sets. The tax2vec [3] is an algorithm for semantic feature construc-The main contributions of this work are as follows: tion that can be used to enrich the feature vectors constructed by the established text processing methods such as the tf-idf. It (1) We propose an extension to the tax2vec [3] algorithm for takes as input a labeled or unlabeled corpus of documents and a semantic feature construction, adapting it to operate with word taxonomy, i.e. a directed graph to which parts of a given real-life knowledge graphs comprised of tens of millions document map to. It outputs a matrix of semantic feature vectors of triplets. where each row represents a semantics-based vector representa- 1https://wiki.dbpedia.org/ tion of one input document. It makes it by mapping the words 2https://concept.research.microsoft.com/Home/Introduction from the document to a given taxonomy, WordNet or in this work Microsoft Concept Graph, by which it creates the collection of Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or terms for each document and from it, a corpus taxonomy—a rela-distributed for profit or commercial advantage and that copies bear this notice and tional structure specific to the considered document space. The the full citation on the first page. Copyrights for third-party components of this terms presented in the corpus taxonomy represent the potential work must be honored. For all other uses, contact the owner/author(s). Information society ’20, October 5–9, 2020, Ljubljana, Slovenia features. © 2020 Copyright held by the owner/author(s). 3https://github.com/SkBlaz/tax2vec 13 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Petrželková et al. 3 KNOWLEDGE GRAPH-BASED SEMANTIC Table 1: Part of the Microsoft Concept Graph. The row is FEATURE CONSTRUCTION in form of hypernym - hyponym - frequency of relation Semantic features are constructed as follows. With the help of social network facebook 4987 spaCy library [13], we first find nouns in each document in the symptom fever 4966 corpus and for every noun we find all hypernyms in the associ- sport tennis 4964 ated knowledge graph. Next, we add the most frequent 𝑛 such fruit strawberry 4824 hypernyms to the document-based taxonomy (the number in activity fishing 4789 the third column in Table 1). We identified this step as critical, feature construction, how the text is being processed prior to as the crawl-based knowledge graphs are commonly noisy, and that and how are semantic features used after that. prunning out uncertain relations is of high relevance. After per- forming this for all documents in the corpus, document-based 3.2 Microsoft Concept Graph taxonomies are concatenated into corpus-based taxonomy. Next, we perform feature selection, discussed next. We are using Microsoft Concept Graph4 [15] [16] for obtaining the extra semantic information. This large relational graph con-3.1 Feature selection sists of more than 5.4 million concepts that are a part of more than 80 million triplets. 
The Microsoft Concept Graph was created by harnessing billions of web pages, so it is very general and varied, offering a lot of knowledge to add to the text we want to classify. It contains mostly IsA relations, which is the part we use to obtain hypernyms for nouns in the input text and to enrich the feature vectors with some of them. A part of the downloaded knowledge graph is shown in Table 1. The number in the third column is the count of times the relation was found when creating the knowledge graph, i.e. the frequency of the relation's occurrence. We removed relations that had a frequency of one, which immediately reduced the graph to approximately half its size and removed mostly noisy relations. Later we used the NetworkX library [17] to transform the Microsoft Concept Graph from bare text into a directed graph. This step makes the subsequent exploitation of the knowledge graph easier.

During feature selection we choose a predefined number of features from the constructed feature set, with the goal of selecting the most useful or important ones. Hence, from the set of hypernyms constructed from the knowledge graph, we choose only the top d features (d being the dimension of the semantic space) based on one of the heuristics described below. Closeness centrality of a node is a measure of centrality in a network, calculated as

C(x) = \frac{1}{\sum_{y} d(y, x)},

where d(y, x) is the distance (path length) between vertices x and y. The bigger the closeness centrality value of a given node, the closer it is to all other nodes. The rarest terms are the most document-specific and are more likely to provide more information than the frequently occurring ones. Hence this heuristic simply takes the overall counts of all the hypernyms, sorts them in ascending order by their frequency of occurrence and takes the top d. The mutual information between two discrete random variables represented as vectors X_i (the i-th hypernym feature) and Y (the target binary class) is defined as

MI(X_i, Y) = \sum_{x, y \in \{0, 1\}} p(X_i = x, Y = y) \log_2 \frac{p(X_i = x, Y = y)}{p(X_i = x)\, p(Y = y)},

where p(X_i = x) and p(Y = y) correspond to the marginal distributions of the joint probability distribution of X_i and Y. Tax2vec computes the mutual information (MI) between all hypernym features and a given class, so for each target class a vector of mutual information scores is obtained, corresponding to the MI between individual hypernym features and that class. The MI scores for each target class are then summed up to obtain the final scoring vector.

3.3 Proposed approach extending tax2vec
Firstly, we tokenize each document and assign part-of-speech tags to the tokens with the help of the spaCy library [13]. Then, for each noun in the text, we find its hypernyms in the knowledge graph. The number of hypernyms kept per noun is a parameter chosen by the user; we choose the hypernyms with the highest frequencies of the relation between the current noun and the hypernym. As shown later in the paper, a bigger number of hypernyms does not help much but increases execution time significantly, so it is more sensible to choose a smaller number. We then create a document-based taxonomy, a directed graph with a hypernym-noun edge for each such hypernym and noun. Finally, we merge the document-based taxonomies into one corpus-based taxonomy, maintaining unique nodes.
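A minimal sketch of the construction and ranking steps just described, using spaCy and NetworkX as in the text; the toy IsA relations, function names and the particular NetworkX calls are illustrative assumptions rather than the released tax2vec code, and a small English spaCy model is assumed to be installed.

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")      # small English model, assumed installed

# Toy IsA knowledge graph: noun -> list of (hypernym, relation frequency).
ISA = {"strawberry": [("fruit", 4824)], "tennis": [("sport", 4964)],
       "fever": [("symptom", 4966)], "facebook": [("social network", 4987)]}

def document_taxonomy(text, max_hypernyms=10):
    """Directed graph with a hypernym -> noun edge for each kept IsA relation."""
    g = nx.DiGraph()
    for token in nlp(text):
        if token.pos_ == "NOUN":
            candidates = sorted(ISA.get(token.lemma_.lower(), []),
                                key=lambda pair: pair[1], reverse=True)
            for hypernym, _ in candidates[:max_hypernyms]:
                g.add_edge(hypernym, token.lemma_.lower())
    return g

corpus = ["She plays tennis and eats a strawberry.",
          "The child had a fever after the tennis match."]
# Corpus-based taxonomy: union of the document taxonomies with unique nodes.
corpus_taxonomy = nx.compose_all([document_taxonomy(d) for d in corpus])

# Two unsupervised ranking heuristics over the corpus taxonomy.
closeness = nx.closeness_centrality(corpus_taxonomy)
pagerank = nx.pagerank(corpus_taxonomy, alpha=0.85)   # a personalization dict can bias the walk towards document terms
top_d = sorted(closeness, key=closeness.get, reverse=True)[:3]
print(top_d)
```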
The features are sorted by MI scores in Graph method in the pseudocode) and on it we perform one of descending order and the first 𝑑 features are chosen as the final the above mentioned heuristics to choose the best 𝑑 hypernyms. semantic space. The personalized PageRank algorithm takes Those steps are outlined in Algorithm 1. as an input a network and a set of starting nodes in the network and returns a vector assigning a score to each node. The scores 4 EXPERIMENTS AND RESULTS are calculated as the stationary distribution of the positions of a random walker that starts its walk on one of the starting nodes This section presents the setting of the experiments and the data and, in each step, either randomly jumps from a node to one of sets on which the experiments were conducted. We also describe its neighbors (with probability the metrics used to estimate classification performance. 𝑝 ) or jumps back to one of the starting nodes (with probability 1-𝑝). In our experiments prob- ability 4.1 Data sets 𝑝 was set to 0.85. The tax2vec exploits the idea initially introduced in [14], where personalized PageRank scores are com-We conducted the experiments on eight different data sets, which puted w.r.t. the terms, present throughout the document space. are described below. They were chosen intentionally from differ- This way, a graph-based, completely unsupervised ranking is ent domains and the basic information about them can be seen obtained, and is used in similar manner to other feature selection in Table 2. heuristics discussed in the previous paragraphs. In this section we introduce how the knowledge graph is used for semantic 4https://concept.research.microsoft.com/ 14 Knowledge graph aware text classification Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Data: corpus, knowledgeGraph, maxHypernyms some cases. We compare those results to the classification without corpusTaxonomy = [ ]; any semantic features which is plotted as a grey horizontal line. foreach 𝑑𝑜𝑐 ∈ 𝑐𝑜𝑟𝑝𝑢𝑠 do On the other hand, on the datasets CNN News, Medical Relation documentTaxonomy = [ ]; and SMS Spam we didn’t see any improvement with the addition 𝑡 𝑜𝑘𝑒𝑛𝑠 = tokenize(𝑑𝑜𝑐 ); of semantic features. Figure 2 shows the relation between feature foreach 𝑡𝑜𝑘𝑒𝑛 ∈ 𝑡𝑜𝑘𝑒𝑛𝑠 do space size and the execution times. if 𝑡𝑜𝑘𝑒𝑛 is 𝑛𝑜𝑢𝑛 then edges = knowledgeGraph.edgesFrom(𝑡𝑜𝑘𝑒𝑛); foreach 𝑒𝑑𝑔𝑒 ∈ 𝑒𝑑𝑔𝑒𝑠 do if 𝑙𝑒𝑛(𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝑇 𝑎𝑥𝑜𝑛𝑜𝑚𝑦) >= 𝑚𝑎𝑥 𝐻 𝑦𝑝𝑒𝑟 𝑛𝑦𝑚𝑠 then break; documentTaxonomy.add(𝑒𝑑𝑔𝑒 ∈ 𝑒𝑑𝑔𝑒𝑠) corpusTaxonomy.mergeGraph(documentTaxonomy) featureSelection(corpusTaxonomy) Result: Selected semantic features Algorithm 1: Semantic feature construction. Table 2: Data sets used for evaluation of knowledge graph’s extra features impact on learning. Data set Classes Words Unique w. Documents PAN 2017 Gender 2 5169966 607474 3600 PAN 2017 Age 5 992742 185713 402 SMSSpam 2 86910 15691 5571 CNN-news 7 1685642 159463 2107 MedicalRelation 18 1136326 66235 22176 Articles 20 5524333 178443 19990 SemEval2019 2 295354 39319 13240 Yelp 5 1298353 88539 10000 PAN 2017 (Gender) Given a set of tweets per user, the task is to predict the user’s gender [18]. PAN 2017 (Age) Given a set of tweets per user, the task is to predict the user’s age group [19]. CNN News Given a news article (composed of a number of paragraphs), the task is to assign to it a topic from a list of topic categories. [20]. SMS Spam Given a SMS message, the task is to predict whether it is a spam or not. [21]. 
Medical Relations Given an article with biomedical topic, the task is to predict the relationship between the medical terms annotated. [22]. SemEval 2019 Given a tweet, the task is to predict whether it contains offensive content [23]. Articles Given an web article, the goal is to assign to it a topic. [24]. Yelp Given an review of a restaurant, the goal is to predict the ranking from one to five stars. Settings. In all the datasets the stop words were removed. Stop words are for example "the", "is", "are" etc. There is no uni-Figure 1: Results of text classification on data sets Yelp, versal list of stop words in NLP research, however we used NLTK pan-2017-age, pan-2017-gender, CNN News, SMSSpam, Se- (Natural Language Toolkit) [25] for filtering stop words. The doc-mEval 2019, Medical Relation and Articles with execution uments were tokenized with the help of spaCy’s NLP tool. The times as the numbers in the plot. data sets were divided into 90% training data and 10% test data by using random splits. Number of hypernyms for each noun was 10. We used linear SVM classifier for classification and 𝐹1 5 CONCLUSION measure for performance. We showed that information from a large, real-life knowledge graph can improve text classification. Our approach aims at short 4.2 Results texts like tweets, shorter articles, messages and similar. We firstly Figure 1 shows that on some datasets (namely Yelp, PAN 2017 Age, process the document with spaCy, find nouns with their corre-PAN 2017 Gender and on SemEval 2019 and Articles) the extra sponding hypernyms, from which we create a taxonomy and semantic features constructed from the knowledge graph help in from that we later choose the most helpful features with one 15 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Petrželková et al. [6] T. K. Landauer. 2006. Latent semantic analysis. [7] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. [n. d.] Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26. [8] A. Celikyilmaz, D. Hakkani-Tür, P. Pasupat, and R. Sarikaya. 2015. Enriching word embeddings using knowledge graph for semantic tagging in conversational dialog systems. In. [9] M. K. Elhadad, K. M. Badran, and G. I. Salama. 2018. A novel approach for ontology-based feature vector genera- tion for web text document classification. [10] R. Kaur and M. Kumar. 2018. Domain Ontology Graph Approach Using Markov Clustering Algorithm for Text Classification. Advances in Intelligent Systems and Com- puting, 632. [11] S. Scott and S. Matwin. 1998. Text classification using WordNet hypernyms. In Usage of WordNet in Natural Lan- guage Processing Systems. [12] P. Wang and C. Domeniconi. 2008. Building semantic ker- Figure 2: Results of text classification on data sets SMSS- nels for text classification using wikipedia. In (August pam and SemEval 2019 with execution times as the num- 2008). bers in the plot. [13] M. Honnibal and I. Montani. spaCy 2: natural language un- derstanding with Bloom embeddings, convolutional neu- of the heuristics. The result remains interpretable, which is an ral networks and incremental parsing. To appear, (2017). advantage of this approach. This approach could be potentially [14] J. Kralj, M. Robnik-Sikonja, and N. Lavrac. 2019. Netsdm: improved by performing some type of word sense disambigua- semantic data mining with network analysis. 
Journal of tion and by finding objects in texts, which consists of more than Machine Learning Research, 20, 32, 1–50. one word. Further, other knowledge graphs can be used for the [15] J. Cheng, Z. Wang, J.-R. Wen, J. Yan, and Z. Chen. 2015. hypernym search. Also, because the hypernym search in each Contextual text understanding in distributional semantic document is independent, the documents can be processed in par- space. In ACM International Conference on Information and allel; however, such processing can be memory-intensive, which Knowledge Management (CIKM). is to be addressed. [16] W. Wu, H. Li, H. Wang, and K. Q. Zhu. 2012. Probase: a probabilistic taxonomy for text understanding. In ACM In- ACKNOWLEDGMENTS ternational Conference on Management of Data (SIGMOD). The work of BŠ was financed via a junior research grant (ARRS). [17] A. A. Hagberg, D. A. Schult, and P. J. Swart. 2008. Ex- This paper is supported by European Union’s Horizon 2020 re- ploring network structure, dynamics, and function using search and innovation programme under grant agreement No. networkx. In Proceedings of the 7th Python in Science Con- 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less- ference, 11 –15. Represented Languages in European News Media). The authors [18] F. Rangel, P. Rosso, M. Potthast, and B. Stein. [n. d.] Overview acknowledge also the financial support from the Slovenian Re- of the 5th author profiling task at pan 2017: gender and search Agency for research core funding for the programme language variety identification in twitter. Knowledge Technologies (No. P2-0103), the project TermFrame [19] F. Rangel, P. Rosso, B. Verhoeven, W. Daelemans, M. Pot- - Terminology and Knowledge Frames across Languages (No. thast, and B. Stein. 2016. Overview of the 4th author pro- J6-9372) and the ARRS ERC complementary grant SDM-Open. filing task at pan 2016: cross-genre evaluations. [20] M. Qian and C. Zhai. 2014. Unsupervised feature selection REFERENCES for multi-view clustering on text-image web news data, 1963–1966. [1] K. Kowsari, K. J. Meimandi, M. Heidarysafa, S. Mendu, [21] T. A. Almeida and J. M. G. Hidalgo. 2011. Sms spam col- L. E. Barnes, and D. E. Brown. 2019. Text classification lection v. 1. http : / / www . dt . fee . unicamp . br / ~tiago / algorithms: A survey. CoRR, abs/1904.08067. smsspamcollection/. (2011). [2] Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge [22] 2015. Medical information extraction. https://appen.com/ graph embedding: a survey of approaches and applications. datasets / medical - sentence - summary - and - relation - IEEE Transactions on Knowledge and Data Engineering. extraction/. (2015). [3] 2020. Tax2vec: constructing interpretable features from [23] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, taxonomies for short text classification. Computer Speech and R. Kumar. 2019. Predicting the Type and Target of & Language. Offensive Posts in Social Media. In Proceedings of NAACL. [4] F. Rangel, P. Rosso, M. Potthast, and B. Stein. 2017. Overview [24] 2019. Text classification 20. https : / / www. kaggle. com / of the 5th author profiling task at pan 2017: gender and guiyihan/text-classification-20. (2019). language variety identification in twitter. Working Notes [25] S. Bird, E. Klein, and E. Loper. 2009. Natural Language Papers of the CLEF. Processing with Python. O’Reilly Media. [5] D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent dirichlet allocation. 
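Returning briefly to the mutual-information heuristic of Section 3.1, the scoring over several target classes can be sketched in a few lines; the matrices are made-up examples, and scikit-learn's mutual_info_score uses natural logarithms rather than base 2, which rescales the scores but leaves their ranking unchanged.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Binary document-by-hypernym feature matrix and one 0/1 label column per target class (toy values).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1]])
Y = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

# MI between every hypernym feature i and every class c, then summed over the classes.
mi = np.array([[mutual_info_score(X[:, i], Y[:, c]) for c in range(Y.shape[1])]
               for i in range(X.shape[1])])
scores = mi.sum(axis=1)

d = 2
top_d = np.argsort(scores)[::-1][:d]    # indices of the d best hypernym features
print(scores, top_d)
```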
16 EveOut: Reproducible Event Dataset for Studying and Analyzing the Complex Event-Outlet Relationship Swati Tomaž Erjavec Dunja Mladenić swati@ijs.si tomaz.erjavec@ijs.si dunja.mladenic@ijs.si Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan International Jožef Stefan International Jožef Stefan International Postgraduate School Postgraduate School Postgraduate School Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT relationship and impact of different features on the selection of events by the outlets. We present a dataset consisting of 77, 545 news events collected between January 2019 and May 2020. We selected the top five 1.1 Contributions news outlets based on Alexa Global Rankings and retrieved all the events reported in English by these outlets using the Event The paper makes the following three contributions to science: Registry API. Our dataset can be used as a resource to analyze • The dataset generation scripts, which provide a structured and learn the relationship between events and their selection reproducible approach to building a publicly available by the outlets. It is primarily intended to be used by researchers dataset of news events with varied features. This will not studying bias in event selection. However, it may also be used to only speed up the development of future versions of Eve- study the geographical, temporal, categorical and several other Out, but will also help to create custom datasets with the aspects of the events. We demonstrate the value of the resource desired outlets and features. in developing novel applications in the digital humanities with • The compilation of EveOut, a novel dataset with a rich motivating use cases. Website with additional details is available range of event features and spanning multiple news cate- at http:// cleopatra.ijs.si/ EveOut/ . gories. • Identification of possible use cases intended to facilitate KEYWORDS the creation of tools to improve digital journalism and to Dataset, News Event Analysis, Event selection bias, News cover- help researchers study the complex relationship between age events and news outlets. 1 INTRODUCTION 2 DATASET News outlets are constantly faced with the task of selecting events Several news outlets may cover a single world event as a story in they will report on, dependent on the perceived interest of the a variety of different ways. A collection of one or more stories, all event to their readership. This can be driven by various factors, of which describe the same world event, is referred to as an ‘event’ such as the geographical origin of the event, involvement of in the entire paper. In the following subsections, we define our well-known persons, etc. Such selection requires monitoring of data generation process and provide statistics on the resulting current affairs to determine their news value for the outlet. dataset. Machine learning tools may help outlets to deal with the large numbers of events, help them explore strategies for selecting 2.1 Data Source publishable events, and build dedicated decision support systems We use Event Registry1[4] as the data source which monitors, for this task. The effectiveness of these systems depends on the collects, and provides news articles from news outlets around the availability of news event collections complemented by relevant world in over 30 languages. 
It also identifies the major incidents reported in the articles and aggregates them into clusters known as events. For example, "missiles launched by Iran at US forces in Iraq" is an event reported across the globe in over 3,200 news articles.

To construct an event, Event Registry follows a series of steps. News aggregation is the first step, in which RSS feeds are constantly monitored for new articles. The next major step is semantic event information extraction, which retrieves information from the articles in a structured way to be used in subsequent steps. Clustering algorithms are then used to group articles that describe the same event. In the last step, the article clusters are marked as events and are annotated with rich metadata such as a unique id to track the event coverage, the categories to which it may belong, geographical location, sentiment, etc. As a result, its extensive temporal coverage can be used effectively to study the complex correlation between events and news outlets.

1 https://eventregistry.org

Such collections should be complemented by relevant event details such as date, category, country of occurrence, a brief description, etc. In this paper we introduce EveOut, the first large publicly available data set of 77,545 English news events with a variety of features collected between January 2019 and May 2020. It includes events in eight different categories of news, i.e. business, politics, technology, environment, health, science, sports, and arts-and-entertainment. We hope that EveOut will encourage publishers and others involved in the news production process to develop tools to enhance digital journalism. The data set would also allow researchers from digital humanities to study and analyze the relationship and the impact of different features on the selection of events by the outlets.

[Figure 1: EveOut dataset generation process. Pipeline: Select Outlets (e.g. top 5 global newspapers) -> Set Time Constraint (e.g. 2019-01-01 to 2020-05-31) -> Generate Event List (e.g. eng-4500343) -> Extract Event Info (id, date, title, summary, ...) -> Generate Outlet Label (0 = not covered, 1 = covered) -> EveOut event-outlet dataset.]

Table 1: Description of the dataset attributes.
uri: a unique event identifier
title: title of the event in English
event_date: date in yyyy-mm-dd format
sentiment: event sentiment
categories: event categories
loc_country: country where the event occurred
loc_continent: continent where the event occurred
total_article_count: total number of articles published
article_count: total number of articles published in English
summary: summary of the event
outlet_list: list of outlets that reported the event

2.2 Data Generation Process
To generate the dataset we adopted an automated approach, depicted in Figure 1. We use the Event Registry API to collect the event-related information listed in Table 1. The script is designed to simplify the release of future versions and to make it possible to replicate the process when generating custom datasets. The outlined process is the result of the resource's core requirement to best address the potential use cases referred to in Section 4.

For data generation, we first selected the top five news outlets based on the Alexa Global Rankings2. We then used an explicit temporal query Q_t to retrieve all events in all news categories from the Event Registry API. Q_t = {Q_text, Q_time} consists of the text component Q_text and the time component Q_time. Next, we set the time limit Q_time = [Q_sd, Q_ed] for extracting events that occurred within the specified time, where Q_sd = '2019-01-01' and Q_ed = '2020-05-31' signify the event's start and end dates. Since an outlet's event selection policy may change over time, we selected this time frame because recent data tends to be more reliable in predicting event coverage patterns. We then set Q_text = {Q_out, Q_lang, Q_cat}, where Q_out = {'nytimes', 'indiatimes', 'washingtonpost', 'usatoday', 'chinadaily'}, Q_lang = {'eng'}, and Q_cat = {'politics', 'business', 'sports', 'arts and entertainment', 'science', 'technology', 'health', 'environment'} represent the outlets, languages and news categories, respectively.

From the extracted event list, we first excluded events that were not covered by any of the selected outlets. We then extracted the individual outlets from each event's outlet list and created a column in the dataset to represent each of them. We use a binary scalar value to indicate whether the outlet covered the event or not. The event coverage by the outlets is not uniform, as can be visualized in Figure 2.

[Figure 2: Distribution of event coverage by the outlets (nytimes, chinadaily, indiatimes, usatoday, washingtonpost).]

2 https://www.alexa.com/topsites/category/Top/News/Newspapers

3 AVAILABILITY
The GitHub repository containing the scripts is available at https://github.com/Swati17293/EveOut. To facilitate discoverability and preservation, the full data set is archived as an online resource at https://doi.org/10.5281/zenodo.3953878. EveOut is available in three common formats (JSON, XML, and CSV) for direct download and use. The documentation meets the requirements of the FAIR Data principles3, with all necessary metadata defined. Under the Creative Commons Attribution 4.0 International license, it is freely available to make it reusable for almost any purpose. A separate web page with detailed statistics and illustrations, intended for in-depth analysis, can be found at http://cleopatra.ijs.si/EveOut/.

3 http://www.nature.com/articles/sdata201618/

3.1 Reusability
The resource is currently being used for individual projects and as a contribution to the project deliverables of the Marie Skłodowska-Curie CLEOPATRA Innovative Training Network4. A major part of this project aims to provide a temporal, cross-lingual analysis of concepts around different events, exploring how language impacts the mediatic narratives built by the media. It also aims to analyse news reporting bias and multiple media narratives, which would make it possible to filter out the appropriate information that will then be used to build information representation tools.

4 http://cleopatra-project.eu/
Since EveOut serves as the basis for the study and analysis Figure 4 reveals that instead of favoring events with neutral of events and their attributes, it is ideally suited to the project sentiment, outlets tend to favor events with positive sentiment. needs. In addition, event coverage by ‘usatoday’ and ‘washingtonpost’ is quite diverse with respect to sentiments. 4 POTENTIAL USE CASES 4.1 Examine Event-Selection Bias It is important for a journalist to know which event is worthy enough to be published. Even readers would be interested to know the factors that affect this selection. An automated solution can be devised using EveOut to provide an overview of the event and to visualize differences in coverage. 4.2 Outlet Prediction EveOut is designed to predict the likelihood of an event being covered by the outlet. It would enable the publishers of the outlets to assess the significance of the event. In addition, it may also be used by independent editors who prefer to report on events Figure 4: Distribution of event coverage by the outlets with covered by mainstream outlets. respect to sentiments. 5 STATISTICS AND ANALYSIS In this section we provide further information about the data In terms of the sentiments used in each category as plotted in contained in EveOut, focusing explicitly on the distribution of Figure 5, it is worth noting that ‘technology’ and ‘sports’ events events between the outlets. are mostly positive. With regard to the distribution of event categories covered by the outlets, as shown in Figure 3, ‘politics’ is the most common category, while ‘environment’ is the least common category. It is also worth noting that each outlet focuses on the different categories of events aside from ‘politics’. For instance, ‘india- times’ focuses more on events related to ‘arts and entertainment’, whereas ‘chinadaily’ tends to cover more ‘business’ related events. As far as the coverage of the event over time is concerned, it is also inconsistent as depicted in Figure 6. Furthermore, the event-coverage of ‘usatoday’ and ‘washingtonpost’ is slightly inconsistent. It is also interesting to note the sharp decline in coverage by ‘usatoday’ in ‘Aug 2019’ and by ‘washingtonpost’ in ‘May 2020’. The drop in the graph for washingtonpost in ‘May 2020 is due to its event preference. It is evident from washingtonpost’s radial graph in Figure 3 that its coverage is biased towards politics and sports. These two categories alone represent around 50% of events in the dataset. However, this percentage dropped to 40% in ‘May 2020 and, as a result, the coverage of washingtonpost dropped significantly. Increase of event coverage in ‘Mar 2019 is also attributed to the fact that about 56% of events were from Figure 5: Distribution of category over sentiments. these two categories. In nutshell, if the outlet favors a certain category of events and, in a specific time frame, and events of 19 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Swati, Tomaž Erjavec, and Dunja Mladenić Figure 6: Distribution of the event coverage by the outlets over time. 6 RELATED WORK ACKNOWLEDGMENTS There are a number of datasets that focus on news articles [7]. As This work was supported by the Slovenian Research Agency and far as the availability of event-centric datasets is concerned, there the European Union’s Horizon 2020 research and innovation is a scarcity of publicly available datasets. 
There are few related program under the Marie Skłodowska-Curie grant agreement No research on the event data [3, 1], but the extracted/generated 812997. datasets for the experiments is also not publicly accessible. GDELT [5] is the most popular, very large and publicly avail-REFERENCES able event-oriented news dataset. It contains data in multiple [1] Dylan Bourgeois, Jérémie Rappaz, and Karl Aberer. 2018. languages from a wide range of online publications. It’s collection Selection bias in news coverage: learning it, fighting it. In of world events is centered on location, network and temporal Companion Proceedings of the The Web Conference 2018, 535– attributes. There is no attribute defining the outlet list for the 543. event in the dataset. As a result, there is a lack of knowledge [2] Cindy Cheng, Joan Barceló, Allison Spencer Hartnett, Robert essential to the analysis of the event-outlet relationship that is Kubinec, and Luca Messerschmidt. 2020. Covid-19 govern- the foundation of our dataset. ment response event dataset (coronanet v. 1.0). Nature Hu- In addition, the existing event datasets [6, 2] are category-man Behaviour, 1–13. dependent (politics/healthcare/disaster etc.) which renders them [3] Felix Hamborg, Norman Meuschke, and Bela Gipp. 2018. useful for specific research purposes only. Therefore, by providing Bias-aware news analysis using matrix-based news aggre- a generalized event-centric news dataset, EveOut addresses the gation. International Journal on Digital Libraries, 1–19. stated dataset bottleneck. [4] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Gro- belnik. 2014. Event registry: learning about world events 7 CONCLUSIONS AND FUTURE WORK from news. In Proceedings of the 23rd International Confer- In this paper, we introduced the EveOut dataset, which covers ence on World Wide Web, 107–110. events reported by the top five global news outlets for over 17 [5] Kalev Leetaru and Philip A Schrodt. 2013. Gdelt: global data months. We have ensured that the dataset complies with the on events, location, and tone, 1979–2012. In ISA annual FAIR principles. In conjunction with the data set, we provide the convention. Volume 2, 1–49. source code for reproducing the dataset with varied features. [6] Clionadh Raleigh, Andrew Linke, Håvard Hegre, and Joakim For instance, it is possible to generate a reduced version of Eve- Karlsen. 2010. Introducing acled: an armed conflict location Out, focused on just one category, say ‘politics’. Specific outlets, and event dataset: special data feature. Journal of peace dates, and languages can also be specified in accordance with research, 47, 651–660. the requirements. We illustrate potential use cases to show how [7] Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, the dataset could be used to study the pattern of event coverage Tao Qi, Jianxun Lian, Danyang Liu, X. Xie, Jianfeng Gao, of an individual outlet and to predict whether or not the outlet Winnie Wu, and M. Zhou. 2020. Mind: a large-scale dataset will cover a specific event. Researchers from digital humanities for news recommendation. In Proceedings of the 58th Annual can also use it for an in-depth analysis of complex event-outlet Meeting of the Association for Computational Linguistics, relationships. In the future , we intend to extend the dataset to 3597–3606. doi: 10 . 18653 / v1 / 2020 . acl - main . 331. https : include events described in different languages. //www.aclweb.org/anthology/2020.acl- main.331. 
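To make the generation process of Section 2.2 concrete (and to show how a reduced, custom version of the dataset could be assembled, as mentioned in the conclusions), the following sketch builds the query constraints and the binary coverage columns; the dictionary layout, toy events and column names are illustrative assumptions, and the actual retrieval through the Event Registry API is not shown.

```python
import pandas as pd

# Query constraints mirroring Q_t = {Q_text, Q_time} from Section 2.2 (illustrative structure only).
OUTLETS = ["nytimes", "indiatimes", "washingtonpost", "usatoday", "chinadaily"]
query = {
    "time": {"start": "2019-01-01", "end": "2020-05-31"},
    "text": {
        "outlets": OUTLETS,
        "languages": ["eng"],
        "categories": ["politics", "business", "sports", "arts and entertainment",
                       "science", "technology", "health", "environment"],
    },
}

# Toy extracted event list; in the real pipeline this comes from the Event Registry API.
events = pd.DataFrame({
    "uri": ["eng-4500343", "eng-4500999"],
    "categories": ["politics", "sports"],
    "outlet_list": [["nytimes", "usatoday"], ["chinadaily"]],
})

# One binary column per outlet: 1 if the outlet covered the event, 0 otherwise.
for outlet in OUTLETS:
    events[outlet] = events["outlet_list"].apply(lambda lst: int(outlet in lst))

# Keep only events covered by at least one selected outlet; an additional category
# filter would produce a reduced version of the dataset (e.g. politics only).
events = events[events[OUTLETS].sum(axis=1) > 0]
print(events)
```

The same binary columns are what an outlet-prediction model (Section 4.2) would use as its targets.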
20 Ontology alignment using Named-Entity Recognition methods in the domain of food Gorjan Popovski1,2∗ , Tome Eftimov1 , Dunja Mladenić1,2 and Barbara Koroušić Seljak1,2 1Jožef Stefan Institute, 1000 Ljubljana, Slovenia 2Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia {gorjan.popovski, tome.eftimov, dunja.mladenic, barbara.korousic}@ijs.si Abstract Terminology-driven NER methods, also called dictionary- based NER methods [Zhou et al., 2006], match text phrases In recent years, a great amount of research has against concept synonyms that exist in the terminological re- been done in predictive modeling in the domain sources (dictionaries). The main disadvantage of these meth- of healthcare. Such research is facilitated by the ods is that only the entity mentions that exist in the resources existence of various biomedical vocabularies and will be recognized, but the benefit of using them is related to standards which play a crucial role in understand- the frequent updates of the terminological resources with new ing healthcare information. In addition, the Unified concepts and synonyms. Medical Language System (UMLS) links together Rule-based NER methods [Hanisch et al., 2005] use regu-biomedical vocabularies to enable interoperability. lar expressions that combine information from terminological However, in the food domain such resources are resources and characteristics of the entities of interest. The scarce. To address this issue, this paper explores a main disadvantage of these methods is the manual construc- methodology for ontology alignment in the domain tion of the rules, which is a time-consuming task and depends of food by leveraging Named-Entity-Recognition on the domain. (NER) methods based on different semantic re- Corpus-based NER methods [Alnazzawi et al., 2015; Lea- sources. It is based on a recently published rule- man et al., 2015] are based on an annotated corpus provided based NER method named FoodIE, whose seman-by subject-matter experts as well as the use of ML tech- tic annotations are based on the Hansard corpus, niques to predict the entities’ labels. These methods are less as well as a NER tool called Wikifier, from which affected by terminological resources and manually created DBpedia URIs are extracted. To perform the align- rules. However, their limitation is their dependence on an ex- ment we use the FoodBase corpus, which consists istence of an annotated corpus for the domain of interest. The of recipes annotated with food entities and includes construction of the annotated corpus for a new domain is a a ground truth version which is additionally used time consuming task and requires effort by the subject-matter for evaluation. experts to produce it. To exploit unlabelled data in constructing NER methods, 1 Introduction AL can be used [Settles, 2010; Tran et al., 2017]. This represents semi-supervised learning in which an algorithm is Information Extraction (IE) is the task of automatically ex- able to interactively query the user to obtain the desired la- tracting information from unstructured data and, in most bels/outputs at new data points. Which examples are sent cases, is concerned with the processing of human language to the user for labelling is chosen by the algorithm and their text by means of natural language processing (NLP) [Aggar- number is often much lower than the number of examples re- wal and Zhai, 2012]. The main idea behind IE is to provide quired for supervised learning. 
It usually consists of three a structure to the information extracted from the unstructured components: (1) the annotation interface, (2) the corpus- data. based NER, and (3) component for querying samples. One of the core IE tasks is named-entity recognition (NER), which addresses the problem of identification and classification of predefined concepts [Nadeau and Sekine, 2 Related work 2007]. It aims to determine and identify words or phrases in text into predefined labels (classes) that describe concepts 2.1 Hansard corpus of interest in a given domain. Various NER methods ex- ist: terminology-driven, rule-based, corpus-based, methods The Hansard corpus is a collection of text and concepts cre- based on active learning (AL), and methods based on deep ated as a part of the SAMUELS project [Alexander and An- neural networks (DNNs). derson, 2012; Rayson et al., 2004]. It contains 37 higher level semantic groups, one of which is our topic of interest — Food ∗Contact Author and Drink. 21 2.2 FoodIE Having annotated the recipes with both methods, we can FoodIE is a rule-based food Named-Entity Recognition perform the ontology alignment by using the location infor- method [Popovski et al., 2019a]. As it is rule-based, it con-mation for each annotation in each recipe. Each unique con- sists of a rule-engine in which the rules are based on compu- cept from both methods (semantic resources) is assigned its tational linguistics and semantic information that describe the unique ID, and then a table is constructed for each concept food entities. mapping containing the IDs. 2.3 Wikifier 5 Evaluation and experimental setup Wikifier is a tool that uses an efficient approach for annotating 5.1 Match types documents with relevant concepts from Wikipedia [Brank et • al., 2017]. It is based on a pagerank method to identify a set of True Positives (TP) — these are matches where the relevant concepts. As it provides the location in the document whole food concept is correctly annotated; where the annotation occurs, it is effectively a Named-Entity • False Positives (FP) — these are matches where a non- Recognition method. It provides Wikipedia concepts as anno- food concept is annotated as a food concept; tations, additionally assigning DBpedia concepts if they exist. • False Negatives (FN) — these are matches where a food entity is not properly annotated; 3 Data • Partial match — these are matches where only some to- A recent publication provides one of the first annotated cor- kens from a food concepts are properly annotated. pora, named FoodBase [Popovski et al., 2019b], containing food entities. It consists of two version, a ground truth set 5.2 Evaluation metrics referred to as “curated” (containing 1,000 annotated recipes), Using the concept of True Positives, False Positives and False as well an “un-curated” version, consisting of around 22,000 Negatives, we compute the widely used evaluation metrics: recipes. The recipe categories that are included are: Appe- Precision (P), Recall (R) and F1 Score (F1). They are defined tizers and snacks, Breakfast and Lunch, Dessert, Dinner, and as: Drinks. In this paper, we use the curated version to perform • the ontology alignment as well as evaluate the methodology. P = T P T P +F P This version was manually checked by subject-matter ex- • R = T P perts, so the false positive food entities were removed, while T P +F N the false negative entities were manually added in the corpus. • F 1 = 2 P ·R P +R An example of a recipe can be found on Figure 1. 
6 Results and discussion 4 Ontology alignment After running the evaluation, we obtain the following results. Using FoodIE and the Wikifier tool, we obtain annotations The matches for both methods are presented in Table 1, while for all 1,000 recipes from the FoodBase. the evaluation metrics are presented in Table 2. FoodIE extracts and annotates each recipe with semantic tags from the Hansard corpus. Each annotation contains the Table 1: Match types. location of the extracted entity, i.e. where in the raw text the surface form representing the concept occurs, and its corre- FoodIE Wikifier sponding semantic tags from the Hansard corpus. TPs 11461 6380 The Wikifier tool is used to annotate the recipes with DB- FNs 684 4121 pedia URIs. As these are general DBpedia concepts, ad- FPs 258 5861 ditional information to filter out food concepts from non- Partial 359 3297 food concepts is required. Webscraping the pages for the URIs provides useful information that can be used to dis- tinguish food from non-food concepts, such as the broader Table 2: Evaluation metrics. concept/class to which the concept of interest belongs. The post-processing of the DBpedia URIs checks the entity type FoodIE Wikifier of the concept and checks if it is one of: “FOOD”, “FOODS”, F1 Score 0.9605 0.5611 “DISH”, “INGREDIENT”, “FOOD AND DRINK”, “BEV- Precision 0.9780 0.5212 ERAGE”, “PLANT”, “ANIMAL”, or “FUNGUS”. If it does Recall 0.9437 0.6076 not belong to one of the above entity types, the page is checked for mentions of other URIs which are semantically From the results in the tables it is evident that FoodIE pro- related to food: “FOOD”, “PLANT”, “ANIMAL”, or “FUN- vides more promising results. However, this was expected as GUS”. These URI mentions can occur anywhere in the page this NER method was specifically constructed to only cater and if one of these matches is satisfied, the entity is assumed to the domain of food. Of especial interest are the matches of to be a food entity. type partial, since they represent a match where only a subset A post-processed example of such an annotation can be of the tokens in a food entity are correctly recognized. For found on Figure 2. example, looking at Figure 1, the first extracted food entity 22 Figure 1: Example recipe from the “curated” part of FoodBase. Figure 2: Wikifier annotation example on a single recipe 23 should be “dry ranch salad dressing”, which is correctly ex- [Alnazzawi et al., 2015] Noha Alnazzawi, Paul Thompson, tracted by FoodIE. Looking at Figure 2, the same food entity Riza Batista-Navarro, and Sophia Ananiadou. Using text is only extracted as “salad”. Such match types do not factor mining techniques to extract phenotypic information from in the calculation of the evaluation metrics, as it is debatable the phenochf corpus. BMC medical informatics and deci- whether to count them as TPs or FNs. Nevertheless, they sion making, 15(2):1, 2015. are interesting to compare, since even partial matches con- [Brank et al., 2017] Janez Brank, Gregor Leban, and Marko vey at least some semantic meaning regarding the food entity. Grobelnik. Annotating documents with relevant wikipedia Moreover, FP annotations on the same figure are “bowl” and concepts. Proceedings of SiKDD, 2017. “shape” which are not food entities. 
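The evaluation metrics in Table 2 follow directly from the match counts in Table 1, so they can be reproduced in a few lines (the counts below are copied from Table 1):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Match counts from Table 1.
for name, (tp, fp, fn) in {"FoodIE": (11461, 258, 684),
                           "Wikifier": (6380, 5861, 4121)}.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
# FoodIE:   P=0.9780 R=0.9437 F1=0.9605
# Wikifier: P=0.5212 R=0.6076 F1=0.5611
```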
Additionally, a recent comparison of existing food NER methods can be found in [Hanisch et al., 2005] Daniel Hanisch, Katrin Fundel, [Popovski et al., 2020], where the authors compare the per-Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane formance of FoodIE with NER methods using other food on- Fluck. Prominer: rule-based protein and gene entity tologies available in the BioPortal. recognition. BMC bioinformatics, 6(1):S14, 2005. Regarding the mapping of the concepts, a total of 348 ex- [Leaman et al., 2015] Robert Leaman, Chih-Hsuan Wei, plicit concept mappings were discovered by the methodology. Cherry Zou, and Zhiyong Lu. Mining patents with tm- An example mapping for the concept “garlic” would be: chem, gnormplus and an ensemble of open systems. In • A000016: ‘garlic’, AG.01.h.02.e [Onion/leek/garlic]. Proce. The fifth BioCreative challenge evaluation work- shop, pages 140–146, 2015. • E000029: ‘garlic’, http://dbpedia.org/resource/Garlic [Nadeau and Sekine, 2007] David Nadeau and Satoshi 7 Conclusion and future work Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, In this work we propose a methodology for ontology align- 2007. ment by using Named-Entity Recognition methods in the do- main of food. It utilizes the newly proposed FoodIE NER [Popovski et al., 2019a] Gorjan Popovski, Stefan Kochev, method and the Wikifier text annotation tool. Our experimen- Barbara Koroušić Seljak, and Tome Eftimov. Foodie: A tal results show that FoodIE provides more promising results rule-based named-entity recognition method for food in- than Wikifier, achieving an F 1 score of 0.9605, compared formation extraction. In Proceedings of the 8th Inter- to 0.5611. This is expected since FoodIE is specifically de- national Conference on Pattern Recognition Applications signed for the food domain, while Wikifier uses general vo- and Methods, (ICPRAM 2019), pages 915–922, 2019. cabulary and annotates text with Wikipedia concepts. [Popovski et al., 2019b] Gorjan Popovski, Barbara Koroušić For future work, recursive webscraping can be performed Seljak, and Tome Eftimov. FoodBase corpus: a new re- to more accurately distinguish between food and non-food source of annotated food entities. Database, 2019, 11 annotated concepts from the Wikifier tool. Specifically, this 2019. baz121. would mean repeating the steps to check if the entity is a [Popovski et al., 2020] G. Popovski, B. K. Seljak, and T. Ef- food entity or not on the parent nodes in DBpedia. Addition- timov. A survey of named-entity recognition methods ally, more food semantic resources can be included to provide for food information extraction. IEEE Access, 8:31586– mapping between multiple ontologies. Doing this is depen- 31594, 2020. dent on the existence of a NER method that works with con- cepts from the desired food semantic resource. [Rayson et al., 2004] Paul Rayson, Dawn Archer, Scott Piao, and AM McEnery. The ucrel semantic analysis system. Acknowledgements 2004. This research was supported by the Slovenian Research [Settles, 2010] Burr Settles. Active learning literature sur- Agency (research core grant number P2-0098), and the Eu- vey. University of Wisconsin, Madison, 52(55-66):11, ropean Union’s Horizon 2020 research and innovation pro- 2010. gramme (FNS-Cloud, Food Nutrition Security) (grant agree- [Tran et al., 2017] Van Cuong Tran, Ngoc Thanh Nguyen, ment 863059). The information and the views set out in this Hamido Fujita, Dinh Tuyen Hoang, and Dosam Hwang. 
A publication are those of the authors and do not necessarily re- combination of active learning and self-learning for named flect the official opinion of the European Union. Neither the entity recognition on twitter using conditional random European Union institutions and bodies nor any person acting fields. Knowledge-Based Systems, 132:179–187, 2017. on their behalf may be held responsible for the use that may [Zhou et al., 2006] Xiaohua Zhou, Xiaodan Zhang, and Xi- be made of the information contained herein. aohua Hu. Maxmatcher: Biological concept extraction us- ing approximate dictionary lookup. In Pacific Rim Interna- References tional Conference on Artificial Intelligence, pages 1145– [Aggarwal and Zhai, 2012] Charu C Aggarwal and ChengX- 1149. Springer, 2006. iang Zhai. Mining text data. Springer Science & Business Media, 2012. [Alexander and Anderson, 2012] Marc Alexander and J An- derson. The hansard corpus, 1803-2003. 2012. 24 Extracting structured metadata from multilingual textual descriptions in the domain of silk heritage M.Besher Massri Dunja Mladenić Jožef Stefan Institute, Slovenia Jožef Stefan Institute besher.massri@ijs.si Jožef Stefan International Postgraduate School Ljubljana, Slovenia dunja.mladenic@ijs.si ABSTRACT processing and annotation, we generated 24 binary datasets and 19 multi-class datasets (four for English, two for Spanish, and In this paper, we present a methodology for extracting structured one for French). Using machine learning techniques we trained metadata from museum artifacts in the field of silk heritage. The classifiers on the labeled data examples to predict the labels (slot main challenge was to train on a relatively small and noisy data values) based on the textual descriptions. Despite relatively small corpus with highly imbalanced class distribution by utilizing a and unbalanced data corpora, using sampling techniques and variety of machine learning techniques. We have evaluated the weighted loss function helped mitigate the issue. In an experi- proposed approach on real-world data from five museums, two mental evaluation, we observed that on our data using traditional English, two Spanish, and one French. The experimental results methods might be as good as using deep learning models when show that in our setting using traditional machine learning al- the data is scarce. However, using deep learning allows for build- gorithms such as Support Vector Machines gives comparable ing multilingual models that scale across different languages. and in some cases better results than multilingual deep learning The main contribution of this paper is in proposing an ap- algorithms. The study presents an effective approach for catego- proach to adding metadata to historical artifacts based on ap- rization of text described artifacts in a niche domain with scarce plying machine learning on multilingual textual descriptions of data resources. the artifacts. Moreover, we have defined the learning problem in KEYWORDS collaboration with domain experts and performed evaluations on real-world data in English, Spanish, and French. The rest of this Information extraction, Text classification, Silk heritage, Trans- paper is structured as follows. Section 2 provides a description of formers, Support Vector Machines. the data, Section 3 describes the proposed methodology, Section 4 gives the results of the evaluation and Section 5 concludes the 1 INTRODUCTION paper summarizing the approach and the findings. 
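A small sketch of how a record's categorical attributes can be turned into a binary label for one slot value, following the string-matching rule detailed in Section 3.1 below; the slot values and alternatives shown here are an illustrative subset for the weave slot, not the experts' full list.

```python
# Slot values for the 'weave' slot with a few alternatives (illustrative subset).
WEAVE_VALUES = {
    "satin": ["satin"],
    "twill": ["twill"],
    "tabby": ["tabby", "plain weave"],
}

def label_for(categorical_text, target, slot_values=WEAVE_VALUES):
    """True / False / None (indeterminate, example removed) for one slot value."""
    text = categorical_text.lower()
    mentioned = {value for value, alts in slot_values.items()
                 if any(alt in text for alt in alts)}
    if target not in mentioned:
        return False              # no mention of the target value
    if mentioned == {target}:
        return True               # only the target value (or its alternatives) is mentioned
    return None                   # several slot values mentioned -> indeterminate

print(label_for("Silk, satin weave", "satin"))            # True
print(label_for("Satin and twill fragments", "satin"))    # None
print(label_for("Plain weave linen", "satin"))             # False
```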
When looking to improve the understanding of silk heritage we find that the data available in the museums often lack seman- tic information on the artifacts or have them to some extent 2 DESCRIPTION OF DATA included in textual descriptions. To facilitate automatic analysis We used the SilkNow knowledge graph [8] as our source of data. of silk heritage data and support digital modeling of the weaving The source consists of records of different museums in different techniques, we propose multilingual metadata extraction from languages as shown in Table 1. The largest are MET with8364 textual descriptions provided by the museums. artifacts in English, VAM with 7231 artifacts in English, and Ima- We propose the usage of machine learning techniques to model tex with 6799 artifacts in Spanish. We have used a subset of the the target variables, referred here as slots to align with the ter- data that contain artifacts with provided metadata and textual minology of information extraction. Using machine learning descriptions in related fields that were pointed out as relevant by methods we build a model for each of the target variables in the domain experts. Each record consists of the basic information order to annotate the text. This enabled us to add metadata to about the object, such as the title and the museum it belongs to, the silk heritage artifacts of the museums. The domain experts along with two other sets of attributes, textual attributes, and collaborating on Silknow project [9] have identified four kinds categorical attributes. Textual attributes hold a textual descrip-of metadata information that would be useful and are contained tion of the object in several fields, such as physical description in texts of at least some of the targeted museums. We treat these and a technical description. The categorical description holds as four slots for information extraction, where the list of possible metadata information, such as technique or materials used. How- slot values for each of the four was defined by the domain experts. ever, the data quality varies across the museums and records. Based on that we formed a multi-class dataset for each slot. Some museums are rich in both textual and categorical attributes, The corpora of text included were in three different languages like the VAM museum, and others have short/low-quality textual (English, Spanish, and French) from five different museums, with attributes like Imatex. Also, some records have a text description a total of 500 museum records used in the study. After the data in their categorical attributes instead of a single category value. The metadata fields that we have considered are weaving Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or technique, weave, motifs, and style. The list of labels or slot distributed for profit or commercial advantage and that copies bear this notice and values for each of the metadata field (i.e. slot for information the full citation on the first page. Copyrights for components of this work owned extraction) were compiled by the domain experts. These values by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior describe the silk artifacts’ nature and structure. Each of those specific permission and /or a fee. Request permissions from permissions@acm.org. 
slot values is represented by a term and a list of alternatives, up Information society ’20, October 5–9, 2020, Ljubljana, Slovenia to four alternatives per term. Examples of slot values are satin, © 2020 Association for Computing Machinery. twill, and tabby, representing possible values of the weave slot. 25 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Museum Language Count The features were generated from sequences of words, referred CER Spanish 1296 to as n-grams, of length 1, 2, and 3. The remaining parameters Garin Spanish 3101 were left unchanged from their default values. We used nltk [1] Imatex Spanish 6799 library for tokenization, SpaCy [4] for lemmatization, and Snow Joconde French 376 Ball Stemmer [6] for stemming. MAD French 763 Due to the methodology of data labeling, we sometimes ended MET English 8364 up with a highly imbalanced datasets having a lot more negatives MFA English 3297 than positives. Therefore, in the binary dataset, we took a random MTMAD French 663 subset from the negative examples to match the positive count. In RISD English 3338 addition, some examples were generated from the same records, by having more than one textual record with mentions of the VAM English 7231 Table 1: Museums from the Silknow knowledge graph same class’s term/alternatives, therefore, corrections have been showing the language of the artifacts and the number of applied to the dataset by putting all examples of the same record artifacts included in the knowledge graph. in either train or test but not in both. This process was done to ensure no leakage occurs by potentially having highly similar textual text in train and text. 3.3 Multi-class Classification Tasks 3 METHODOLOGY For multi-class classification, we used a deep learning approach. The architecture consists of a pre-trained transformer, an LSTM 3.1 Annotating datasets with slot values layer, a dropout layer, a dense (linear) layer, and finally a soft-max Based on the data and target variables, two types of datasets activation layer. For the transformer we used BERT [3], multi-were formed for two types of text classification tasks. The first lingual BERT, and XLM-ROBERTA [2]. The loss function used type is binary classification dataset, in which the target class was a cross-entropy loss with Adam as the optimizer. We used is one of the slot values. The other is multi-class classification PyTorch framework [7] and hugging-face transformers library dataset, in which a dataset is formed for each of the four slots in [10]. each museum, where the target classes are the slot values that fall Considering that some of the datasets have a large class imbal- under the selected slot in addition to extra "other" class indicating ance, which can be a couple of thousand examples of the majority that the example doesn’t fall under any of them. class and only a few examples of the minority classes, we exper- For forming the binary classification dataset we used a simple imented with several class-weighting schemas. First, we tried string matching approach. For each target class in each museum, assigning weights to the classes in the loss function is inversely examples were formed out of textual attributes of the museum proportional to the number of examples of each class. In addi- records that contain a mention of either one of the possible value tion, when we used weighted sampling with return for loading terms or its alternatives. Categorical attributes of the same record the examples into batches. 
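As an illustration of the two weighting schemas just mentioned, the following sketch (in PyTorch, which the authors report using) shows inverse-frequency class weights passed to the cross-entropy loss and a weighted sampler that draws examples with replacement when loading batches; the function and variable names are ours and only indicate the idea, not the project's actual code.

import torch
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_weighted_objects(labels, train_dataset, batch_size=16):
    # `labels` is assumed to be a list of integer class ids (0 .. K-1), one per
    # training example in `train_dataset`.
    counts = Counter(labels)
    num_classes = len(counts)

    # (1) Loss weights inversely proportional to the class frequencies.
    class_weights = torch.tensor([1.0 / counts[c] for c in range(num_classes)])
    loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

    # (2) Sampling with replacement: every example is drawn with probability
    # proportional to the inverse frequency of its class, which over-samples
    # minority classes and under-samples the majority class within a batch.
    example_weights = [1.0 / counts[y] for y in labels]
    sampler = WeightedRandomSampler(example_weights, num_samples=len(labels),
                                    replacement=True)
    loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
    return loss_fn, loader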
This had the effect of over-sampling were used to determine the label of the example. The task is to the minority classes and under-sampling the majority classes to classify whether the example has the slot value against the other achieve as balanced batch representation as possible. Finally, we slot values of the same slot. Each item is classified as True if tried a derivable version of F1 Macro as a loss function where the the categorical attributes contain only the target value or one prediction matrix is taken as a probability rather than a binary of its alternatives but not any of the other slot values’ terms value. or their alternatives. If there is no mention of the slot value term or alternatives, then it’s classified as false. If it contains 4 RESULTS this slot value’ term along with other slot values’ terms then it’s 4.1 Experimental Datasets considered as indeterminate and the example is removed. To form the multi-class datasets, we merged the datasets of The dataset collection methodology was applied to 10 museums the same museum with target classes representing slot values and 4 categories holding more than 150 class values overall. How- that fall under the same slot. The true items of each slot value ever, most of the datasets have no positive items. In this research, dataset formed the set of the examples with that slot value as the we have selected datasets with at least 10 positive examples for labels. The items that are false in each slot value dataset formed binary classification tasks and at least 10 non-other in multi- the "Other" class in the multi-class dataset. class tasks. This final list consists of 24 binary datasets and 19 multi-class datasets. These datasets are used for training machine 3.2 Binary Classification Tasks learning classifiers. For binary classification, we used TFIDF word-vector represen- 4.2 Binary Classification Tasks tation for generating the feature vectors and trained a Linear For binary Classification, we applied the described methodology Support Vector Machines (SVM) as the classifier using scikit- on all the datasets with at least 10 positive examples. The results learn library [5]. All dataset were split into train and test using of binary classification are consolidated in Table 2. 80-20 stratified split. We performed a grid search with 5-fold The graph in figure 1 displaying the correlation between the cross validation on the training part using the following options: number of examples and the F1 score reveals a weak correlation • stemming, lemmatisation, or none of 0.19. We can see that when having more than 600 examples, we • max document frequency: [0.95.1.0] achieve F1 over 0.8. Upon closer inspection on the museum level, • min document frequency: [0,0.05] we found that the best results are achieved in the MFA museum on • SVM tolerance: [1e-4,1e-5] motifs and weaving technique and Joconde museums on weave. 
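To make the binary-classification set-up of Section 3.2 concrete, a minimal scikit-learn sketch is given below: TF-IDF features over word 1-, 2- and 3-grams, a linear SVM, an 80-20 stratified split and a 5-fold grid search over the listed document-frequency and tolerance values. The stemming/lemmatisation choice is assumed to have been applied to the input texts beforehand, and all names are illustrative rather than taken from the project code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train_binary_slot_classifier(texts, labels):
    # `texts` holds the (already stemmed or lemmatised) textual attributes and
    # `labels` the 0/1 annotations produced by the string-matching step.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),  # word 1-, 2-, 3-grams
        ("svm", LinearSVC()),
    ])
    grid = {
        "tfidf__max_df": [0.95, 1.0],
        "tfidf__min_df": [0.0, 0.05],
        "svm__tol": [1e-4, 1e-5],
    }
    search = GridSearchCV(pipeline, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_estimator_.score(X_test, y_test)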
26 Extracting structured metadata from multilingual textual descriptions in the domain of silk heritage Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Museum Slot value Slot Language #Exs Accuracy Precision Recall F1 cer bordado weaving technique Spanish 278 0.89 0.87 0.93 0.9 cer motivo vegetal motifs Spanish 146 0.57 0.56 0.6 0.58 cer tafetán weave Spanish 581 0.77 0.9 0.6 0.72 cer terciopelo weaving technique Spanish 118 0.67 0.67 0.67 0.67 garin brocatel weaving technique Spanish 932 0.88 0.85 0.92 0.89 garin damasco weaving technique Spanish 1748 0.9 0.92 0.87 0.89 garin espolÃn weaving technique Spanish 972 0.88 0.89 0.88 0.88 joconde Satin weave French 159 0.91 0.9 0.95 0.93 joconde Taffetas weave French 110 0.95 0.92 1 0.96 mfa Lace motifs English 190 0.92 0.9 0.95 0.92 mfa plain weaving technique English 130 1.00 1.00 1.00 1.00 vam brocade weaving technique English 634 0.87 0.87 0.87 0.87 vam damask weaving technique English 480 0.84 0.85 0.83 0.84 vam Ear motifs English 262 0.83 0.84 0.81 0.82 vam Edge motifs English 178 0.81 0.87 0.72 0.79 vam embroidery weaving technique English 1614 0.85 0.86 0.83 0.84 Table 2: Results for the binary classification task. Overall the best results are achieved by MFA and Joconde with because of the large fluctuation in F1 macro value across training an average F1 of .96 and .95 respectively followed by Garin, VAM, epochs caused by having minority classes with few examples. and CER with the average F1 of .89, .81, and .72 respectively. Model configuration Accuracy F1 Base model 84.6 43.1 Weighted loss 82.1 47.2 Weighted sampling 82.6 52.2 F1 loss function 77.5 59.1 weighted sampling and f1 loss 52 22.8 Weighted loss and weighted sampling 84.8 54.7 + Learning rate 1e-4 − → 5𝑒 − 6 86.1 57.9 Multi-Lingual BERT 85.3 55.2 XLM-ROBERTA 87.5 53.6 Table 3: Comparison between different model configura- tion on the Weave Slot detection in VAM Dataset Figure 1: F1 score vs #Examples showing good perfor- mance on the largest datasets, when the number of exam- ples is at least 600. Comparing the learning curves of BERT and multi-lingual BERT in figure 2 reveals that despite the comparable results, the multi-lingual BERT took double the number of epochs to 4.3 Multi Class Classification Class stabilize and finish training compared to its BERT counterpart. 4.3.1 Use Case: Detecting Weave Slot from VAM museum. We This can be due to the fact that Multi-lingual BERT is trained in selected the VAM Weave slot as a use case dataset to perform many languages and it needs more fine-tuning to adapt to any hyperparameter tuning and select the best configurations for certain language, whereas the BERT transformer was trained in weighting. The dataset contains 2760 items with a baseline of English-only documents. 52.9% distributed across 4 classes: Satin, Tabby, Twill, and Other. The dataset was split into train, test, and validation in the form 4.3.2 Generalizing towards all datasets. After we experimented of 60-20-20 split. The results in Table 3 show that using class with different parameter settings, we decided to use the follow-weighting in both loss function and sampling provides the best ing parameters on all the datasets: Weighted Loss function and −6 results w.r.t both classification accuracy and F1. Using F1 as a loss Weighted Sampling for batches; learning rate of 5 ∗ 10 ; batch function sometimes provided good results but was discarded as size of 16 for BERT and 12 for multi-lingual BERT and XLM- it was not stable across different datasets. 
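For reference, the deep-learning architecture of Section 3.3 (pre-trained transformer, LSTM layer, dropout, linear layer and soft-max) can be sketched roughly as follows. This is a simplified reconstruction using the reported hyper-parameter values (1024 LSTM units, dropout of 0.5), not the authors' actual implementation; the soft-max is left to the cross-entropy loss, and the weighted loss and sampler sketched earlier, together with Adam at a learning rate of 5e-6, complete the training loop.

import torch.nn as nn
from transformers import AutoModel

class TransformerLSTMClassifier(nn.Module):
    # Pre-trained transformer -> LSTM -> dropout -> linear; class probabilities
    # follow from applying soft-max to the returned logits.
    def __init__(self, num_classes, model_name="bert-base-multilingual-cased",
                 lstm_units=1024, dropout=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_units,
                            batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(lstm_units, num_classes)

    def forward(self, input_ids, attention_mask):
        token_states = self.encoder(input_ids=input_ids,
                                    attention_mask=attention_mask).last_hidden_state
        lstm_states, _ = self.lstm(token_states)
        summary = self.dropout(lstm_states[:, -1, :])   # last LSTM state
        return self.out(summary)                        # class logits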
In addition, decreasing ROBERTA, due to memory limits; 1024 Units for LSTM Layer; the learning rate improved results and stabilized the training dropout layer of 0.5. curve. Finally, using the XLM-ROBERTA transformer showed an Moreover, the datasets were tested against three types of trans- improvement in accuracy. The number of epochs was determined former: Language-Specific BERT, Multilingual BERT, and XLM- based on the accuracy performance of the validation dataset. The ROBERTA, as well as the SVM classifier. The accuracy results in training would stop when the accuracy did not improve for the Table 4 show that on most of the datasets SVM performs better last 15 epochs. The accuracy (F1 micro) was chosen over F1 macro or comparable to the deep learning models. 27 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Museum Lang Slot Baseline # Cls # Exs SVM BERT Multi BERT XLM-ROBERTA VAM English Weave 52.9 4 2760 82.8 86 85.3 87.5 VAM English Weaving Technique 35.9 14 3525 77.6 80.1 78 78 VAM English Motifs 84.8 9 5500 91 90.6 87.4 87 CER Spanish Weave 59.3 5 945 75.1 75.1 64 72 CER Spanish Weaving Technique 61.1 11 720 74.3 74.1 71.5 66 Joconde French Weave 55.6 4 180 66.7 30.6 86.1 91.7 Joconde French Weaving Technique 60 5 150 97.2 70 76.7 63.3 Table 4: Results for the multi-class classification task. ACKNOWLEDGMENTS This work was supported by the Slovenian Research Agency and SilkNow European Unions Horizon 2020 project under grant agreement No 769504. REFERENCES [1] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media. [2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Figure 2: Comparison of a learning curve between BERT Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard and Multi-Lingual BERT as a transformer in the deep Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. learning model trained on the VAM museum Weave Slot 2020. Unsupervised cross-lingual representation learning dataset. at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, (July 2020), 8440–8451. doi: 10.18653/v1/2020.acl- main.747. https://www.aclweb. org/anthology/2020.acl- main.747. 5 CONCLUSION AND FUTURE WORK [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: pre-training of deep bidirectional We propose an approach to extracting metadata from a multilin- transformers for language understanding. arXiv preprint gual text description of silk heritage domain museum artifacts. arXiv:1810.04805. The datasets had several specifics that made the model devel- [4] Matthew Honnibal and Ines Montani. spaCy 2: natural opment a non-trivial task. First, the size of the dataset some- language understanding with Bloom embeddings, con- times was too small to train a model. Second, some class values volutional neural networks and incremental parsing. To have considerably more examples than others, which caused appear, (2017). the datasets to be imbalanced. Finally, in the preparation phase, [5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. the datasets were labeled to accommodate the described issues, Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, which in itself is an approximation and carries an inherent error V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. rate. We have improved the performance of the model by over- Brucher, M. Perrot, and E. Duchesnay. 2011. 
Scikit-learn: sampling minority classes, under-sampling majority classes, and machine learning in Python. Journal of Machine Learning using a class-weighted loss function. In addition, by perform- Research, 12, 2825–2830. ing cross-validation in the binary classification case or adding a [6] Martin F. Porter. 2001. Snowball: a language for stemming dropout layer and validating based on a validation dataset, we algorithms. Published online. Accessed 11.03.2008, 15.00h. managed to mitigate some of the over-fitting behavior caused by (2001). http : / / snowball . tartarus . org / texts / introduction . having a little amount of data. We believe that the over-fitting html. could be mitigated further by using regularization on the LSTM [7] [n. d.] Pytorch: an imperative style, high-performance layer, as well as using weight-decaying in the optimizer. deep learning library. In. The experimental results show that with low data quality and [8] 2020. Silknow knowledge graph data. https://github.com/ having not enough data, traditional methods such as SVM in silknow/converter/tree/master/output. (2020). some cases outperform deep neural network models. We expect [9] 2020. SilkNow project. https://silknow.eu/. (2020). that the results could be improved by having an assembly of [10] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. those models instead of using one of them only, which is a part Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, of the future work. Furthermore, one can fine-tune each model S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, independently to achieve better performance. T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. In future work, we plan to test cross-museum learning by Rush. [n. d.] Huggingface’s transformers: state-of-the-art training on one museum and predicting other museums both in natural language processing. the same language and in different languages using multi-lingual transformers. This has practical value for labeling the data in the museums that do not contain metadata information but do have suitable textual descriptions of the artifacts. 28 Hierarchical classification of educational resources Gregor Žunič Erik Novak Jožef Stefan Institute Jožef Stefan Institute Ljubljana, Slovenia Jožef Stefan International Postgraduate School gregor.zunic@ijs.si Ljubljana, Slovenia erik.novak@ijs.si ABSTRACT 2 RELATED WORK This paper describes an approach to automate the process of la- There are two approaches to hierarchically classify the data: (1) the belling hierarchically structured data. We propose a top-down level- Big-bang, and (2) the Top-down level-based approach [4, 8, 9]. based approach with SVMs to classify the data with scientific do- The big-bang approach works by training (complex) global main labels. The model was trained on labeled open education classifiers which consider the entire class hierarchy as a whole. lectures and returns high accuracy predictions for lectures in the Each global classifier is binary and decides if the material fits the English language. We found that our model performs better with entire hierarchy (entire hierarchy is for example “Computer Sci- the traditional text extraction method TF-IDF than with pre-trained ence/Machine Learning/Support Vector Machine”). The advantage language model XLM-RoBERTa. of this approach is that it avoids class-prediction inconsistencies across multiple levels. 
The major drawback of this approach is the KEYWORDS high complexity due to the enforcing the model to correctly predict hierarchical classification, support vector machine, multi-class clas- the whole hierarchy branch, which can be difficult to achieve. sification, machine learning, open educational resources The top-down level-based approach works by training local classifiers at each level to distinguish between its child nodes. An ACM Reference Format: example will first, at the root level, be classified into a second- Gregor Žunič and Erik Novak. 2020. Hierarchical classification of educa-level category. It will then be further classified at the lower level tional resources. In Proceedings of Slovenian KDD Conference (SiKDD’20). category until it reaches one or more final categories where it can ACM, New York, NY, USA, Article 4, 4 pages. https://doi.org/10.475/123_4 not be classified any further. The main advantage of this model is its simplicity. The disadvantage is the difficulty to detect an error 1 INTRODUCTION in the parent category which could lead to false classification. Manually labeling data can be tedious work; one must have suf- The most common implementation of a local classifier [3] is the ficient background knowledge about the data and have clear in-support vector machine [7, 11]. In the later papers they propose to structions in the labeling process. This becomes even more difficult train separate SVMs for every level of a branch in the hierarchy. when the data needs to be annotated with hierarchically structured labels. 3 DATA SET In this paper we present a top-down level-based approach us- ing support vector machines (SVMs) for labeling open education The data set used in the experiment consists of 28,769 OER lec- resources (OERs). The labels are in a hierarchical structure and tures available at Videolectures.NET [10], an award winning video represent different scientific domains. We compare different lecture OER repository. For each lecture we collected the following meta- representations using TF-IDF and XLM-RoBERTa and find that the data: title, description, labels, language, authors, date published and TF-IDF representations yield better results. Even though the paper the length of the lecture. The description is present in 58% of the focuses on OERs the method can be generalized to any textual data lectures. The data set contains 24532 lectures in English, 3930 in set. Slovene and 307 lectures in other 16 languages. The remainder of the paper is structured as follows. Section 2 Preprocessing. For our methodology we used only the lecture’s describes the related work done on the topic of hierarchical classifi- title, description, language and categories. Each lecture is labeled cation. Next, we present the data used in the evaluation in Section 3. with one or more scientific (sub-)domains most relevant for the The methodology is described in Section 4. The evaluation setting lecture (e.g. “Computer Science”, "Computer Science/Crowd Sourc-and its results are described in Section 5 followed by a discussion ing"). Figure 1 shows the distribution of lectures per number of in Section 6. We present the future work in Section 7 and conclude labels. the paper in Section 8. Almost half of the lectures have more than one label. Lectures with no labels are placed under the “No Labels” category. 
These lectures are mostly introductory speakers’ presentations in confer- Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed ences. We focus on predicting a single label with high accuracy. We for profit or commercial advantage and that copies bear this notice and the full citation prescribed to only have one label per lecture. We achieve this by on the first page. Copyrights for third-party components of this work must be honored. duplicating a lecture For all other uses, contact the owner/author(s). 𝑛 times, where 𝑛 is the number of labels of SiKDD’20, October 2020, Ljubljana, Slovenia the lecture and assign a distinct label to each duplicate. Although © 2020 Copyright held by the owner/author(s). the duplicates may reduce the performance of the models we do ACM ISBN 123-4567-24-567/08/06. not reduce the already small number of lectures used during the https://doi.org/10.475/123_4 29 SiKDD’20, October 2020, Ljubljana, Slovenia Gregor Žunič and Erik Novak XLM-RoBERTa. The model is based on the RoBERTa model released in 2019. It is a large language model trained on 2.5 TB of CommonCrawl data [2]. The model achieves state-of-the-art performance on cross-lingual classification, sequence labeling and question answering. The most useful feature of the model is that it does not require the sentence language as an input. In theory, it extracts the same vectors for similar words in 100 languages. The length of the vector that the model outputs is 768. To ex- tract the features a CUDA-enabled GPU is required and the model training is very slow. 4.2 Multi-class SVM Classifier Figure 1: Distribution of lectures per number of correspond- We chose the top-down level-based approach for our classifier. The ing labels. Most of the lectures have only one label. raw text input is firstly vectorized following one of the two feature extraction approaches described in Section 4.1. The vector is then training process. Figure 2 shows the top scientific domain labels in input to the main SVM which determines the first category. Then the data set. the input is handled by the second SVM, trained specifically for sub- labels of first classified category. If a sub-label tops the threshold of 0, this step is repeated, otherwise the model outputs the lowest level parent category. For example “Computer Science” is the first determined cate- gory. Then the input is handled by the SVM trained on sub-labels of “Computer Science”, which determines that the input does not match with any of the sub-labels. The model puts the lecture in the “Computer Science” category. This is visually explained in figure 3. Input ... “Machine Feature “Computer 0 . 1 - 0 . 2 Learning” SVM Figure 2: Top scientific domain labels in the data set. The extraction Science” most frequent label is Computer_Science. SVM “Semantic - 0 . 7 “Business” - 0 . 7 SVM . . . Web” The most frequent label is “Computer Science”. In addition, a “Social - 1 . 0 SVM . . . Sciences” large number of lectures are not labeled; this is because a lot of ... lectures are presentations that do not correspond to any of the scientific domains. The data set is unbalanced on both domain and Figure 3: Visual representation of hierarchical SVM classi- sub-domain levels. fier. 
The example shows a lecture classified as belonging to the “Computer Science” category 4 METHODOLOGIES In this section we describe the methods used to perform the feature Each SVM is an implementation of a multi-class classifier using extraction of the text, the implementation of multi class classifier the one-vs-rest approach. Predicted class should always be domi- model and the lectures’ weights. nant otherwise the recommendation is not relevant. The input to the classifier is a raw string created by concatenating the title and the description if the description is available. It is then 4.3 Lecture Weights converted to a vector. In this paper we experimented with two Each lecture is assigned a weight of 1 , 𝑥 = 4, where 𝑛 is the 𝑥 𝑛 approaches: TF-IDF and XLM-RoBERTa. number of total labels in the original lecture and 𝑥 is a parameter. If 𝑥 < 4 the accuracy is greatly reduced, if 𝑥 > 4 the accuracy is 4.1 Feature Extraction increased by a small margin. It converges when 𝑥 → ∞. When TF-IDF. Each lecture is represented with a vector of its TF-IDF increasing the parameter 𝑥 the weight comes closer to 0 which values [6]. TF measures how frequently a term occurs in a lecture’s means that the model accounts for data less during training. This text. The IDF is a measure of how much information the word means that the 4th power is a sufficient balance between excluding provides. If it is common across all lectures its value is close to 0. some data and reducing the accuracy. The terms with the highest TF-IDF scores are usually the ones that The other approach could be to ignore multi-label lectures during characterize the topic of the lecture best. testing phase ( 1∞ ). 𝑛 The size of the lecture’s vector representation is exactly the same Because some labels are so scarce, we limit ourselves to labels as the total number of unique words. Since most of the features are with at least 20 lectures. This reduces the total number of labels in zero the lecture vectors are sparse. the data set from 502 to 244. 30 Hierarchical classification of educational resources SiKDD’20, October 2020, Ljubljana, Slovenia 5 EVALUATION the model would opt for SVMs trained on features extracted using 5.1 Parameters and Specifications TF-IDF, because of the better performance. All other languages would be handled by SVMs trained by XLM-RoBERTa, because the SVM. The SVM implementation used in the evaluation is the Lin- classifier performs much better than random. earSVC [1] with the default parameters. The TD-IDF method could also be used to classify lectures that XLM-RoBERTa. The model used for representation generation are in the non-english languages by firstly translating the text to is the hugging face’s pretrained model [5] which was trained on English before using them during training. With this approach the default parameters found in the paper [2]. The training was exe-model could work in all languages and retain the simplicity of TF- cuted on the Google Colab (online hosted Jupyter notebook) free IDF. Note that that this approach would be strongly dependant on tier machine (12GB RAM, dual core CPU, NVIDIA K80). the quality of the translations. Weighting the errors during the training process. We did 5.2 Results not use the hierarchy structure for calculating the error between Table 1 shows the performance of the different models with linear the predicted and the actual labels hence all the errors types during kernel. We have also evaluated other kernels (polynomial, RBF, training were the same. 
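A rough sketch of the top-down prediction walk described in Section 4.2 is given below. The dictionary svm_per_node and the assumption that each local SVM is trained over at least three sub-labels (so that decision_function returns one score per sub-label) are ours, introduced only to keep the illustration short; the lecture vector may come from either the TF-IDF or the XLM-RoBERTa representation.

import numpy as np

def predict_top_down(vector, svm_per_node, root="Root"):
    # `svm_per_node` maps a category to the multi-class SVM trained on its
    # sub-labels; leaf categories are simply absent from the dictionary.
    node = root
    while node in svm_per_node:
        svm = svm_per_node[node]
        scores = svm.decision_function([vector])[0]   # one score per sub-label
        best = int(np.argmax(scores))
        if scores[best] <= 0:       # no sub-label tops the threshold of 0,
            break                   # so output the lowest-level parent category
        node = svm.classes_[best]
    return node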
This is not ideal because the error should sigmoid), but the performance was worse than using linear kernel. be more significant when the classifier incorrectly predicts the That is why we omitted them from the performance table. main branch versus when it incorrectly predicts a lower level label. TF-IDF with linear kernel SVM. Using the TF-IDF method for For example, if we take a lecture that is labeled as “Computer feature extraction we found that the SVMs performed the best with Science/Machine Learning” then the error should be bigger if our linear kernel. One explanation for such results is that the dimension classifier predicts the “Biology” label rather than the “Computer of the features is large (more than 60k), which means that other Science/Semantic Web” label. more advance kernels might lead to over-fitting. XLM-RoBERTa with linear kernel SVM. The model’s perfor- 7 FUTURE WORK mance was worse than using TF-IDF. The accuracy of the main We intend to improve the performance of the XLM-RoBERTa and classifier was 19% compared to 70% when using TF-IDF. The other to experiment with other language models and try to achieve better SVM kernels (polynomial, RBF, sigmoid) performed worse com- performance. pared to linear kernel. Table 1 shows the performance of the model. One additional direction for future work might be training a SVM. The problem with current SVM implementation is that it multiclass classifier to predict more than one label to a given lecture. can only put the lecture in one category. One way to solve the issue We tried implementing the multi label output classifier using the of only one label would be to firstly predict one label. Then, if the MultiOutputClassifier wrapper on SVM but the precision of the user (editor) wants another prediction, the model can output the model was noticeably lower. prediction with second highest certainty. The model is ready to be used in production in Videolectures.NET TF-IDF vs XLM-RoBERTa. The advantage of choosing XLM- as a recommender engine to help the editors. The service could RoBERTa over of TF-IDF is that it works with 100 languages. The either be wrapped in a Flask microservice or directly into Videolec- vector outputs are similar [2] for all languages. This was proven tures.NET’s backend. by translating the same text input into multiple languages (using Google Translate) and the predicted category did not change. When 8 CONCLUSION using TF-IDF you have to split the original data set into subsets In this paper we explore a top-down level-based approach for clas- containing a single language and train the model from scratch. That sifying OER lectures with scientific domain labels. We used over- would be possible with enough data. For some languages (German, sampling to handle label unbalance and experimented with two French) the the data set contains less than 30 lectures, which means text representation approaches, TF-IDF and XLM-RoBERTa. We that you can not train an SVM sufficiently. found that the model using the TF-IDF representations gives better results. 6 DISCUSSION ACKNOWLEDGMENTS Unbalanced Data Set. We found the SVM trained on an over- sampled data set to be working better than the SVM trained on the This work was supported by the Slovenian Research Agency and raw data set. Due to the unbalanced data if the data set is not re- X5GON European Unions Horizon 2020 project under grant agree- sampled the bias towards the strongest category (Computer Science) ment No 761758. is strongly presented. 
For example neutral words such as “ ”, “the” etc. are classified as belonging in REFERENCES Computer Science category. Comparing Word Embedding Techniques. The TF-IDF ap- [1] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, proach performs much better than XLM-RoBERTa which is surpris-Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël ing. Pre-trained models usually perform better than legacy feature Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining extractors. The reason could be that the hyper parameters of the and Machine Learning. 108–122. model were not set correctly, but we did not find the right balance [2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guil-for the model to perform any better. The production versions could laume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised Cross-lingual Representation Learning include both models. For languages with a lot of data in the data set, at Scale. arXiv preprint arXiv:1911.02116 (2019). 31 SiKDD’20, October 2020, Ljubljana, Slovenia Gregor Žunič and Erik Novak parent TF-IDF XLM-RoBERTa materials category acc. recc. F prec. acc. recc. F prec. Root 70% 69% 72% 75% 19% 11% 19% 68% 27009 Computer Science 59% 59% 60% 61% 9% 4% 8% 50% 12935 Machine Learning 60% 55% 59% 64% 11% 5% 9% 26% 3260 Semantic Web 75% 71% 75% 79% 23% 20% 31% 68% 454 Computer Vision 82% 79% 81% 83% 57% 55% 59% 63% 140 Social Sciences 73% 72% 73% 74% 35% 24% 34% 60% 2928 Society 74% 72% 72% 72% 36% 28% 38% 60% 890 Politics 76% 66% 75% 86% 59% 43% 54% 73% 83 Law 96% 96% 96% 96% 57% 41% 51% 67% 112 Journalism 100% 100% 100% 100% 91% 88% 90% 92% 53 Technology 84% 82% 82% 82% 50% 43% 50% 60% 970 Nanotechnology 69% 59% 69% 83% 46% 37% 46% 62% 78 Business 74% 72% 73% 74% 43% 36% 43% 54% 1009 Transportation 63% 53% 61% 71% 33% 22% 32% 56% 267 Humanities 85% 83% 84% 85% 55% 48% 55% 65% 873 Biology 71% 66% 67% 68% 23% 17% 22% 31% 430 Science 78% 77% 78% 79% 53% 51% 52% 53% 656 Medicine 89% 88% 89% 90% 39% 34% 48% 83% 326 Computers 83% 83% 83% 83% 55% 48% 53% 59% 731 Mathematics 89% 87% 89% 91% 41% 36% 38% 40% 421 Physics 86% 81% 85% 89% 36% 32% 38% 46% 227 Arts 88% 87% 85% 83% 45% 40% 49% 63% 338 Visual Arts 100% 100% 100% 100% 62% 56% 70% 92% 159 Design 52% 46% 55% 68% 23% 9% 14% 30% 104 Chemistry 100% 100% 100% 100% 85% 83% 91% 100% 161 Environment 94% 94% 93% 92% 71% 66% 73% 81% 161 Earth Sciences 73% 67% 74% 82% 50% 51% 50% 49% 27 Table 1: Comparison of model performance using the linear kernel. The performance of the TF-IDF approach is better than that of XLM-RoBERTa. [3] Susan Dumais and Hao Chen. 2000. Hierarchical Classification of Web Content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’00). Association for Computing Machinery, New York, NY, USA, 256–263. https://doi.org/10.1145/345508.345593 [4] A. D. Gordon. 1987. A Review of Hierarchical Classification. Journal of the Royal Statistical Society: Series A (General) 150, 2 (1987), 119–137. https://doi.org/10. 2307/2981629 arXiv:https://rss.onlinelibrary.wiley.com/doi/pdf/10.2307/2981629 [5] huggingface. 2020. huggingface.co - pretrained models. https://huggingface.co/ transformers/pretrained_models.html. [6] J.D. 
Rajaraman, A.; Ullman. 2011. Mining of Massive Datasets. pp. 1–17. http: //i.stanford.edu/~ullman/mmds/ch1.pdf. [7] Ahmad Shalbaf, Reza Shalbaf, Mohsen Saffar, and Jamie Sleigh. 2020. Monitoring the level of hypnosis using a hierarchical SVM system. Journal of Clinical Monitoring and Computing 34, 2 (2020), 331–338. https://doi.org/10.1007/ s10877-019-00311-1 [8] Carlos N. Silla and Alex A. Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 1 (2011), 31–72. https://doi.org/10.1007/s10618-010-0175-9 [9] Aixin Sun, Ee-Peng Lim, and Wee-Keong Ng. 2003. Performance measurement framework for hierarchical text classification. Journal of the American Society for Information Science and Technology 54, 11 (2003), 1014–1028. https://doi.org/10. 1002/asi.10298 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/asi.10298 [10] VideoLectures.Net. 2020. VideoLectures.NET - VideoLectures.NET. https:// videolectures.net/. Accessed: 2020-08-20. [11] S. V. M. Vishwanathan and M. Narasimha Murty. 2002. SSVM: a simple SVM algorithm. 3 (2002), 2393–2398 vol.3. 32 Are You Following the Right News-Outlet? A Machine Learning based approach to outlet prediction Swati Dunja Mladenić swati@ijs.si dunja.mladenic@ijs.si Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan International Postgraduate School Jožef Stefan International Postgraduate School Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT outlet is forced to select a set of reporting events. Several factors, such as the geographical origin of the event, the involvement of In this work, we propose a benchmark task of outlet prediction an elite person or country, etc. influences such selection. Also and present a dataset of English news events tailored to the the procedure requires rigorous monitoring of current affairs to proposed task. Addressing this problem would not only allow determine the news value, and may result in event selection bias readers to choose and respond to relevant and broader facets also known as gatekeeping bias. of events but also enable the outlets to examine and report on their work. We also propose a neural network based approach However, no well-established automated method reveals to to recommend a list of probable outlets covering an event of users the outlets that will cover the event of their interest. This interest. Evaluation results reveal that even in its simplest form, drives the motivation of this study. The aim is to predict a list of our model is capable of predicting the outlet significantly better outlets reporting on a given event. Addressing this problem would than the existing rule based approaches. The proposed model not only allow readers to choose and respond to relevant and will also serve as a baseline for evaluating approaches intended broader facets of events but also enable the outlets to examine and to address the task. Implementation scripts can be found at https: // github.com/ Swati17293/ outlet-prediction report on their work. For instance, some outlets tend to publish . events covered by well-established outlets. Instead of waiting for KEYWORDS the news to be published, the proposed system will help them to get an insight into the degree of predictability of event selection News bias, Event Selection bias, News coverage, News Event by the major outlets. 
Analysis, Recommendation System 1 INTRODUCTION 1.1 contributions We make the following contributions in this context: The advancement in the field of Natural Language Processing [9, 10, 5, 4] over the last decade, has made solutions to complex • We propose a benchmark task of outlet prediction and machine learning problems more convenient. The problems such present a dataset of English news events tailored to the as machine translation, text summarization, and segmentation proposed task. are being solved much more efficiently than ever before. Conse- • We provide a neural network model that can serve as a quently, it offered the researchers the opportunity to use these baseline for evaluating approaches intended to address advanced techniques to solve problems in a variety of contexts the task. such as news bias analysis. This analysis task is poised as the The GitHub repository containing our code is available at identification of the inherent bias present in the news production https:// github.com/ Swati17293/ outlet-prediction. and its coverage process. It occurs when a news outlet publishes a news story selectively or incorrectly. 1.2 Problem Statement The problem is addressed as an outlet prediction task in which the If the news is biased, then it can bias the thought process bias is examined by comparing the learning ability of a classifier and decision making of the person listening, watching, and/or trained to predict the probability of event coverage by an outlet. reading it [12]. It can have several direct or indirect implications whether political or social. For example, if the news shows only 2 LITERATURE REVIEW the positive or negative side of a political party; it has been ob- During the different stages of news production, various forms of served to influence the public vote [2]. Not only politics but also news bias arise as described by Baker et al. [1]. The first stage the news about the disaster or spread of viral disease affects the begins with the selection of events also called gatekeeping, where belief system of the general public. an outlet selects or rejects an event for reporting. The selection process is driven by a number of factors, such as the geographical There are numerous events that happen continuously, and origin of the event, the involvement of an elite person or country, any form of bias can arise in numerous possible ways. It is not etc., and requires rigorous monitoring of current affairs to de- possible for any single outlet to capture every event. Thus, an termine the news value. To our knowledge, only a few methods Permission to make digital or hard copies of part or all of this work for personal have been suggested that explicitly attempt to examine this bias. or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this Saez-Trumper et al. [11] attempted to identify bias in online work must be honored. For all other uses, contact the owner /author(s). news sources and social media groups surrounding them. They Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia studied the disparity in the selection of events based on the quan- © 2020 Copyright held by the owner/author(s). 
tity and exclusivity of stories published by 80 mainstream news 33 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Swati and Dunja Mladenić outlets across the globe over a span of two weeks. From the re- 3.2 Dataset view, it is found that there is a weak correlation between the For our experiments, we first selected the top three news outlets quantity and exclusivity of news articles published by the outlets. 3 based on Alexa Global Rankings . We then used the Event Reg- It is also discovered that both the news and social media follow istry API to collect all news events reported in English between the same pattern of selection of events in similar geographical January 2019 and May 2020. We excluded events that were not areas. However, media in the same region often choose the same covered by any of the selected outlets. We ended up with 51, 409 events and publish similar-length posts. events for which we extracted basic information such as event id, title, summary, and source. Since the event coverage by these out- Bourgeois et al. [3] used a matrix factorization method to ex-lets is not uniform, which can be visualized in Figure 1, we used tract latent factors that determine the selection of the event by a stratified split to mimic this imbalance across the generated an outlet. They combined the method with a BPR optimization train-valid-test sets. scheme developed by Rendle et al.[8]. They used the events derived from the GDELT dataset and arranged the outlets in rows and their reported events in columns to form a matrix. Each cell value of the resulting matrix describes the selection/rejection of the event by the outlet. nytimes washingtonpost For the bias analysis, they chose affiliation, ownership, and geographic proximity of the different outlets as the major factors. They suggest that each outlet follows its own latent preferences structure which facilitates the outlet to rank events. They also indiatimes suggested that events should be selected such that the selected list should be diverse and should include a wide range of actively reported events. They thus adopted the method of Maximum Marginal Relevance which facilitates ranking based on the rel- Figure 1: Distribution of event coverage by the outlets. evance and diversity of the events. It is discovered that event selection favors the most discussed topics rather than the unique ones. 4 MATERIALS AND METHODS F. Hamborg et al. [6] uses a matrix similar to the one created 4.1 Problem Modeling by Bourgeois et al.[3] Each cell in the matrix represent the most For an event 𝐸 and its associated pair (𝑇 , 𝑆 ), the task is to generate representative topic of the article reported by one country about a list of outlets 𝑂 expected to cover 𝐸 . Here 𝑇 is the event title the other. By spanning the matrix through outlets and topics in and 𝑆 is a short summary of the event as provided by the Event a region, the bias can be examined. They used a collection of 1.6 Registry. Mathematically, the task can be formulated as, million articles from more than 100 countries over a two-month 1 span from the Europe Media Monitor (EMM) as their dataset. 𝑂 = 𝑓 (𝑇 , 𝑆, 𝛼 ) (1) Authors in [6] aggregates the related articles and then out-where, 𝑓 is the outlet prediction function and 𝛼 denotes the source the task of bias identification to the users, forcing them model parameters. 𝑂 can have a well-thought-out variable length 𝑙 to determine the bias on their own. 
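The stratified train-valid-test split mentioned above can be reproduced along the following lines; the field names and the 60-20-20 ratio are assumptions made for this sketch, since the paper only states that the split mirrors the imbalanced coverage distribution.

from sklearn.model_selection import train_test_split

def stratified_event_split(events, seed=0):
    # `events` is assumed to be a list of dicts with the fields "title",
    # "summary" and "outlets" (the outlets covering the event); the outlet
    # combination serves as the stratification key so that the train,
    # validation and test sets mimic the imbalanced coverage distribution.
    keys = ["|".join(sorted(e["outlets"])) for e in events]
    train, rest, _, rest_keys = train_test_split(
        events, keys, test_size=0.4, stratify=keys, random_state=seed)
    valid, test = train_test_split(
        rest, test_size=0.5, stratify=rest_keys, random_state=seed)
    return train, valid, test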
While the rest of the existing response generated from the list unique outlets 𝑂 . For this work, 𝑙 work analyzes the selection bias, it certainly does not present an |𝑂 | = 3. automated approach suited to the outlet prediction task, unlike our work. 4.2 Methodology We extract feature vectors from 𝑇 and 𝑆 . We fuse them together to 3 DATA DESCRIPTION create a fused vector which is then passed through several layers to finally generate 𝑂 . Figure 2 illustrates the entire prediction 3.1 Raw Data Source process. We further outline these tasks with more details in the Event Registry2 [7] monitors, collects, and provides news arti-following subsections. cles from news outlets around the world. It also aggregates them 4.2.1 Feature Extraction and Fusion. We used Google’s Univer- into clusters that are referred to as events. Each event is then sal Sentence Encoder 4(USE) to extract 128-dimensional feature annotated with several metadata such as unique id to track the ′ ′ ′ ′ vectors 𝑇 and 𝑆 . For feature fusion, we concatenated 𝑇 and 𝑆 event coverage, categories to which it may belong, geographical and applied 𝑡 𝑎𝑛ℎ activation to generate 𝐹 . We then used batch- location, sentiment, etc. As a result, its large-scale temporal cov- normalization to increase the stability of the network and for erage can be used effectively to study the event selection process regularization. of news outlets. ′ ′ 𝐹 = 𝐵𝑁 (𝑡 𝑎𝑛ℎ (𝑇 ⊕ 𝑆 )) (2) In Eq 2, 𝐵 𝑁 and ⊕ represents batch-normalization and concatenation respectively. 1 3 https://ec.europa.eu/knowledge4policy/ https://www.alexa.com/topsites/category/Top/News/Newspapers 2 4 https://eventregistry.org https://tfhub.dev/google/universal- sentence- encoder/ 34 A Machine Learning based approach to outlet prediction Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia USE T Event Title T′ tanh F softmax Batch Norm FC Outlet (Ô) Event S S′ Summary USE Figure 2: Outlet prediction process. 4.2.2 Outlet Prediction. Table 1: Multiple correct predictions. We solve the problem using a multi-label classification model for which we create a separate outlet-index dictionary for outlets 𝐷 = {𝑜 : 1 : 2 : 1 , 𝑜 2 . . . 𝑜 𝑛 }, where 𝑛 𝑛 indiatimes nytimes washingtonpost 𝑙 is the total number of unique outlets in 𝑂 . To predict the list indiatimes washingtonpost nytimes of outlets we pass 𝐹 to the fully-connected layer (FC) having 𝑠𝑜 𝑓 𝑡𝑚𝑎𝑥 activation with 𝑛 output neurons. Since an event can be covered by more than one outlet, we formulate the recursive • Subset Accuracy (𝑎): It measures the percentage of in- prediction procedure as, stances in which all of the outlets are correctly classified. ˆ 𝑜 = P (𝑜 |𝐹 , ˆ 𝑜 + 𝑏 ) (3) 𝑁 𝑖 𝑖 −1, 𝑏 ) = 𝑠𝑜 𝑓 𝑡𝑚𝑎𝑥 (𝐹 𝑤𝑖 𝑖 1 Õ Subset Accuracy (𝑎) = ( ˆ 𝑜 − 𝑜 ) (6) 𝐹 𝑤 +𝑏 𝑖 𝑖 𝑒 𝑖 𝑖 𝑁 = (4) 𝑖 =1 Í𝑛 𝐹 𝑤 +𝑏 𝑒 𝑗 𝑗 𝑗 =1 • Hamming Loss (ℓ): It measures the fraction of the incor- 𝑡 ℎ rectly predicted outlet to the total number of outlets. Since where, ˆ 𝑜 is the probability of selecting the 𝑖 outlet (𝑜 ) given 𝐹 , 𝑖 it is a loss function, its ideal value is 0. bias (𝑏 ), and the set of probabilities of previously predicted outlets ( ˆ 𝑜 ), and 𝑤 is the weight. 
We use categorical cross entropy as 𝑁 𝑖 −1 1 Õ ∩ ˆ 𝑜 𝑜 𝑖 𝑖 the loss function as follows: Hamming Loss (ℓ ) = (7) 𝑁 ˆ 𝑜 ∪ 𝑜 𝑖 𝑖 𝑛 𝑥 𝑖 =1 Õ Õ L (𝑜, ˆ 𝑜 ) = − (𝑜 ∗ log( ˆ 𝑜 )) (5) 𝑖 𝑗 𝑖 𝑗 5.3 Results and Analysis 𝑗 =1 𝑖 =1 Table 2 shows the comparison of our model with the baseline 𝑡 ℎ In Eq (5), for 𝑖 outlet in the output sequence of length 𝑥 , 𝑜𝑖 𝑗 models in terms of subset accuracy and hamming loss. and ˆ 𝑜 denotes the actual and predicted probability of selecting 𝑖 𝑗 𝑡 ℎ the 𝑗 outlet from 𝐷 . Table 2: Comparison between the baseline models and our 4.2.3 Hyper-parameters. 5 We used Categorical accuracy as the proposed model. metrics to calculate the mean accuracy rate for multilabel classi- fication problems across all the predictions. We consider a batch Subset Accuracy Hamming Loss of size 128 and number of epocs as 100 for training. To optimize Uniform 0.140 0.526 the weights during training we use Adam optimizer. Stratified 0.286 0.422 5 EXPERIMENTAL EVALUATION Ours 0.546 0.275 5.1 Baselines Quantitative analysis of the experimental results shows that, We use the following well-known and simplified methods as our our model outperforms the Uniform and Stratified models by a baseline models. margin of 0.41 and 0.26 points for subset accuracy and by 0.25 • Uniform: Generate predictions randomly using a uniform and 0.15 points for hamming loss respectively. The performance distribution. difference is clearly visible in Figure 3. • Stratified: Generates predictions by respecting the class distribution of the training set. The intersection that we find among the different outlet pairs differs considerably as evident in Figure 1. This can be best seen 5.2 Evaluation Metric by assessing the conditional probability of an event covered by an We aim to predict the list of outlets in this work. However, it is outlet given that it is covered by another outlet as listed in Table 3. not necessary to predict the sequence in which outlets appear on For example, we can note that the 𝑃 (𝑤 𝑎𝑠ℎ𝑖𝑛𝑔𝑡𝑜𝑛 |𝑛𝑦𝑡𝑖𝑚𝑒𝑠 ) = this list. This is explained with an example given in Table 1. In 0.492 which is quite high and indicates that 𝑤 𝑎𝑠ℎ𝑖𝑛𝑔𝑡 𝑜𝑛𝑝𝑜𝑠𝑡 tends other cases, a combination of correct and incorrect outlets may to cover most of the events covered by 𝑛𝑦𝑡 𝑖𝑚𝑒𝑠 . It is also inter-be predicted by the model. esting to note that 𝑖𝑛𝑑𝑖𝑎𝑡 𝑖𝑚𝑒𝑠 do not follow 𝑤 𝑎𝑠ℎ𝑖𝑛𝑔𝑡 𝑜𝑛𝑝𝑜𝑠𝑡 or 𝑛𝑦𝑡 𝑖𝑚𝑒𝑠 , and vice versa. We used the following metrics to evaluate the effectiveness of our model where, ˆ 𝑜 is the predicted outlet, 𝑜 is the true outlet, 6 CONCLUSIONS AND FUTURE WORK and 𝑁 is the total number of instances. It is important for a journalist to know which event is worthy 5 https://github.com/keras-team/keras/blob/master/keras/metrics.py enough to be published. Even readers would be interested to know 35 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Swati and Dunja Mladenić Table 3: Conditional probability of an event to be covered by an outlet, provided it is covered by another outlet. P(x|y) nytimes indiatimes washingtonpost nytimes 1.000 0.067 0.364 indiatimes 0.034 1.000 0.023 washingtonpost 0.492 0.063 1.000 [3] Dylan Bourgeois, Jérémie Rappaz, and Karl Aberer. 2018. Selection bias in news coverage: learning it, fighting it. In Companion Proceedings of the The Web Conference 2018, 535–543. [4] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. 
In Advances in Neural Information Processing Systems, 13042–13054. [5] Zihao Fu. 2019. An introduction of deep learning based word representation applied to natural language process- ing. In 2019 International Conference on Machine Learning, Figure 3: Comparison between the baseline models and Big Data and Business Intelligence (MLBDBI). IEEE, 92–104. our proposed model. [6] Felix Hamborg, Norman Meuschke, and Bela Gipp. 2018. Bias-aware news analysis using matrix-based news aggre- gation, 1–19. the outlets that are going to cover the event of their interest. Yet [7] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Gro- it is certainly not an automated approach, therefore in this work, belnik. 2014. Event registry: learning about world events we propose an approach to address the outlet prediction task from news. In Proceedings of the 23rd International Confer- given the event title and description. We also find that even in its ence on World Wide Web, 107–110. simplest form, our model is capable of predicting the outlet. In [8] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, the future, we intend to enhance our proposed model to better and Lars Schmidt-Thieme. 2009. Bpr: bayesian personal- predict the outlets and to work in a cross-lingual setting. We ized ranking from implicit feedback. In Proceedings of the plan to include a few more metadata provided by Event Registry Twenty-Fifth Conference on Uncertainty in Artificial Intelli- (refer Section 3.1) along with Wikipedia concepts. We also plan gence (UAI ’09). AUAI Press, Montreal, Quebec, Canada, to analyze the speed of reporting, time-span, and importance 452–461. isbn: 9780974903958. given to the events by the outlets. In addition, we will also be [9] Sebastian Ruder. 2019. Neural transfer learning for natural looking into how the outlets change their coverage style over language processing. PhD thesis. NUI Galway. time. [10] Sebastian Ruder, Matthew E Peters, Swabha Swayamdipta, ACKNOWLEDGMENTS and Thomas Wolf. 2019. Transfer learning in natural lan- guage processing. In Proceedings of the 2019 Conference of This work was supported by the Slovenian Research Agency and the North American Chapter of the Association for Compu- the European Union’s Horizon 2020 research and innovation tational Linguistics: Tutorials, 15–18. program under the Marie Skłodowska-Curie grant agreement No [11] Diego Saez-Trumper, Carlos Castillo, and Mounia Lalmas. 812997. 2013. Social media news communities: gatekeeping, cov- REFERENCES erage, and statement bias. In Proceedings of the 22nd ACM international conference on Information & Knowledge Man- [1] Brent H Baker, Tim Graham, and Steve Kaminsky. 1994. agement, 1679–1684. How to identify, expose & correct liberal media bias. [12] Rune J Sørensen. 2019. The impact of state television on [2] Matthew Barnidge, Albert C Gunther, Jinha Kim, Yang- voter turnout. British Journal of Political Science, 257–278. sun Hong, Mallory Perryman, Swee Kiat Tay, and Sandra Knisely. 2020. Politically motivated selective exposure and perceived media bias, 82–103. 
36 MultiCOMET – Multilingual Commonsense Description Adrian Mladenic Grobelnik Dunja Mladenic Marko Grobelnik Artificial Intelligence Laboratory Artificial Intelligence Laboratory Artificial Intelligence Laboratory Jozef Stefan Institute Jozef Stefan Institute Jozef Stefan Institute Ljubljana Slovenia Ljubljana Slovenia Ljubljana Slovenia adrian.m.grobelnik@ijs.si dunja.mladenic@ijs.si marko.grobelnik@ijs.si ABSTRACT The main contributions of this paper are (1) a new multilingual approach to annotating natural language sentences with This paper presents an approach to generating multilingual commonsense descriptors, (2) implementation of the proposed commonsense descriptions of sentences provided in natural language. We have expanded on an existing approach to automatic approach that is made publicly available as an online service knowledge base construction in English to work on different MultiCOMET http://multicomet.ijs.si/ (illustrated in Figure 4), (3) languages. The proposed approach has been utilized to develop evaluation of the proposed approach on the Slovenian language. An MultiCOMET, a publicly available online service for generating additional contribution is the publicly available source code [3] multilingual commonsense descriptions. Our experimental results allowing users to train their own models for other natural show that the proposed approach is suitable for generating languages. commonsense description for natural languages with Latin script. Comparing performance on Slovenian sentences to the English The rest of this paper is organized as follows: Section 2 provides a original, we have achieved precision as high as 0.7 for certain types data description. Section 3 describes the problem and the algorithm of descriptors. used. Section 4 exhibits our experimental results. The paper concludes with discussion and directions for the future work in CCS CONCEPTS Section 5. •CCS Information systems Information retrieval Document 2 Data Description representation Content analysis and feature selection KEYWORDS One might say the only way for AI to learn to perform deep learning, commonsense reasoning, multilingual natural commonsense reasoning, is to learn from humans. Following the approach proposed by COMET [1], we used data from the language processing ATOMIC [2] dataset. The ATOMIC dataset consists of over 24,000 sentences containing common phrases manually labelled by 1 Introduction workers on Amazon Turk. For each sentence the workers were As artificial intelligence systems are becoming better at performing asked to assign open-text values to nine descriptors which capture highly specialized tasks, sometimes outperforming humans, they nine if-then relation types to distinguish causes vs. effects, agents are unable to understand a simple children’s fairy tale due to their vs. themes, voluntary vs. involuntary events and actions vs. mental inability to make commonsense inferences from simple events. states [2] as described in ATOMIC. With recent breakthroughs in the area of deep learning and overall The following are the nine descriptors and their explanations: increases in computing power, it has enabled us to model xIntent – Because PersonX wanted… commonsense inferences with deep learning models. In our research, we expand on the approach to automatic generation of xNeed – Before, PersonX needed… commonsense descriptors proposed in COMET [1] by applying their deep learning models to languages other than English. 
xAttr – PersonX is seen as…
xReact – As a result, PersonX feels…
xWant – As a result, PersonX wants…
xEffect – PersonX then…
oReact – As a result, others feel…
oWant – As a result, others want…
oEffect – Others then…

The dataset contains almost 300,000 unique descriptor values for the listed nine descriptors. An example of a labelled sentence is shown in Figure 3.

In order to test the proposed approach, we implemented it for the Slovene language. We have translated the sentences from the ATOMIC dataset to Slovene, keeping the descriptor values in English. The translation was done using Google Cloud's Translation API [4].

The approach presented in COMET tackles automatic commonsense completion with the development of generative models of commonsense knowledge, and commonsense transformers that learn to generate diverse commonsense descriptions in natural language [1]. Our research hypothesis is that the approach proposed by COMET [1] can be expanded to Latin-script languages other than English. To test this claim, we have trained our own deep learning model on the original training data, and another model on the data translated into another natural language.

3 Problem Description and Algorithm

The problem we are solving is predicting the most likely values for each tag in the ATOMIC [2] dataset, given an input sentence in a Latin-script language. Following the proposal in COMET, we address the following problem: given a training knowledge base of natural language tuples in the {s, r, d} format, where s is the sentence, r is the relation type and d represents the relation value, the task is to generate d given s and r as inputs.

Figure 1 depicts our approach to solving this problem. The system takes labelled sentences as input, translates them to the targeted Latin-script language, and trains a deep learning model capable of labelling previously unseen sentences with values for the nine descriptors capturing the nine predefined relation types described in Section 2.

Figure 1: Architecture of the proposed approach
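For illustration, the following minimal Python sketch (not the released MultiCOMET code) shows two auxiliary steps assumed by the pipeline above: translating ATOMIC sentences with Google Cloud's Translation API (Basic) while keeping the descriptor values in English, and the strict top-5 overlap that is used as the precision measure in Section 4. The function names and the example predictions are hypothetical.

# Minimal sketch of the dataset translation step and the strict precision@5 comparison.
from google.cloud import translate_v2 as translate

def translate_sentences(sentences, target="sl"):
    """Translate English ATOMIC sentences; descriptor values are left in English."""
    client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set
    return [client.translate(s, target_language=target)["translatedText"]
            for s in sentences]

def precision_at_5(english_preds, slovene_preds):
    """Strict overlap: a Slovene-model prediction counts only if it matches an
    English-model prediction exactly (near-synonyms do not count)."""
    top_en = set(english_preds[:5])
    top_sl = slovene_preds[:5]
    return sum(1 for p in top_sl if p in top_en) / 5.0

# Hypothetical predictions for one descriptor of one test sentence:
print(precision_at_5(["happy", "satisfied", "proud", "nervous", "calm"],
                     ["happy", "angry", "satisfied", "tired", "scared"]))  # -> 0.4

In the hypothetical example, two of the five Slovene-model predictions match the English top 5 exactly, giving a precision@5 of 0.4; paraphrases are deliberately not counted, which is the strictness referred to in Section 4.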
4 Experimental Results

Prior to training the model, we split the ATOMIC dataset into train, test and development sets identical to those used in COMET [1]. In our evaluation we used 100 sentences from the test set.

Our deep learning models are trained on the ATOMIC [2] dataset. We have trained one model on the original dataset in English, and another model on the dataset automatically translated into Slovene. Both models were trained under the same parameter settings: batch size = 6, iterations = 50,000, maximum number of input features = 50.

To evaluate the performance of the proposed approach, we compared the predictions of the model trained on Slovene sentences with the predictions of the English model. As the performance metric, we took the top 5 predicted values for each descriptor and checked their overlap. By taking the English predictions as the ground truth, we measure the precision of our model by the number of identical descriptor values. Note that we were strict in our comparisons; for instance, "to stay away from people" and "to get away from others" do not count as an overlap.

Experimental results show there is a considerable difference in performance between the nine descriptors. The best performing descriptor was xReact, where precision@5 was 0.716, followed by oReact and oWant with precisions@5 of 0.706 and 0.468, respectively. The worst performing descriptor was xWant, with a precision@5 of 0.210 (see Table 1).

Descriptor    Precision
xIntent       0.324
xNeed         0.352
xAttr         0.438
xReact        0.716
xWant         0.210
xEffect       0.456
oReact        0.706
oWant         0.468
oEffect       0.310
Average       0.442

Table 1: Experimental results on the nine descriptors, showing precision of the top 5 predictions.

The best performing descriptor was xReact (representing the relation: As a result, PersonX feels). This was likely due to the fact that most predicted values were only one word long for both models, making it considerably easier for their predictions to overlap. The worst performing descriptor was xWant (representing the relation: As a result, PersonX wants); this could be attributed to the fact that most predicted values were at least 3–4 words in length, greatly decreasing the likelihood of overlap. Another reason for such low precision could be our strict overlap comparisons.

Table 2 shows the predicted values of one of the worst performing sentences for the xReact descriptor. Note the sentence "PersonX looks PersonY ___ in the face" can refer to "Bob looks Mary slowly in the face" or "Adrian looks Anna kindly in the face" or something else. The columns in Table 2 and Table 3 labelled "Original" show the original English sentence and its predicted descriptor values. The columns labelled "Translated/Predicted" show the sentence translated into Slovene and its predicted descriptor values.

                 Original                                Translated/Predicted
Sentence         PersonX looks PersonY ___ in the face   PersonX izgleda PersonY ___ v obraz
xReact values    nervous                                 satisfied
                 happy                                   happy
                 satisfied                               attractive
                 powerful                                proud
                 confident                               angry

Table 2: One of the worst performing test sentences for xReact

Table 3 shows the predicted values of one of the worst performing sentences for the xWant descriptor. We can see that there are no common predictions between the two models. Note the sentence "PersonX avoids every ___" can refer to "Marko avoids every car on the road" or "Dunja avoids every boring event" or something else.

                 Original                    Translated/Predicted
Sentence         PersonX avoids every ___    PersonX se izogiba vsakemu ___
xWant values     to stay away from people    to get away from others
                 to avoid trouble            to make sure they are ok
                 to stay away                to get away from the situation
                 to not get caught           to be alone
                 to not be noticed           to make a decision

Table 3: One of the worst performing test sentences for xWant

While Tables 2 and 3 show the model's outputs for a single descriptor, Figure 3 shows the full output of the model, given the example sentence "Mojca je pojedla odličen sendvič" (Mary ate an excellent sandwich). Figure 2 shows a close-up of the output of Figure 3. The images in Figures 2 and 3 were taken directly from the interface of our online service MultiCOMET [5].

Figure 3: Full tree of predicted descriptor values generated for an example Slovene sentence

For the sentence "Mojca je pojedla odličen sendvič" (Mary ate an excellent sandwich) depicted in Figures 2 and 3, here is a potential English interpretation of the Slovenian output of the model: Mary was hungry (xAttr) and wanted to eat food (xIntent). To do that, she needed to go to the restaurant (xNeed). At the restaurant, other people were also eating food (oEffect). As a consequence of eating the sandwich, Mary's clothes got dirty (xEffect). Mary feels impressed (xReact) and wants to eat something else (xWant).
The restaurant is grateful (oReact) for Mary’s visit and wants to thank Mary (oWant). The MultiCOMET online service is a publicly available implementation of our proposed approach, shown in Figure 4. At the time of writing, MultiCOMET only supports English and Slovene. Figure 2: Close-up of predicted descriptor values generated for an example Slovene sentence 39 Figure 4: Illustrative example of MultiCOMET after submitting a query “Mary ate a wonderful sandwich.” 5 After testing the proposed multilingual approach on the Slovene Discussion language, we intend to expand our coverage to other Latin script In our research we expanded on an existing monolingual languages including Croatian, Italian and French. approach and proposed a new approach to generating multilingual commonsense descriptions from natural language. ACKNOWLEDGMENTS In order to implement our approach, we built on an existing The research described in this paper was supported by the library, implementing the approach proposed by COMET [1]. Slovenian research agency under the project J2-1736 Causalify Our experimental results show that we are getting meaningful and co-financed by the Republic of Slovenia and the European values for the descriptors. Experimental comparison of the Union under the European Regional Development Fund. The predicted descriptor values of the Slovene and English models operation is carried out under the Operational Programme for the show an average precision of 0.44, given our strict comparison Implementation of the EU Cohesion Policy 2014–2020. methodology. We noted the precision values ranged from 0.716 to 0.210 across different descriptors. REFERENCES Based on our literature review (September 2020), none of the [1] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli articles citing the original COMET [1] paper expanded their Celikyilmaz, Yejin Choi. (2019). COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. Allen Institute for Artificial approach to include other languages. The most similar work we Intelligence, Seattle, WA, USA. Paul G. Allen School of Computer Science found in the literature combining commonsense and & Engineering, Seattle, WA, USA. Microsoft Research, Redmond, WA, USA. [2] Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas multilinguality was [6] where the authors were extending the Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, Yejin Choi. (2019). SemEval Task 4 solution using machine translation. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. Paul G. Allen School of Computer Science & Engineering, University of The possible direction for future work includes improving the Washington, Seattle, USA. Allen Institute for Artificial Intelligence, Seattle, USA. quality of the translated sentences from ATOMIC by manual [3] MultiCOMET GitHub https://github.com/AMGrobelnik/MultiCOMET translation to improve the precision of the models. Another Accessed 31.08.2020 possible direction would be to evaluate the performance of our [4] Google Cloud’s Translation API Basic https://cloud.google.com/translate Accessed 31.08.2020 models on a larger number of sentences to increase the reliability [5] MultiCOMET http://multicomet.ijs.si/ Accessed 31.08.2020 of the results. [6] Josef Jon, Martin Fajcik, Martin Docekal, Pavel Smrz. (2020). BUT-FIT at SemEval-2020 Task 4: Multilingual commonsense. arXiv. 
https://arxiv.org/pdf/2008.07259.pdf

A Slovenian Retweet Network 2018-2020

Bojan Evkoski
Jožef Stefan International Postgraduate School and Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
Bojan.Evkoski@ijs.si

Igor Mozetič, Nikola Ljubešić & Petra Kralj Novak
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
As the popularity of social media has been growing steadily since the beginning of their era, the use of data from these platforms to analyze social phenomena is becoming more and more reliable. In this paper, we use tweets posted over a period of two years (2018-2020) to analyze the socio-political environment in Slovenia. We use network analysis by applying community detection and influence identification on the retweet network, as well as content analysis of tweets by using hashtags and URLs. Our study shows that Slovenian Twitter users are mainly grouped in three major socio-political communities: Left, Center and Right. Although the Left community is the most numerous, the most influential users belong to the Right and Center communities. Finally, we show that different communities prefer different online media to inform themselves, and that they also prioritize topics differently.

Keywords
Complex networks, Twitter, community detection, influencers

1. INTRODUCTION
Since the rise of social networks, their data has been extensively used in social analysis. As the popularity of these platforms continues to grow daily, using them as a proxy to analyze specific phenomena is becoming more and more reliable. Their popularity, accessibility and availability have made them the go-to way to share one's opinion, support another and even get in conflict with an opposing one. Recently, with the advances in targeted advertising, social media became the most important cultural and political battlefront.

In this paper, the country of interest is Slovenia and the proxy is Twitter data. By following the methodology developed in [3, 2, 4, 8], we address the following questions:

• Are there groups of densely connected Twitter users in the Slovenian retweet network 2018-2020?
• Who are the leading influencers in these groups?
• What is the content of the tweets in these groups and how much does it overlap?

This paper is organised as follows. In Section 2, the data acquisition process and the collected Twitter data are presented. Section 3 discusses the communities in the retweet network and their properties. Section 4 covers the notion of influencers and identifies the main influencers in the Slovenian retweet network. Section 5 investigates the content of the tweets in terms of hashtags and URLs. We draw conclusions in Section 6.

2. DATA
We acquired 5,147,970 tweets in the period from January 2018 to January 2020 with the TweetCat tool [6], built specifically for collecting Twitter data written in "smaller" languages. The tool identifies users tweeting in the focus language by searching for the most common words in that language through the Twitter Search API, and collects these users' tweets through the whole data collection period. On average, the dataset contains around 8,000 tweets per day, with the three highest volume peaks on March 13, 2018 (11,556 tweets, the resignation of Slovenia's PM, Miro Cerar), June 1, 2018 (13,506 tweets, the last day of the 2018 Slovenian parliamentary elections campaign), and May 9, 2019 (12,381 tweets, the Eurovision semi-final in which Slovenia had a successful run). The variation of the daily volume of tweets is affected by many phenomena, but the most evident are: a weekly seasonality with high volumes on working days and low volumes on weekends, extraordinary periods for the country (e.g. the 2018 Slovenian parliamentary elections campaign, boosting average daily tweets by around 2,000), and holidays (e.g. the 2018 and 2019 Easters as local minima with 5,174 and 4,887 tweets, respectively).
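As an illustration of the volume statistics above, the daily counts and the weekly seasonality can be obtained from the collected tweets with a few lines of pandas; the tiny stand-in data and column name below are assumptions, not the actual TweetCat output format.

import pandas as pd

# Hypothetical stand-in for the collected tweets: one row per tweet with its timestamp.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2018-03-13 10:00", "2018-03-13 11:30", "2018-03-14 09:15", "2018-06-01 20:00"]
    )
})

# Number of tweets per calendar day.
daily = tweets.set_index("created_at").resample("D").size()
print("average tweets per day:", daily.mean())
print("top volume days:")
print(daily.sort_values(ascending=False).head(3))

# Weekly seasonality: average daily volume per weekday (0 = Monday ... 6 = Sunday).
print(daily.groupby(daily.index.dayofweek).mean())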
3. COMMUNITY DETECTION
We used the collected tweets to construct a retweet network for the purpose of community detection. A retweet network is a directed weighted graph, where nodes represent Twitter users and edges represent the retweet relations. An edge from node (user) A to node B exists if B retweeted A at least once, indicating that the information spread from A to B, or that A influenced B. Note that retweeting a retweet is actually retweeting the original tweet (source), thus ignoring all intermediate retweets. The weight of an edge is the number of times user B retweeted user A. We removed all self-retweets, since they did not provide additional information for community and influence detection. Consequently, we formed a network with 10,876 users (94% of all users) and 1,576,792 retweets (92% of all retweets).

This network can be simplified if the direction of the edges is ignored, meaning that two users are linked if one retweets the other, while the source and destination are irrelevant. It turns out that such undirected retweet graphs between Twitter users are useful to detect communities of like-minded users who typically share common views on specific topics.

Figure 1: The Slovenian retweet network (2018-2020) colored according to the detected communities, with shares of the total number of users. The label size of a node corresponds to the number of unique users that retweeted it. Only nodes with at least 700 unique retweeters are included.

In complex networks, a community is defined as a subset of nodes that are more closely connected to each other than to other nodes. For the purpose of this paper, we apply a standard algorithm for community detection, the Louvain method [1]. The method partitions the nodes into communities by maximizing modularity, which measures the difference between the actual fraction of edges within the community and the fraction expected in a randomized graph with the same degree sequence [7]. Modularity values range from −0.5 to 1.0, where a value of 0.0 indicates that the edges are randomly distributed, and larger values indicate a higher community density.

We ran the Louvain method (resolution = 1.05) on our undirected retweet network, resulting in 183 communities with a modularity value of 0.382, which indicates a strong connectedness within communities.
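A minimal sketch of the construction and community detection steps described above, using networkx and the python-louvain package; the input format and variable names are assumptions, and the snippet is illustrative rather than the code used for the paper.

import networkx as nx
import community as community_louvain  # python-louvain package

def build_retweet_graph(retweets):
    """retweets: iterable of (retweeter, original_author) pairs."""
    g = nx.DiGraph()
    for retweeter, author in retweets:
        if retweeter == author:          # drop self-retweets
            continue
        # Edge A -> B means that B retweeted A; the weight counts the retweets.
        if g.has_edge(author, retweeter):
            g[author][retweeter]["weight"] += 1
        else:
            g.add_edge(author, retweeter, weight=1)
    return g

# Tiny hypothetical input; in the paper the pairs come from roughly 1.6M retweets.
pairs = [("userB", "userA"), ("userB", "userA"), ("userC", "userA"), ("userC", "userB")]
g = build_retweet_graph(pairs)

# Direction is ignored for community detection (this simple sketch does not merge
# the weights of reciprocal edges).
u = g.to_undirected()
partition = community_louvain.best_partition(u, weight="weight", resolution=1.05)
print("communities:", len(set(partition.values())))
print("modularity:", community_louvain.modularity(partition, u, weight="weight"))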
Only the three largest communities each have more than 5% of all users, while combined they contain 85% of all users. The three main detected communities are presented in Fig. 1. We observe the following:

• The three largest communities are labeled as Left, Center and Right, with 55%, 20% and 10% as their respective shares of all users. The labeling of the communities does not necessarily represent their political orientation.
• The Left community, even though the largest, contains the smallest number of users with more than 700 unique retweeters.
• The Left community is well separated from the Center and the Right communities, which are more tightly interlinked.

We performed an exploratory data analysis and calculated the community properties presented in Table 1, to compare the communities. Most of the properties are normalized by the user to ease the comparison between communities.

• Nodes – unique users count
• Central user – user with most retweets
• Central user retweets – times the central user is retweeted
• Central user retweeters – unique users retweeting the central user
• HHI (n = 50) – Herfindahl–Hirschman index [9] measures the distribution of influence of the top n influential users. A higher value reflects community influence concentrated in only a few influential users, while a lower value indicates a more dispersed and balanced influence distribution.
• Edges in/node – edges remaining in the community per user (source and destination in the same community)
• Edges out/node – edges going out of the community per user (destination in a different community)
• Weighted edges in/node – weighted edges remaining in the community per user
• Weighted edges out/node – weighted edges going out of the community per user
• Out/In ratio – "Edges out" divided by "Edges in"
• Weighted out/in ratio – "Weighted edges out" divided by "Weighted edges in"

Table 1: Community properties
                            Left        Center       Right
Nodes                       7,030       1,223        2,519
Central user                vecer       BojanPozar   JJansaSDS
Central user retweets       10,398      31,432       50,688
Central user retweeters     973         1,325        1,242
HHI (n = 50)                0.031       0.066        0.042
Edges in/node               19.32       14.53        69.30
Edges out/node              4.47        37.11        13.19
Weighted edges in/node      52.91       83.68        308.33
Weighted edges out/node     6.95        119.42       36.14
Out/In ratio                0.23        2.55         0.19
Weighted Out/In ratio       0.13        1.43         0.12

4. INFLUENCERS
We use two simple but powerful metrics to detect influencers in the retweet network: the weighted out-degree and the Hirsch index (h-index) [5]. Both metrics are calculated from the number of retweets and are thus known as retweet influence metrics, indicating the ability of a user to post content of interest to others. Weighted out-degree is simply the total number of retweets of a particular user, while the h-index is an author-level bibliometric indicator that measures the scientific output of a scholar by quantifying both the number of publications (i.e., productivity) and the number of citations per publication.

Figure 2: Weighted out-degree (total retweets) and h-index comparison. Both charts include the top 25 most influential Slovenian Twitter users according to their respective metric. Bar colors represent the community of a user. Triangles point to users exclusive to one of the charts.

For domain URLs, we filtered the 2,297,008 tweets which contain a URL. Then, we extracted the domain part of the URLs and removed the domains with no specific meaning for Slovenia's content analysis (e.g. social networks: twitter.com, facebook.com, instagram.com, etc., and URL shorteners: ift.tt, bit.ly, ow.ly, etc.). This results in 512,308 tweets (approximately 22% of all the tweets with links). The most frequently occurring domains are owned by Slovenian media, with nova24tv.si, rtvslo.si and delo.si as the top three URL domains with 23,879, 20,210 and 17,360 occurrences, respectively. If instead of the total number of occurrences we count only the unique number of users which posted a domain URL, the top three domains are rtvslo.si, siol.net and delo.si with 2,802, 2,193 and 2,186 unique users, respectively.

For the hashtag analysis, we filtered only the tweets which contain a hashtag, ending up with 701,266 tweets.
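As a side note, the two influence-related quantities used above (the HHI reported in Table 1 and the retweet h-index introduced in Section 4) can be computed as in the following sketch. Taking the shares within the top-n most retweeted users is one possible reading of HHI (n = 50); the exact normalization is not spelled out in the text, so it is an assumption here.

def hhi(user_retweet_totals, n=50):
    """Herfindahl-Hirschman index over the retweet shares of the top-n users."""
    top = sorted(user_retweet_totals, reverse=True)[:n]
    total = sum(top)
    return sum((x / total) ** 2 for x in top) if total else 0.0

def h_index(retweets_per_tweet):
    """A user has h-index h if h of their tweets were each retweeted at least h times."""
    ranked = sorted(retweets_per_tweet, reverse=True)
    return max((min(rt, i) for i, rt in enumerate(ranked, start=1)), default=0)

# Hypothetical user whose tweets were retweeted 10, 6, 5, 4 and 1 times:
# h-index is 4, weighted out-degree (total retweets) is 26.
print(h_index([10, 6, 5, 4, 1]), sum([10, 6, 5, 4, 1]))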
The top three (i.e., citation impact). Adapted to a Twitter network, it hashtags are the following: #volitve2018 (the 2018 Slove- would be described as: a user with an index of h has posted nian parliamentary elections), #plts (the Slovenian First h tweets and each of them was retweeted at least h times. Football League) and #sdszate (Slovenian Democratic Party hashtag, meaning: SDS for you) with 9,845, 9,318 and 7,308 Let RT be the function indicating the number of retweets occurrences respectively. If we count only the unique num- for each original tweet. The values of RT are ordered in ber of users using a particular hashtag, the results for the decreasing order, from the largest to the lowest, while i in- top three Slovenian hashtags are as follows: #volitve2018 dicates the ranking position in the ordered list. The h-index with 2,473, #slovenija with 1,611 and #fakenews with 1,343 is then defined as follows: users. h-index(RT) = max min(RT(i), i) To see these results in the context of communities, we look at i the tweets authored by members of the three largest commu- The top 25 most influential users by weighted out-degree and nities, resulting in 84% of the tweets with relevant domain h-index are shown in Fig. 2. The two metrics provide fairly URLs and 83% of the tweets with relevant hashtags. We similar results (they differ only in 9 users). Both results summed the domain URL counts, while grouping them by confirm the already visible phenomena from the previous the community in which their user belongs. We applied the observations: The Right community has the most influential same procedure to the hashtags. Finally, we filtered the top users, while the Left community, even though the biggest, eight domain URLs and hashtags for each community and does not have nearly as popular users as the ones from the put them on a single Sankey diagram in Fig. 3. Even though other two communities. overlaps exist, the most popular hashtags and media very much differ from community to community, meaning that 5. CONTENT ANALYSIS all three main communities prioritize topics differently and We refer to content analysis in terms of getting knowledge they inform themselves via different media. from the text of the tweets. In this paper, we perform two kinds of content analysis: domain URLs and hashtags. 43 Figure 3: A Sankey diagram depicts the use of the eight most common hashtags (left-hand side) and URLs (right-hand side) by the three largest detected communities. 6. CONCLUSIONS Parliament: Roll-call votes and Twitter activities. PLoS In this paper we explored the Slovenian twitter network from ONE, 11(11):e0166586, 2016. January 2018 until January 2020. We applied community [3] D. Cherepnalkoski and I. Mozetič. Retweet networks of detection, identifying three main communities: Left, Center the European Parliament: Evaluation of the community and Right. We identified the most influential and the central structure. Applied Network Science, 1(1):2, 2016. users of each community by calculating the weighted out- [4] M. Grčar, D. Cherepnalkoski, I. Mozetič, and P. Kralj degree and the h-index of the nodes. We used the Herfind- Novak. Stance and influence of Twitter users regarding ahl–Hirschman index to estimate the distribution of influ- the Brexit referendum. Computational Social Networks, ence within the top communities in the network. Finally, by 4(1):6, 2017. analysis of hashtags and URL domains in tweets, we discov- [5] J. E. Hirsch. 
An index to quantify an individual’s ered the most popular topics for Slovenians as well as the scientific research output. Proceedings of the National most referred Slovenian media on Twitter. We showed that Academy of Sciences, pages 16569–16572, 2005. users from different communities prioritize different topics [6] N. Ljubešić, D. Fišer, and T. Erjavec. TweetCaT: a and use different media to inform themselves. tool for building Twitter corpora of smaller languages. In Proceedings of the Ninth International Conference on 7. ACKNOWLEDGMENTS Language Resources and Evaluation (LREC’14), pages The authors acknowledge financial support from the Slove- 2279–2283, Reykjavik, Iceland, May 2014. European nian Research Agency (research core funding no. P2-103 Language Resources Association (ELRA). and P6-0411), and the European Union’s Rights, Equality [7] M. E. J. Newman. Modularity and community and Citizenship Programme (2014-2020) project IMSyPP structure in networks. Proceedings of the National (Innovative Monitoring Systems and Prevention Policies of Academy of Sciences, 103(23):8577–8582, 2006. Online Hate Speech, grant no. 875263). [8] P. K. Novak, L. D. Amicis, and I. Mozetič. Impact investing market on twitter: influential users and 8. REFERENCES communities. Applied Network Science, 3(1):40, 2018. [1] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and [9] G. J. Werden. Using the Herfindahl–Hirschman index. E. Lefebvre. Fast unfolding of communities in large In L. Phlips, editor, Applied Industrial Economics, networks. Journal of Statistical Mechanics: Theory and number 2, pages 368–374. Cambridge University Press, Experiment, 2008(10):P10008, 2008. 1998. [2] D. Cherepnalkoski, A. Karpf, I. Mozetič, and M. Grčar. Cohesion and coalition formation in the European 44 Toward improved semantic annotation of food and nutrition data Lidija Jovanovska Panče Panov Jožef Stefan International Postgraduate School & Jožef Stefan Institute & Jožef Stefan Institute Jožef Stefan International Postgraduate School Ljubljana, Slovenia Ljubljana, Slovenia lidija.jovanovska@ijs.si pance.panov@ijs.si ABSTRACT repository without which there is a great difficulty in achieving This paper aims to provide a critical overview of the state-of-the- cross-cultural and expert consensus. 1 art vocabularies used for semantic annotation of databases and In this paper, we will briefly go through the fundamental datasets in the domain of food and nutrition. These vocabularies components of the Semantic Web technologies, as well as the are commonly used as a backbone for creating metadata that is standards for the development of high-level KOS (Section 2). Next, usually used in search. Furthermore, the paper aims to provide a we provide a critical overview of the most significant semantic summary of ICT technologies used for storing food and nutrition resources in the domain of food and nutrition (Section 3). Finally, datasets and searching digital repositories of such datasets. Fi-we present a proposal for the design and implementation of a nally, the results of the paper will provide a roadmap for moving broad ontology that would allow us to harmonize and integrate towards FAIR (findable, accessible, interoperable, and reusable) reference vocabularies and ontologies from different sub-areas food and nutrition datasets, which can then be used in various of food and nutrition (Section 4). AI tasks. 
2 BACKGROUND KEYWORDS The goal of the Semantic Web is to make Internet data machine- ontologies, semantic technologies, data mining, food and nutri- readable by enhancing web pages with semantic annotations. tion Linked data is built upon standard web technologies, also in- cluding semantic web technologies in its technology stack [11]. Resource Description Framework (RDF) allows the represen- 1 INTRODUCTION tation of relationships between entities using a simple subject- Today more than ever before in history, we live in an age of predicate-object format known as a triple. The triples form an information-driven science. Vast amounts of information are be- RDF database — called a triplestore — which can be populated ing produced daily as a result of new types of high-throughput with RDF facts about some domain of interest. RDF Schema technology in all walks of life. Consequently, the quantity of (RDFS) was developed immediately after the appearance of RDF available scientific information is becoming overwhelming and as a set of mechanisms for describing groups of related resources without its proper organization, we would not be able to maxi- and the relationships between them. Simple Protocol and RDF mize the knowledge we harvest from it. Namely, research groups Query Language (SPARQL) is the query language for querying carry out their research in different ways, with specific and pos- RDF triples stored in RDF triplestores. sibly incompatible terminologies, formats, and computer tech- The Web Ontology Language (OWL) is based on Descrip- nologies. To tackle these issues, researchers have developed high- tion Logics, a family of logics that are expressively weaker than level knowledge organization systems (KOS), such as ontologies, First Order Logic, but enjoy certain computational properties ad- which constitute the core of the semantic web stack. Throughout vantageous for purposes such as ontology-based reasoning and the years, an abundance of ontologies has been developed and data validation. Most of the ontologies used today are represented released, slowly expanding from the biomedical sciences to the in the OWL format. fields of information science, machine learning, as well as the All the semantic technologies operate on top of various KOS. A domain of food and nutrition science. KOS is intended to encompass all types of schemes for organizing There is an old, yet simple saying which goes: “You are what information and promoting knowledge management [7]. One you eat”. As the world becomes more globalized and food pro-example of a KOS is a thesaurus as a structured, normalized, and duction grows massively, it is becoming increasingly difficult to dynamic vocabulary designed to cover the terminology of a field track the farm-to-fork food path. In the last few decades, digital of specific knowledge. It is most commonly used for indexing technology has been profoundly affecting many health and eco- and retrieving information in a natural language in a system nomic aspects of food production, distribution, and consumption. of controlled terms. When looking at the expressiveness of a Issues regarding food safety, security, authenticity as well as con- KOS, a thesaurus is on the lower side of the scale. On the other flicts arising from biocultural trademark protection are issues side, ontologies enjoy greater expressiveness than thesauri due to that were further enhanced by the lack of a centralized food data the inclusion of description logics. 
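As a small illustration of the triple/triplestore/SPARQL terminology introduced above, the following Python sketch uses the rdflib library with a made-up namespace; the IRIs are hypothetical and do not come from any of the vocabularies discussed in the next section.

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/fns#")   # hypothetical namespace, not a real vocabulary
g = Graph()                                  # an in-memory triplestore

# Triples in subject-predicate-object form: a dataset annotated with a food concept.
g.add((EX.MilkDataset, RDF.type, EX.Dataset))
g.add((EX.MilkDataset, EX.about, EX.Milk))
g.add((EX.MilkDataset, EX.title, Literal("Isotopic measurements of milk samples")))

# A SPARQL query over the triplestore: which datasets are about milk?
q = """
PREFIX ex: <http://example.org/fns#>
SELECT ?d WHERE { ?d ex:about ex:Milk . }
"""
for row in g.query(q):
    print(row.d)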
Arp, Smith, and Spear define the term ontology as “A representation artifact, comprising a Permission to make digital or hard copies of part or all of this work for personal taxonomy as proper part, whose representations are intended to or classroom use is granted without fee provided that copies are not made or designate some combination of universals, defined classes, and distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this certain relations between them” [1]. work must be honored. For all other uses, contact the owner/author(s). Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia © 2020 Copyright held by the owner/author(s). 1https://www.nature.com/scitable/knowledge/library/food-safety-and-food- security-68168348/, accessed 22/04/2020 45 Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia Jovanovska and Panov The Open Biomedical Ontologies (OBO) Foundry applies the of more sophisticated ontologies, such as FoodOn. Even though key principles that ontologies should be open, orthogonal, instan- the OBO Foundry principles apply only to ontologies, we can tiated in a well-specified syntax, and designed to share a common use the more general ones as evaluation criteria for the LanguaL space of identifiers. Open means that the ontologies should be thesaurus. For instance, as previously mentioned, the thesaurus is available for use without any constraint or license and also recep- open, made available in an accepted concrete syntax, versioning tive to modifications proposed by the community. Orthogonal is ensured, textual definitions are available for all the terms and means that they ensure the additivity of annotations and compli- a sufficient amount of documentation is provided. ance with modular development. The proper and well-specified syntax is expected to support algorithmic processing and the FoodOn [4] is an open-source, comprehensive ontology com-common system of identifiers enables backward compatibility posed of term hierarchy facets that cover basic raw food source with legacy annotations as the ontologies evolve [17]. ingredients, process terms for packaging, cooking, and preser- The FAIR guiding principles for scientific data management vation, and different product type schemes under which food and stewardship were conceived to serve as guidelines for those products can be categorized. FoodOn is applicable in several use- who wish to enhance the reusability and invaluableness of their cases, such as personalized foods and health, foodborne pathogen data holdings [19]. The power of these principles lies in the fact surveillance and investigations, food traceability and food webs, that they are simple and minimalistic in design and as such can be and sustainability. FoodOn echoes most of LanguaL’s plant and adapted to various application scenarios. Findability ensures that animal part descriptors —– both anatomical (arm, organ, meat, a globally unique and persistent identifier is assigned to the data seed) and fluid (blood, milk) —– but reuses existing Uberon [12] and the metadata which describes the data. Accessibility ensures and Plant Ontology [10] term identifiers for them. Multiple com-that the data and the metadata can be retrieved by their identifier ponent foods are more challenging because LanguaL provides using a standardized communications protocol. Interoperability no facility for giving identifiers to such products. 
ensures that data, as well as metadata, use a formal, accessible, Building on top of this, FoodOn allows food product terms like and shared language for knowledge representation. Reusability lasagna noodle to be defined directly in the ontology, and allows ensures that data and metadata are accurately described, released them to reference component products through various relations with a clear and accessible license, have detailed provenance, and which do not exist in LanguaL, such as: "has ingredient", "has meet domain-relevant community standards. part", "composed primarily of". As a suggestion, these relations can all be represented with a single relation "has ingredient" and 3 CRITICAL OVERVIEW OF FOOD AND the quantity can be expressed explicitly when annotating the NUTRITION SEMANTIC RESOURCES objects. All of the ontology terms have unique identifiers and In this section, we provide a critical overview of the most relevant the ontology is accessible and can be searched via The European KOS in the field of food and nutrition. We start by describing Bioinformatics Institute (EMBL-EBI) and its Ontology Lookup LanguaL [8], a thesaurus that serves as a foundation for most of Service (OLS).3 The ontology itself is open-source and is a mem-the ontologies in this domain. We are more focused on analyzing ber of the OBO Foundry. It also includes the upper-level Basic ontologies which belong to different sub-spheres of the food and Formal Ontology (BFO) [1]. The adherence to BFO proves useful nutrition domain. Namely, FoodOn [4], as a more general food in the case of aligning ontologies covering different domains description ontology, ONS [18], relevant in the field of nutritional because they share the same top-level. studies and ISO-Food [6], relevant in the field of annotating isotopic data acquired from food samples. ONS [18] is the first systematic effort to provide a solid and extensible ontology framework for nutritional studies. ONS was built to fill the gap between the description of nutrition-based LanguaL [8] is a thesaurus used for describing, capturing, and retrieving data about food. Since 1996, it has been used to index prevention of disease and the understanding of the complex im- numerous European Union (EU) and US agency databases, among pact nutrition has on health. Its structure consists of 3334 terms which, the US Department of Agriculture (USDA) Nutrient Data- imported from already existing ontologies and 100 newly de- base for Standard Reference and 30 European Food Information fined terms. The usability of ONS was tested in two scenarios: Resource (EuroFIR) databases. Food ingredients are represented an observational study, which aims at developing novel and af- with indexing terms, preferably in the form of a noun or a phrase. fordable nutritious foods to optimize the diet and reduce the risk The thesaurus also includes precombined terms which are food of diet-related diseases among groups at risk of poverty, and product names to which facet terms have been assigned. There an intervention study represented by the impact of increasing are 4 main facets in LanguaL: A (Product Type), B (Food Source), doses of flavonoid-rich and flavonoid-poor fruit and vegetables C (Part of Plant or Animal), and E (Physical State, Shape, or Form). on cardiovascular risk factors in an “at risk” group study. 
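The suggestion above of using a single "has ingredient" relation and stating the quantity explicitly on the annotation could look roughly as follows in RDF; the IRIs and property names are hypothetical placeholders, not actual FoodOn or LanguaL identifiers.

from rdflib import BNode, Graph, Literal, Namespace, XSD

EX = Namespace("http://example.org/food#")   # hypothetical IRIs, not FoodOn terms
g = Graph()

# "lasagna has ingredient lasagna noodle", with the quantity made explicit on the
# annotation node instead of using separate relations such as "composed primarily of".
portion = BNode()
g.add((EX.Lasagna, EX.hasIngredient, portion))
g.add((portion, EX.ingredient, EX.LasagnaNoodle))
g.add((portion, EX.amountPercent, Literal(40, datatype=XSD.integer)))

print(g.serialize(format="turtle"))

Modelling the ingredient link as a node of its own is what allows the percentage, or any other quantity, to be attached to that particular product-ingredient pair rather than to either food term.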
Other food product description facets include chemical additive, The development of ONS followed FAIR principles and as a preservation or cooking process, packaging, and standard na- result, it has been published in the FAIR-sharing database.4 Be-tional and international upper-level product type schemes. fore defining new terms, the developers of ONS have ensured The LanguaL thesaurus complies with the FAIR guidelines. that they are not yet defined, with the use of the ONTOBEE web The completeness of LanguaL’s indexing is to a large extent service. Terms that were already defined were imported using the assured by the Langual Food Product Indexing (FPI) software, ontology reuse service — ONTOFOX [20]. In compliance with which verifies that all facets have been indexed for each food the OBO Foundry principles, the ONS has been developed to be in the list [8]. It is available online2 and can be queried using a interoperable with other ontologies, as it has been formalized food descriptor or synonym. Its interoperability and reusability are eminent as it represents a cornerstone in the development 3https://www.ebi.ac.uk/ols/ontologies/FoodOn, accessed 22/04/2020 2https://www.langual.org, accessed 22/04/2020 4https://fairsharing.org/bsg-s001068/, accessed 22/04/2020 46 Toward improved semantic annotation of food and nutrition data Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia using the latest OWL 2 Web Ontology Language and RDF speci- fications and edited using Protégé [13] and the Hermit reasoner for consistency checking. It is also accessible, under the Creative Commons license (CC BY 4.0), published on GitHub and at NCBO BioPortal. Moreover, this ensured the adoption of a well-defined and widely adopted structure for the top and mid-level classes and principally the adherence to BFO as upper-level ontology. ISO-Food is an ontology that was conceived to aid with the or- ganization, harmonization, and knowledge extraction of datasets containing information about isotopes, that represent variants of a particular chemical element which differ in neutron number. To develop this ontology a mixed approach was used, a combination of both expert knowledge-driven (bottom-up) and data-driven (top-down) methods. Its main classes include Isotope, Sample, Location, Measurement, Article. The main class Isotope is con- nected to the rest of the classes with respective relations. The Food and Nutrient classes are linked to the RICHFIELDS ontology [5]. The ontology was further applied in a study for describing isotopic data, to annotate a data sample that consists of isotopic measurements of milk and potato samples. The ISO-Food ontology can be accessed online via the Bio- Portal repository of biomedical ontologies.5 It reuses terms from several ontologies, such as the concept Unit from the Units of Measurements Ontology (UO), the classes Food and Component from the RICHFIELDS ontology [5], the class Document from Figure 1: Diagram representing the alignment of the pro-the Bibliographic Ontology (BIBO) [3]. posed ontology with the identified relevant upper-level and domain ontologies. 4 PROPOSAL Ontologies for data mining. To provide a suitable formalized representation of the outcomes of the research in the food and domain of food and nutrition (see Figure 1). In this way, we can nutrition domain, as well as to suggest new ways to extract knowl- also use the benefits of cross-domain reasoning. 
Since FoodOn, edge from the ever-abundant data produced in this field, we turn ONS, and OntoDM all use BFO as a main top-level ontology, they to ontologies that are used to formally represent the data analysis speak the same general language and are consequently, easier to process. More specifically, we focus on the align. OntoDM ontology, which provides a unified framework for representing data mining entities. It consists of three modular ontologies: Towards the FNS Harmony ontology. In the context of the OntoDM-core [15] which represents core data mining entities, such as datasets, H2020 project FNS Cloud6 (food, nutrition, security) the goal is to data mining tasks, algorithms, models and patterns, develop an infrastructure and services to exploit food, nutrition OntoDT [16] — a generic ontology of datatypes, and and security data (data, knowledge, tools – resources) for a range OntoDM-KDD [14] which describes the process of knowledge discovery. of purposes. To support the different functionalities required by The ontology defines top-level concepts in data mining and the cloud platform, we started with the development of the FNS- machine learning, such as data mining task, algorithm, and their Harmony (FNS-H). The application ontology would allow us to generalizations, which denote the outputs of applying an imple- harmonize and integrate the different reference vocabularies and mentation of an algorithm on a particular dataset. Starting with ontologies from different sub-areas of food and nutrition, as well these general concepts, OntoDM also defines the components of as ontologies representing the domain of data analysis. the algorithms, such as distance and kernel functions, and other features they may contain. From the input and output data per- Initial ontology development. The development of FNS-H, spective, in this ontology, there is a hierarchical representation which is intended to bridge the gap between the field of data of data, from general concepts such as dataset to more specific analysis and food and nutrition will be guided by common best concepts regarding its structure, such as the number of features, practice principles for ontology development. The aim is to max- their role in a given task, concluding with the datatype of each imize the reuse of available ontology resources and simultane- attribute. These properties of OntoDM provide a complete formal ously follow the Minimum Information to Reference an External representation of the data mining process from beginning to end. Ontology Term (MIREOT) principles [2]. In the first phase, we will integrate the FoodOn ontology and the ONS ontology with the OntoDM suite of ontologies. With this integration, we will Combining orthogonal domain ontologies. 
Our goal is to align the selected ontologies in the domain of food and nutrition be able to (1) define domain-specific data types for the domain with the OntoDM ontology of data mining to improve the se- of food and nutrition by extending OntoDT generic data types; mantic annotation of the food and nutrition domain datasets, as (2) define food and nutrition analysis pipelines for the domain well as to formally represent data analysis tasks performed in the of food and nutrition by extending OntoDM-core, and (3) define 5http://bioportal.bioontology.org/ontologies/ISO-FOOD, accessed 22/04/2020 6https://www.fns-cloud.eu/ 47 Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia Jovanovska and Panov food and nutrition knowledge discovery scenarios by extending [3] Bojana Dimić Surla, Milan Segedinac, and Dragan Ivanović. OntoDM-KDD ontology. 2012. A bibo ontology extension for evaluation of scien- The development of the ontology already started in a top- tific research results. In Proceedings of the Fifth Balkan down fashion, it is expressed in OWL2 and being developed using Conference in Informatics, 275–278. the Protégé ontology development tool. Aspiring to maximize [4] Damion M Dooley, Emma J Griffiths, and Gurinder S Gosal accessibility, the ontology will be available for access on a GitHub et al. 2018. Foodon: a harmonized food ontology to in- repository, 7 as well as via BioPortal. In the current stage of crease global food traceability, quality control and data development, an initial set of higher-level domain terms, data integration. npj Science of Food, 2, 1, 1–10. types, data formats, data provenance metadata, lists of external [5] Tome Eftimov, Gordana Ispirova, and Peter Korosec et al. ontologies and vocabularies were extracted from the literature 2018. The richfields framework for semantic interoperabil- and FNS-Cloud project documents. ity of food information across heterogenous information In the next steps, we will first align the extracted terms with systems. In KDIR, 313–320. the BFO ontology and then integrate them with domain terms [6] Tome Eftimov, Gordana Ispirova, and Doris Potočnik. 2019. from the domain ontologies based on BFO, such asFoodOn, and Iso-food ontology: a formal representation of the knowl- ONS, at the first instance, as well as with the OntoDM set of edge within the domain of isotopes for food science. Food ontologies. Other potentially relevant ontologies include the On- chemistry, 277, 382–390. tology for Biomedical Investigations (OBI), Ontology of Biologi- [7] Heather Hedden. 2016. The accidental taxonomist. Infor- cal and Clinical Statistics (OBSC), Ontology of Chemical Entities mation Today, Inc. of Biological Interest (ChEBI), Ontology of Statistical Methods [8] Jayne D Ireland and A Møller. 2010. Langual food descrip- (STATO), and others. To achieve integration of different ontolog- tion: a learning process. European journal of clinical nutri- ical resources, we will use the ROBOT tool [9] that supports the tion, 64, 3, S44–S48. automation of a large number of ontology development tasks and [9] Rebecca C Jackson, James P Balhoff, and Eric Douglass. helps developers to efficiently produce high-quality ontologies. 2019. Robot: a tool for automating ontology workflows. BMC bioinformatics, 20, 1, 407. 5 CONCLUSION [10] Pankaj Jaiswal, Shulamit Avraham, and Katica Ilic et al. In this paper, we provided an overview of the most relevant 2005. 
Plant ontology (po): a controlled vocabulary of plant knowledge organization systems in the domain of food and nu- structures and growth stages. Comparative and functional trition. We started with the LanguaL food thesaurus that served genomics, 6, 7-8, 388–397. as a foundation for the development of the more sophisticated [11] Brian Matthews. 2005. Semantic web technologies. E-learning, ontologies — FoodOn, used for a multi-faceted description of 6, 6, 8. various foods; ONS, used for observational and interventional [12] Christopher J Mungall, Carlo Torniai, and Georgios V Gk- nutrition studies; ISO-Food for the studies of isotopic data in outos et al. 2012. Uberon, an integrative multi-species foods. Next, we assessed the selected vocabularies with respect anatomy ontology. Genome biology, 13, 1, R5. to the FAIR principles and OBO Foundry guidelines for scien- [13] Mark A Musen. 2015. The protégé project: a look back and tific data management. All of the selected vocabularies showed a look forward. AI matters, 1, 4, 4–12. compliance with these accomplishment criteria, with only minor [14] Panče Panov, Larisa Soldatova, and Sašo Džeroski. 2013. suggestions for improvement provided from our side. Finally, in Ontodm-kdd: ontology for representing the knowledge our proposal, we lay down the foundations of a new ontology discovery process. In International Conference on Discovery which would connect data mining concepts in the domain of Science. Springer, 126–140. food and nutrition using domain ontologies (FoodOn, ONS) with [15] Panče Panov, Larisa Soldatova, and Sašo Džeroski. 2014. ontologies for datatypes, data mining, and knowledge discovery Ontology of core data mining entities. Data Mining and in databases (OntoDT, OntoDM-core, OntoDM-KDD). By doing Knowledge Discovery, 28, 5-6, 1222–1265. so, we can provide richer semantic annotation and discover new [16] Panče Panov, Larisa N Soldatova, and Sašo Džeroski. 2016. scenarios of harvesting knowledge from the food and nutrition Generic ontology of datatypes. Information Sciences, 329, data. 900–920. [17] Barry Smith, Michael Ashburner, and Cornelius Rosse ACKNOWLEDGMENTS et al. 2007. The obo foundry: coordinated evolution of This work was supported by the Slovenian Research Agency through the ontologies to support biomedical data integration. Nature grant J2-9230, as well as the European Union’s Horizon 2020 research and biotechnology, 25, 11, 1251–1255. innovation programme through grant 863059 (FNS-Cloud, Food Nutrition [18] Francesco Vitali, Rosario Lombardo, and Damariz Rivero et Security). al. 2018. Ons: an ontology for a standardized description of interventions and observational studies in nutrition. REFERENCES Genes & nutrition, 13, 1, 12. [19] Mark D Wilkinson, Michel Dumontier, and IJsbrand Jan [1] Robert Arp, Barry Smith, and Andrew D Spear. 2015. Build- Aalbersberg et al. 2016. The fair guiding principles for ing ontologies with basic formal ontology. Mit Press. scientific data management and stewardship. [2] Mélanie Courtot, Frank Gibson, and Allyson L Lister et al. Scientific 2011. Mireot: the minimum information to reference an data, 3. [20] Zuoshuang Xiang, Mélanie Courtot, and Ryan R Brinkman external ontology term. Applied Ontology, 6, 1, 23–33. et al. 2010. Ontofox: web-based support for ontology reuse. 7https://github.com/panovp/FNS-Harmony BMC research notes, 3, 1, 175. 48 Absenteeism prediction from timesheet data: A case study Peter Zupančič Biljana Mileva Boshkoska Panče Panov 1A Internet d.o.o. 
Faculty of Information Studies in Jožef Stefan Institute and Naselje nuklearne elektrarne 2 Novo mesto, Ljubljanska cesta 31a, Jožef Stefan International Krško, Slovenia Novo mesto, Slovenia Postgraduate School peter.zupancic91@gmail.com Jožef Stefan Institute, Jamova cesta Jamova cesta 39 39, Ljubljana, Slovenia Ljubljana, Slovenia biljana.mileva@fis.unm.si pance.panov@ijs.si ABSTRACT In this paper, we address the task of absenteeism prediction Absenteeism, or employee absence from work, is a perpetual from time sheets data. More specifically, based on data that we get problem for all businesses, given the necessity to replace an from MojeUre time attendance register system, we want to build a absent worker to avoid a loss of revenue. In this paper, we focus predictive model to predict if or for how many days an employee on the task of predicting worker’s absence based on historical would be absent. In this case, we are considering one-week-ahead timesheet data. The data are obtained from MojeUre, a system for prediction from workers profiles and one year historical time tracking and recording working hours, which includes timesheet sheets data. To predict if an employee will be absent in a given profiles of employees from different companies in Slovenia. More week, we employee the task of binary classification, which can specifically, based on historical data for one year, we want to be addressed by using a large number of binary classification predict, under (which) certain conditions, if an employee will be methods. On the other hand, to predict the number of days an absent from work and for how long (e.g., a week, a month). In employee would be absent in a given week, we employee re- this respect, we compare the performance of different predictive gression, which can be addressed by using regression methods. modeling methods by defining the prediction task as a binary Furthermore, we observe and discuss how adding of aggregate classification task and as a regression task. Furthermore, in the attributes influences the prediction power if used together with case of one week ahead prediction, we test if we can improve the the timesheet profiles. predictions by using additional aggregate descriptive attributes, together with the timesheet profiles. 2 DATA KEYWORDS In this section, we present the MojeUre system and then de- Absenteeism at work, absence prediction, predictive modeling, scribe the structure of the raw data, as well as the process of timesheet data, human resource management data cleaning. Then we present the structure of the dataset, used for learning the predictive and the aggregate attributes, we con- structed in order to test if they would improve the predictive 1 INTRODUCTION power of the predictive models. Companies strive to have better predictive accuracy in their day to day operations, with the main goal of improving the productiv- ity of the human resources (HR) department and hence obtaining 2.1 MojeUre system higher profits and lower HR expenditures. They obtain informa- The MojeUre system (https://mojeure.si) was developed to sup- tion and insight from the large collections of human resource port the process of planning workers schedules, as well as for management (HRM) data that each employer owns, to support recording work attendance and absenteeism. In addition to the day to day operations and decision making, as well as, to comply easy recording of the working hours of employees by a company, to the national and international legislation. 
the system also provides access to each employee’s own working The new era of HR executives is moving from settling on hours, vacation control, sick leave, travel orders, etc. The system receptive choices exclusively taking into account reports and can be accessed using the web or by using a mobile application. dashboards towards connecting business information and hu- The entry of working hours is done either through a web man asset information to foresee future results which will bring application or a mobile application. In the case the company also changes. Having such data enables them to detect patterns and wants to invests into a working time registrar, this can be done trends, anticipate events and spot anomalies, forecast using what- through the registrar where the employee has a personalized card if simulations and learn of changes in employee behaviour so that for clock-in or clock-out (for example usage of break, such as a employee can take actions that lead to desired business outcomes. lunch break, a private break, etc.). The system allows different The purpose of HRM is measuring employee performance and en- types of registered hours to be entered in the system in a single gagement, studying workforce collaboration patterns, analyzing day. employee churn and turnover and modelling employee lifetime All data used in the paper was obtained from the electronic value [1]. system for recording working hours. There are currently more than 150 different companies that use the system for registering Permission to make digital or hard copies of part or all of this work for personal workers attendance. The basic function of the system is to record or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the arrivals and departures of an employee at work and to record the full citation on the first page. Copyrights for third-party components of this the various types of employee absence, such as sick leave and work must be honored. For all other uses, contact the owner/author(s). vacation leave. In addition, the system covers other absences Information society ’20, October 5–9, 2020, Ljubljana, Slovenia © 2020 Copyright held by the owner/author(s). such as paternity leave, maternity leave, part-time leave, study leave, student leave, etc. 49 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Zupančič et al. In this paper, we use data from the MojeUre system for the Table 3: Attributes representing the workers profiles year 2019 and we have timesheet attendance data for all 52 weeks. The data instances are composed of three types of attributes: (1) Attribute name Type Description attributes describing workers profiles (See Table 1), (2) attributes describing timesheets absence profiles of each worker (See Table VacationLeave numeric Total days of vacation leave for 2), and (3) attributes that are aggregates from timesheets profiles TotalDays all weeks, which are defined in constructed using domain knowledge (more details about the the timesheets data used for the attributes is provided in Section 2.2). The timesheets attributes descriptive attribute space. composing the absence profile of each worker are calculated SickLeave numeric Total days of sick leave for all based on the logged presence and absence logging data aggre- TotalDays weeks, which are defined in the gated on the week level.. 
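To make the weekly profiles concrete, the following pandas sketch aggregates per-day absence records of the kind described above into per-week counts and a binary absent-this-week flag; the column names and absence codes are assumptions for illustration, not the actual MojeUre export schema.

import pandas as pd

# Assumed minimal export: one row per employee per day, with the absence type if any.
log = pd.DataFrame({
    "EmployeeID": [1, 1, 1, 2],
    "date": pd.to_datetime(["2019-02-11", "2019-02-12", "2019-07-01", "2019-02-11"]),
    "absence_type": ["SickLeave", "SickLeave", "VacationLeave", None],
})

log["week"] = log["date"].dt.isocalendar().week
absent = log.dropna(subset=["absence_type"])

# Days of each absence type per employee and week (a weekly absence profile).
profile = (absent.groupby(["EmployeeID", "week", "absence_type"]).size()
                 .unstack(fill_value=0))

# Binary flag: the employee was absent at least one day in that week.
profile["AbsentThisWeek"] = (profile.sum(axis=1) > 0).astype(int)
print(profile)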
The entire dataset for the whole year timesheets data used for the de- consists of 232 different attributes and 2363 employees which are scriptive attribute space. defined as each row. ShortTerm numeric A count of how many times an VacationLeave3 employee was at vacation leave for at least 3 days per week. Table 1: Workers profile attributes LongTerm numeric A count of how many times an VacationLeave5 employee was on vacation leave Attribute name Type Description for at last 5 days per week. EmployeeID numeric Unique employee identifier. ShortTerm numeric A count of how many times an WorkHour numeric Data indicating how many SickLeave3 employee was on sick leave for hours per day an employee is at least 3 days. employed by contract. LongTerm numeric A count of how many times an CompanyType nominal Company type by specific cate- SickLeave5 employee was on sick leave for gories. at least 5 days. EmploymentYears numeric Describes how many years the WinterVacation numeric The number of vacation leave person has been employed by LeaveAbsence days that were used in winter. the current company. SpringVacation numeric The number of vacation leave JobType nominal Describes type of job (e.g. per- LeaveAbsence days that were used in spring. manent, part-time). SummerVacation numeric The number of vacation leave Region nominal The region in which the em- LeaveAbsence days that were used in summer. ployee’s company is located. AutumnVacation numeric The number of vacation leave LeaveAbsence days that were used in autumn. WinterSickLeave numeric The number of sick leave days Table 2: Timesheet absence profile attributes Absence that were used in winter. SpringSick numeric The number of sick leave days LeaveAbsence that were used in spring. Attribute name Type Description SummerSick numeric The number of sick leave days WeekWNYTotal numeric The number of all absences in LeaveAbsence that were used in summer. a given week, including the AutumnSick numeric The number of sick leave days sum of sick leave and (vacation) LeaveAbsence that were used in autumn. leave. WinterVacation numeric The number of vacation leave WeekWNY numeric The number of absences with LeaveHoliday days that were used in winter VacationLeave type vacation leave in a given during school holidays. week. SpringVacation numeric The number of vacation leave WeekWNY nominal The number of absences with LeaveHoliday days that were used in spring SickLeave type sick leave in a given week. during school spring holidays. WeekWNY nominal Value tells if employee was ab- SummerVacation numeric The number of vacation leave Absence sent at least 1 day in whole LeaveHoliday days that were used in summer week. during school summer holidays. AutumnVacation numeric The number of vacation leave LeaveHoliday days that were used in autumn 2.2 Data prepossessing and feature during school holidays. engineering Feature Engineering is an art (Shekhar A, 2018) and involves the process of using domain knowledge to create features with The period we are considering in our analysis is one year, the goal to increase the predictive power of machine learning that is composed of 52 weeks. For construction of the aggregate algorithms. In this section, we describe the newly constructed attributes, we have defined our seasons by weeks, defined as attributes using domain knowledge. Furthermore, we present the follows: (1) the winter season is defined from week 51 in the process of data cleaning. 
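To make the engineered aggregates of Table 3 concrete, a minimal sketch building on the weekly profile above could look as follows; the attribute names are illustrative and the season split follows the week-based definition given in Section 2.2.

import pandas as pd

def aggregate_attributes(weekly: pd.DataFrame) -> pd.DataFrame:
    """weekly: one row per (EmployeeID, Week) with SickLeave/VacationLeave day counts."""
    w = weekly.reset_index()
    g = w.groupby("EmployeeID")
    out = pd.DataFrame({
        "VacationLeaveTotalDays": g["VacationLeave"].sum(),
        "SickLeaveTotalDays": g["SickLeave"].sum(),
        # number of weeks with at least 3 (resp. 5) sick leave days
        "ShortTermSickLeave3": g["SickLeave"].apply(lambda s: int((s >= 3).sum())),
        "LongTermSickLeave5": g["SickLeave"].apply(lambda s: int((s >= 5).sum())),
    })
    # seasonal total, using the week-based winter definition from Section 2.2
    winter = w[(w["Week"] >= 51) | (w["Week"] <= 12)]
    out["WinterSickLeaveAbsence"] = winter.groupby("EmployeeID")["SickLeave"].sum()
    return out.fillna(0)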
Before cleaning, the original dataset previous year to week 12 in the New year; (2) the spring season contains 2087 instances of individual employees. The engineered is defined from week 13 to week 25; (3) the summer season is aggregate attributes using domain knowledge from timesheets defined from week 26 week to week 39; and (4) the autumn season profiles are presented in Table 3. is defined from week 40 week to week 49. 50 Absenteeism prediction from timesheet data: A case study Information society ’20, October 5–9, 2020, Ljubljana, Slovenia In addition, we also defined the school holidays by weeks, which are defined as follows: (1) the winter holidays are defined Target Descriptive attributes from week 7 to 8; (2) the spring holidays are defined from week attribute 18 to 19; (3) the summer holidays are defined from week 26 to Timesheet Worker absence Week K week 35; and (4) the autumn holidays are defined from week 44 profile binary profile Absence to week 45. 1-(K-1) week After we cleaned up the initial dataset, we obtained a smaller number of dataset instances. This resulted in a dataset with 961 (a) Without aggregate attributes distinct rows or more precisely different employees. The main Target control statement for the data cleaning was a test if an employee Descriptive attributes attribute has less than one VacationLeaveTotalDays in the defined period. Timesheet Timesheet This would mean that: (1) an employee that fulfills this condition Worker absence absence Week K doesn’t work any more in company; or (2) the company doesn’t profile binary profile aggregates Absence use recording system anymore; or (3) the employee is student 1-(K-1) week 1-(K-1) week and for students the vacation leave days are not recorded as they (b) With aggregate attributes are usually paid per working hour only. The most of employees in the dataset are working in company Figure 1: The structure of the data instances used for learn- type called “Izobraževanje, prevajanje, kultura, šport” (Education, ing predictive models translation services, culture, sports). In addition, most of the em- ployees are coming from the region “Osrednjeslovenska” (Central Slovenia region). The largest number of absence vacation leave or holiday leave was in week 52, which is the last week in year 2019 which is expected. the aggregate attributes were calculated. The absence of the 13th week was used a target attribute. For each quarter, we constructed 3 DATA ANALYSIS SCENARIOS AND two different variants of datasets, one containing the aggregate EXPERIMENTS attributes and the other without the aggregate attributes. This Research question. In general, in this paper we want to perform procedure was done for both tasks: binary classification and re- one-week ahead prediction of employee absence, using worker gression. profile data, historical timesheet data aggregated on a week level, as well as aggregated attributes described in the previous sec- Experimental setup. For our paper, we used Weka as main soft- tion. We explore the task of predicting employee absence both ware [2] to execute predictive modelling experiments. WEKA is as a binary classification task and as a regression task. 
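As an illustration of this dataset construction (predicting week K = 13 from weeks 1 to 12, as in the Q1 setting), a sketch of the data layout is given below. The actual experiments were run in Weka, so this Python snippet with placeholder names is only meant to show how the descriptive attributes and the two target formulations fit together.

import pandas as pd

def build_dataset(profiles, weekly, aggregates, k=13, use_aggregates=True, task="classification"):
    """profiles: one row per employee (indexed by EmployeeID);
    weekly: one row per (EmployeeID, Week) with absence day counts."""
    hist = weekly.reset_index().query("Week < @k")
    # weeks 1..K-1 flattened into one column per (attribute, week)
    wide = hist.pivot(index="EmployeeID", columns="Week",
                      values=["SickLeave", "VacationLeave", "Total"])
    wide.columns = [f"Week{w}{name}" for name, w in wide.columns]
    X = profiles.join(wide, how="inner")
    if use_aggregates:
        X = X.join(aggregates)
    target = weekly.reset_index().query("Week == @k").set_index("EmployeeID")
    if task == "classification":
        y = (target["Total"] > 0).astype(int)   # will the employee be absent in week K?
    else:
        y = target["Total"]                     # how many days absent in week K?
    return X, y.reindex(X.index).fillna(0)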
In the an open source software provides tools for data preprocessing, experiments, we want to test if and how the aggregates attributes implementation of several Machine Learning algorithms, and influence the predictive power of the built models both for the visualization tools so that one can develop machine learning case of binary classification and regression. techniques and apply them to real-world data mining problems. In the experiments, for all methods we used the default method Tasks. In the binary classification task, we want only to predict settings from Weka mining software. The evaluation method if an employee will be absent in a given week. For this case, we used was 10 fold cross-validation. use the boolean attribute WeekWNYAbsence as a target attribute (WNY is the identifier of the target week). In the regression Methods. Here, we used different predictive methods imple- task, we want to predict the number of absence days. For this mented in the WEKA software with different settings. For the case, we use one of the following numeric attributes as targets regression task, we compare the performance of the following WeekWNYTotal (for predicting the total number of absence days), methods Linear regression (LR), M5P (both regression and model WeekWNYVacationLeave (for predicting the number of vacation trees)[3], RandomForest (RF) [4] with M5P trees as base learners, leave days), or WeekWNYSickLeave (for predicting the number Bagg (Bag) [5] having M5P trees as base learners, IBK (nearest of sick leave days). neighbour classifier with different number of neighbours) [6] and SMOreg (support vector regression) [7]. Construction of the experimental datasets For the purpose For binary prediction, we compare the performance of the of analysis, we construct two types of datasets: (1) the first type following methods: jRIP (decision rules) J48 (decision trees) Ran- contain worker profile and timesheet absence profiles as descrip- domForest (RF), Bagging (Bagg) having J48 trees as base learners, tive attributes (see Figure 1a); and (2) the second type includes RandomSubSpace (RS) [8] having J48 trees as base learners, SMO also timesheets absence aggregates (see Figure 1b). (support vector machines) [9], and IBK (nearest neighbour classi-In order to perform analysis, we need to properly construct the fier with different number of neighbours). datasets used for learning predicting models. For example, if we want to predict workers absence for week 15, we use historical Evaluation measures. To answer our research question for the timesheets data from week 1-14 together with the aggregates case of regression, we use several measures for regression anal- calculated on this period as descriptive attributes. ysis, such as: Mean Absolute Error (MAE), Root mean squared We decided to split the year consisting of 52 weeks in four error (RMSE), and Correlation coefficient (CC). quarters (Q1: W1-W13, Q2: W14-W26, Q3:W27-W39, Q4:W40- For the case of classification, we use several measures for clas- W52), each containing 13 weeks. The absence data for the first sification analysis, such as: the percentage of correctly classified 12 weeks were used as historical timesheet profiles, out of which instances (classification accuracy), precision, and recall. 51 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Zupančič et al. Table 4: Predictive performance results. The bold value denotes the highest value when we compare datasets with (A) or without (NA) added aggregate attributes. 
The gray cells denote the best performing method for each dataset. (a) Performance results for the regression task - RMSE measure (less is better) Dataset LR MP5 M5P-R RF Bagg IBK(K=1) IBK(K=3) IBK(K=7) SMOreg Q1-A 0.789 0.692 0.775 0.688 0.64 0.804 0.687 0.734 0.681 Q1-NA 0.723 0.674 0.767 0.729 0.647 0.798 0.693 0.724 0.659 Q2-A 1.692 1.369 1.422 1.412 1.438 1.894 1.476 1.382 1.617 Q2-NA 1.44 1.382 1.396 1.457 1.379 1.752 1.506 1.425 1.497 Q3-A 0.942 0.919 0.976 0.999 0.935 1.409 1.074 1.015 0.963 Q3-NA 0.911 0.929 0.956 0.968 0.927 1.223 1.046 1.017 0.969 Q4-A 0.977 0.947 0.961 0.923 0.922 1.222 1.029 1.005 0.984 Q4-NA 0.992 0.985 0.976 1.024 0.975 1.186 1.066 0.999 1.007 (b) Performance results for the classification task - Accuracy in% (more is better) Dataset JRip j48 RF Bagg RS SMO IBK(K=1) IBK(K=3) IBK(K=7) Q1-A 87.429 90.810 90.357 90.833 89.881 92.762 87.452 91.810 90.810 Q1-NA 87.429 90.810 90.381 89.857 90.357 90.833 89.429 91.810 90.833 Q2-A 63.645 68.879 65.751 65.419 66.736 69.200 58.153 64.347 68.842 Q2-NA 66.466 68.177 67.118 66.441 66.429 66.773 65.049 62.291 67.463 Q3-A 84.429 84.404 83.288 83.061 84.409 86.677 77.182 82.616 85.333 Q3-NA 83.737 83.520 82.379 83.737 84.864 86.449 81.263 85.101 84.879 Q4-A 71.130 67.277 72.150 70.460 70.305 70.452 69.627 70.644 70.302 Q4-NA 70.455 68.266 66.774 67.441 69.791 69.466 66.093 67.610 68.960 4 RESULTS AND DISCUSSION be absent in a given week). To see the difference in performance, Regression task1. In Table 4a, we present the results for RMSE we performed experiments on datasets constructed on different measure. It indicates how close the observed data points are to quarters of the year. The best prediction method in the case of re- the model’s predicted values, and lower values indicate better fit. gression is Bagging and in general we could say that predictions From the results, we can observe that in general Bagging of M5P are slightly better if we don’t use aggregate attributes. The best trees obtains the best performance. Predicting absence in week method in the case of classification is SMO. Again almost same 13 from Q1 is generally better without using aggregate attributes. results with using or not using external aggregate attributes. We have similar behaviour for predicting absence in week 26 (Q2) In future work, we plan to perform selective analysis of absen- and week 39 (Q3). Predicting absence for the last week in the teeism using the same data based on different criteria, such as year from Q4 is generally better done using additional aggregate seasonality, closeness to holidays (before, after), critical weeks for attributes. If we consider MAE, the best performing method is certain professions etc. In addition, we plan to perform regional SMOreg, and for Q1, Q2 better results are obtained without the analysis and workers domain analysis which is based on com- use of aggregate attributes, opposite to the Q3 and Q4. Finally, if pany type. Moreover, more insight into absence patterns will be we consider CC the best performing method is Bagging, and for available after collecting several years of attendance data for each Q1 and Q4 better results are obtained without using aggregate employee. Finally, we plan to compare the different granularity attributes, opposite to Q2 and Q3. of prediction (day - based vs. week - based vs. half a month based vs. month based analysis). Classification task2. In Table 4b, we present the results for accuracy. 
From the results, we can observe that in general SMO ACKNOWLEDGMENTS obtains the best performance. For Q1, we obtain better results We thank the company 1A Internet d.o.o., which provided us access to if we do not include aggregate attributes. For Q2, Q3 and Q4 the data which were used in our research. Panče Panov is supported by the best results are obtained by using the additional aggregate the Slovenian Research Agency grant J2-9230. attributes. If we consider precision the best performing methods are SMO and JRip, while for recall the best performing method REFERENCES is IBK using 7 nearest neighbours. [1] Malisetty, S., Archana, R. V., & Kumari, K. V. (2017). Predictive analytics in HR management., Indian Journal of Public Health Research & Development, 8(3), 115-120. 5 CONCLUSION AND FUTURE WORK [2] Witten, I. H., & Frank, E. (2002). Data mining: practical machine learning tools The main goal of the paper was to test if adding additional and techniques with Java implementations., Acm Sigmod Record, 31(1), 76-77. [3] Ross J. Quinlan. Learning with Continuous Classes. In: 5th Australian Joint timesheet aggregate attributes can influence the predictive power Conference on Artificial Intelligence., Singapore, 343-348, 1992. in the case of one-week ahead absenteeism prediction from [4] Leo Breiman (2001). Random Forests., Machine Learning. 45(1):5-32. timesheet data. The research was performed on data from year [5] Leo Breiman (1996). Bagging predictors., Machine Learning. 24(2):123-140. [6] D. Aha, D. Kibler (1991). Instance-based learning algorithms., Machine Learning. 2019, collected by the MojeUre work attendance register system. 6:37-66. We used various predictive modelling methods formulating the [7] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, K.R.K. Murthy. Improvements to the prediction task as regression (predicting the number of absent SMO Algorithm for SVM Regression., In: IEEE Transactions on Neural Networks, 1999. days in a week) and classification (predicting if an employee will [8] Tin Kam Ho (1998) The Random Subspace Method for Constructing Decision Forests., IEEE Transactions on Pattern Analysis and Machine Intelligence. 1Complete results for regression are presented at the following URL 20(8):832-844. URL http://citeseer.ist.psu.edu/ho98random.html. https://tinyurl.com/yyp85vfr [9] J. Platt. Fast Training of Support Vector Machines using Sequential Minimal 2Complete results for classification are presented at the following URL Optimization., In B. Schoelkopf and C. Burges and A. Smola, editors, Advances https://tinyurl.com/y6o6h6d8 in Kernel Methods - Support Vector Learning, 1998. 52 Monitoring COVID-19 through text mining and visualization M.Besher Massri Joao Pita Costa Andrej Bauer Jožef Stefan Institute, Slovenia Quintelligence, Slovenia University of Ljubljana, Slovenia besher.massri@ijs.si joao.pitacosta@quintelligence.com andrej.bauer@andrej.com Marko Grobelnik Janez Brank Luka Stopar Jožef Stefan Institute, Slovenia Jožef Stefan Institute, Slovenia Jožef Stefan Institute, Slovenia marko.grobelnik@ijs.si janez.brank@ijs.si luka.stopar@ijs.si ABSTRACT The global health situation due to the SARS-COV-2 pandemic motivated an unprecedented contribution of science and tech- nology from companies and communities all over the world to fight COVID-19. In this paper, we present the impactful role of text mining and data analytics, exposed publicly through IRCAI’s Coronavirus Watch portal. 
We will discuss the available technol- ogy and methodology, as well as the ongoing research based on the collected data. KEYWORDS Text mining, Data analytics, Data visualisation, Public health, Figure 1: Coronavirus Watch portal Coronavirus, COVID-19, Epidemic intelligence 1 INTRODUCTION the lack of resolution of the data in aspects like the geographic When the World Health Organization (WHO) announced the location of reported cases, the commodities (i.e., other diseases global COVID-19 pandemic on March 11th 2020 [25], following that also influence the death of the patient), the frequency of the the rising incidence of the SARS-COV-2 in Europe, the world data, etc. On the other hand, it was not common to monitor the started reading and talking about the new Coronavirus. The ar- epidemic through the worldwide news (with some exceptions as rival of the epidemic to Europe scaled out the news published the Ravenpack Coronavirus News Monitor [21]). about the topic, while public health institutions and governmen- The Coronavirus Watch portal suggests the association of tal agencies had to look for existing reliable solutions that could reported incidence with worldwide published news per country, help them plan their actions and the consequences of these. which allows for real-time analysis of the epidemic situation Technological companies and scientific communities invested and its impact on public health (in which specific topics like efforts in making available tools (e.g. the GIS [1] later adopted mental health and diabetes are important related matters) but by the World Health Organisation (WHO)), challenges (e.g. the also in other domains (such as economy, social inequalities, etc.). Kaggle COVID-19 competition [13]), and scientific reports and This news monitoring is based on state-of-the-art text mining data (e.g. the repositories medRxiv [15] and Zenodo [27]). technology aligned with the validation of domain experts that In this paper we discuss the Coronavirus Watch portal [12], ensures the relevance of the customized stream of collected news. made available by the UNESCO AI Research Institute (IRCAI), Moreover, the Coronavirus Watch portal offers the user other comprehending several data exploration dashboards related to perspectives of the epidemic monitoring, such as the insights the SARS-COV-2 worldwide pandemic (see the main portal in from the published biomedical research that will help the user Figure 1). This platform aims to expose the different perspectives to better understand the disease and its impact on other health on the data generated and trigger actions that can contribute to conditions. While related work was promoted in [13] in relation a better understanding of the behavior of the disease. with the COVID-19, and is offered in general by MEDLINE mining tools (e.g., MeSH Now [16]), there seems to be no dedicated tool 2 RELATED WORK to the monitoring and mining of COVID-19 - related research as that presented here. The many platforms that have been made publicly available over the internet to monitor aspects of the COVID-19 pandemics are mostly focusing on data visualization based on the incidence of 3 DESCRIPTION OF DATA the disease and the death rate worldwide (e.g., the CoronaTracker 3.1 Historical COVID-19 Data [3]). 
The limitations of the available tools are potentially due to To perform an analysis of the growth of the coronavirus, we need Permission to make digital or hard copies of all or part of this work for personal to use the historical data of cases and deaths. This data is retrieved or classroom use is granted without fee provided that copies are not made or from a GitHub repository by John Hopkins University[4]. The distributed for profit or commercial advantage and that copies bear this notice and data source is based mainly on the official data from the World the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy Health Organization (WHO)[24] along with some other sources, otherwise, or republish, to post on servers or to redistribute to lists, requires prior like the Center for Disease and Control[2], and Worldometer[26], specific permission and/or a fee. Request permissions from permissions@acm.org. among others. This data provides the basis for all functionality Information society ’20, October 5–9, 2020, Ljubljana, Slovenia © 2020 Association for Computing Machinery. that depended on the statistical information about COVID-19 numbers. 53 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia 3.2 Live Data from Worldometer Apart from historical data, live data about the COVID-19 number of cases, deaths, recovered, and tests are retrieved from the worl- dometer website. Although the cases might not be as official as the one provided by John Hopkins University (which is based on WHO data), this source is updated many times per day providing the latest up-to-date data about COVID-19 statistics at all times. 3.3 Live News about Coronavirus The live news is retrieved from Event Registry [10], which is a media-intelligence platform that collects news media from around the world in many languages. The service analyzes news from more than 30,000 news, blogs, and PR sources in 35 lan- Figure 2: A snapshot of the 5D Visualization on March guages. 23rd. Countries that were at the peak in terms of growth are shown high up like Turkey. Whereas countries that 3.4 Google COVID-19 Community mostly contained the virus are shown down like China. Mobility Data Google’s Community Mobility [11] data compares mobility pat-by clicking on the country name on the left table. As seen in terns from before the COVID-19 crisis and the situation on a figure 1. weekly basis. Mobility patterns are measured as changes in the frequency of visits to six location types: Retail and recreation, 4.3 Statistical Visualizations Grocery and pharmacy, Parks, Transit stations, Workplaces, and Residential. The data is provided on a country level as well as on The following set of visualization all aims at displaying the statis- a province level. tics about COVID-19 cases and deaths in a visual format. While they all provide countries comparison, each one focus on differ- ent perspective; Some are more complex and focus on the big 3.5 MEDLINE: Medical Research Open picture (5D evolution), and some are simple and focus on one Dataset aspect (Progression and Trajectory). 
Besides, all of them have The MEDLINE dataset [14] contains more than 30 million cita-configuration options to tweak the visualization, like the ability tions and abstracts of the biomedical literature, hand-annotated to change the scale of the axes to focus on the top countries or by health experts using 16 major categories and a maximum of the long tale. Or a slider to manually move through the days for 13 levels of deepness. The labeled articles are hand-annotated by further inspection. Furthermore, the default view compares all humans based on their main and complementary topics, and on the countries or the top N countries, depending on the visualiza- the chemical substances that they relate to. It is widely used by tion. However, it’s possible to track a single country or a set of the biomedical research community through the well-accepted countries and compare them together for a more focused view. search engine PubMed [19]. This is done by selecting the main country by clicking on it on the left table and proceeding to select more countries by pressing 4 CORONAVIRUS WATCH DASHBOARD the ctrl key while clicking on the country. The main layout of the dashboard displayed in figure 1 consists 4.3.1 5D Evolution. 5D Evolution is a visualization that displays of two sides. It is split into the left table of countries, where a the evolution of the virus situation through time. It is called like simple table of statistics is provided about countries along with that since it encompasses five dimensions: x-axis, y-axis, bubble the total numbers of cases, deaths, and recovered. On the right size, bubble color, and time, as seen in figure 2. By default, it il-side, there is a navigation panel with tabs, each representing a lustrates the evolution of the virus in countries based on N. cases functionality. Each functionality answers some questions and (x-axis), The growth factor of N. Cases (y-axis), N. Deaths (bubble provides insights about a certain type of data. size), and country region (bubble color) through time. In addition, a red ring around the country bubble is drawn whenever the first 4.1 Coronavirus Data Table death appears. The growth rate represents how likely that the The data table functionality is a simple table that shows the basic numbers are increasing with respect to the day before. A growth statistics about the new coronavirus. It’s taken from Worldometer rate of 2 means that the numbers are likely to double in the next as it’s the most frequently updated source for coronavirus. The day. The growth rate is calculated using the exponential regres- data table comes in two forms, one that is a simplified version sion model. At each day the growth rate is based on the N. cases which is the table on the left, and one contains the full information from the previous seven days. The goal of this visualization to in a separate tab. show how countries relate to each other and which are exploding in numbers and which ones managed to "flatten the curve", since 4.2 Coronavirus Live News flattening the curve means less growth rate. It’s intended to be one visualization that gives the user a big picture of the situation. The second functionality is a live news feed about coronavirus from around the world. The feed comes from Event Registry, 4.3.2 Progression. The progression visualization displays the which is generated by querying for articles that are annotated simple Date vs N. cases/deaths line graph. It helps to provide with concepts and keywords related to coronavirus. 
The user can a simplistic view of the situation and compare countries based check for a country’s specific news (news source in that country) on the raw numbers only. The user can display the cumulative 54 Monitoring COVID-19 through text mining and visualization Information society ’20, October 5–9, 2020, Ljubljana, Slovenia numbers where each day represents the numbers up to now, or daily where at each date the numbers represent the cases/deaths on that day only. 4.3.3 Trajectory. While the progress visualization displays the normal date vs N. cases/deaths, this visualization seeks to com- pare how the trajectory of the countries differ starting from the point where they detect cases. This visualization helps to com- pare countries’ situations if they all start having cases on the same date. The starting point has been set to the day the country reaches 100 cases, so we would compare countries when they started gaining momentum. 4.4 Time Gap The time gap functionality tries to estimate how the countries are aligned and how many days each country is behind the other, whether that is in the number of cases or deaths. This assumes that the trajectory of the country will continue as it with taking much more strict/loose measurements, which is a rough assump- tion. It helps to estimate how bad or good the situation in terms of the number of days. To see the comparison, a country has to Figure 3: A snapshot of the Social Distancing Simulator. be selected from the table on the left. However, not all countries The canvas show a representation of the population. with are comparable as they have very different trajectories or growth red dots representing sick people, yellow dots represent- rates. ing immunized people, and grey dots represent deceased The growth of each country is represented as an exponential people. function, the base is calculated using linear regression on the log of the historical values (that is, exponential regression). Based on that, the duplication N. days, or the N. days the number of The simulator is controlled by three parameters. First, Social cases/deaths will double is determined. two countries are compa- distancing that controls to what extent the population enforces rable if they have a reasonable difference is the base or doubling social distancing. At 0% there is no social distancing and per- factor. If they are comparable, we see where the country with the sons move with maximum speed so that there is a great deal smaller value fits in the historical values of the country with the of contact between them. At 100% everyone remains still and larger numbers, with linear interpolation if the number is not there is no contact at all. Second, mortality is the probability exact, hence the decimal values. that a sick person dies. If you set mortality to 0% nobody dies, while the mortality of 100% means that anybody who catches the infection will die. Finally, infection duration determines how 4.5 Mobility long a person is infected. A longer time gives an infected person The mobility visualization is based on google community mo- more opportunities to spread the infection. Since the simulation bility data that describe how communities in each country are runs at high speed, time is measured in seconds. moving based on 6 parameters: Retail and recreation, Grocery and pharmacy, Parks, Transit stations, Workplaces, and Residen- 4.7 Biomedical Research Explorer tial. 
The data is then reduced to 2-dimensional data while keeping To better understand the disease, the published biomedical sci- the Euclidean proximity nearly the same. The visualization can ence is the source that provides accurate and validated infor- indicate that the closer the countries are on the visualization, the mation. Taking into consideration a large amount of published similar the mobility patterns they have. The visualization uses science and the obstacles to access scientific information, we the T-SNE algorithm for dimensionality reduction [23], which made available a MEDLINE explorer where the user can query reduces high dimensional data to low dimensional one while the system and interact with a pointer to specify the search re- keeping the distance proximity between them proportionally sults (e.g., obtaining results on biomarkers when searching for the same as possible. The algorithm works in the form of iter- articles hand-annotated with the MeSH class "Coronavirus"). ations, at each iteration, the bubbles representing the country To allow for the exploration of any health-related texts (such as are drawn. We used those iterations to provide animation to the scientific reports or news) we developed an automated classifier visualization. [5] that assigns to the input text the MeSH classes it relates to. The annotated text is then stored in Elasticsearch [18], from where 4.6 Social Distancing Simulator it can be accessed through Lucene language queries, visualized The Social Distancing simulator is displayed in figure 3. Each over easy-to-build dashboards, and connected through an API circle represents a person who can be either healthy (white), to the earlier described explorer (see [8], [20] and [17] for more immune (yellow), infected (red), or deceased (gray). A healthy detail). person is infected when they collide with an infected person. The integration of the MeSH classifier with the worldwide After a period of infection, a person either dies or becomes per- news explorer Event Registry allows us to use MeSH classes in manently immune. Thus the simulation follows the Susceptible- the queries over worldwide news promoting an integrated health Infectious-Recovered-Deceased (SIRD) compartmental epidemio- news monitoring [9] and trying to avoid bias in this context logical model. [7]. An obvious limitation is a fact that the annotation is only 55 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia available for news written in the English language, being the [7] J. Pita Costa et al. 2019. Health news bias and its impact unique language in MEDLINE. in public health. In Proceedings of the Slovenian KDD con- ference. 5 CONCLUSION AND FUTURE WORK [8] J. Pita Costa et al. 2020. Meaningful big data integration for In this paper, we presented the coronavirus watch dashboard as a global covid-19 strategy. Computer Intelligence Magazine. a use-case of observing pandemic. However, this methodology [9] J. Pita Costa et al. 2017. Text mining open datasets to sup- can be applied to other kinds of diseases given the availability of port public health. In WITS 2017 Conference Proceedings. similar data. For further development, we plan to implement a [10] EventRegistry. 2020. Event Registry. https://eventregistry. local dashboard for other countries as well which would provide org. (2020). local data in the local language. In addition, given the existence of [11] Google. 2020. 
Google COVID-19 Community Mobility Re- more than seven months of historical data, we would like to build port. https://www.google.com/covid19/mobility/. (2020). some predictive models to predict the number of cases/deaths in [12] IRCAI. 2020. IRCAI coronavirus watch portal. http : / / the next few days. coronaviruswatch.ircai.org/. (2020). Moreover, we are using the StreamStory technology [22] in [13] Kaggle. 2020. Kaggle covid-19 open research dataset chal- order to: (i) compare the evolution of the disease between coun- lenge. https : / / www. kaggle. com / allen - institute - for - tries by comparing their time-series of incidence; (ii) investi- ai/CORD-19-research-challenge. (2020). gate the correlation between the incidence of the disease with [14] MEDLINE. 2020. MEDLINE description of the database. weather conditions and other impact factors; and (iii) analyze https://www.nlm.nih.gov/bsd/medline.html. (2020). the dynamics of the evolution of the disease based on incidence, [15] medRxiv. 2020. medRxiv covid-19 sars-cov-2 preprints morbidity, and recovery. This technology allows for the anal- from medrxiv and biorxiv. https://connect.medrxiv.org/ ysis of dynamical Markov processes, analyzing simultaneous relate/content/181. (2020). time-series through transitions between states, offering several [16] MeSHNow. 2020. MeSHNow. https://www.ncbi.nlm.nih. customization options and data visualization modules. gov/CBBresearch/Lu/Demo/MeSHNow/. (2020). Furthermore, following the work done in the context of the [17] MIDAS. 2020. MIDAS COVID-19 portal. http : / / www. Influenza epidemic in [6], we are using Topological Data Analysis midasproject.eu/covid-19/. (2020). methods to understand the behavior of COVID-19 throughout [18] Elastic NV. 2020. Elasticsearch portal. https://www.elastic. Europe. In it, we examine the structure of data through its topo- co/. (2020). logical structure, which allows for comparison of the evolution [19] PubMed. 2020. PubMed biomedical search engine. https: of the epidemics within countries through the encoded topology //pubmed.ncbi.nlm.nih.gov/. (2020). of their incidence time series. [20] Quintelligence. 2020. Quintelligence COVID-19 portal. http://midas.quintelligence.com/. (2020). ACKNOWLEDGMENTS [21] Ravenpack. 2020. Ravenpack coronavirus news monitor. The first author has been supported by the Knowledge 4 All https://coronavirus.ravenpack.com/. (2020). foundation and the H2020 Humane AI project under the European [22] Luka Stopar. 2020. StreamStory. http://streamstory.ijs.si/. research and innovation programme under GA No. 761758), while (2020). the second author was funded by the European Union research [23] Laurens van der Maaten and Geoffrey Hinton. 2008. Vi- fund ’Big Data Supporting Public Health Policies’, under GA ualizing data using t-sne. Journal of Machine Learning No. 727721. The third author acknowledges that this material is Research, 9, (November 2008), 2579–2605. based upon work supported by the Air Force Office of Scientific [24] WHO. 2020. WHO Coronavirus portal. https://www.who. Research under award number FA9550-17-1-0326. int/emergencies/diseases/novel-coronavirus-2019. (2020). [25] WHO. 2020. World Health Organization who director- REFERENCES general’s opening remarks at the media briefing on covid- 19 - 11 march 2020. https://www.who.int/dg/speeches/ [1] ArcGIS. 2020. ArcGIS who covid-19 dashboard. https:// detail/who-director-general-s-opening-remarks-at-the- covid19.who.int/. (2020). media-briefing-on-covid-19---11-march-2020. (2020). 
[2] CDC. 2020. Center for Disease Control and Prevention. [26] WorldoMeters. 2020. WorldoMeters. https://www.worldometers. https://www.cdc.gov/coronavirus/2019-ncov/index.html. info/coronavirus/. (2020). (2020). [27] Zenodo. 2020. Zenodo coronavirus disease research com- [3] CoronaTracker. 2020. CoronaTracker. https://www.coronatracker. munity. https : / / zenodo . org / communities / covid - 19/. com/analytics/. (2020). (2020). [4] CSSE. 2020. Covid-19 data repository by the center for systems science and engineering (csse) at johns hopkins university. https://github.com/CSSEGISandData/COVID- 19. (2020). [5] J. Pita Costa et al. 2020. A new classifier designed to an- notate health-related news with mesh headings. Artificial Intelligence in Medicine. [6] J. Pita Costa et al. 2019. A topological data analysis ap- proach to the epidemiology of influenza. In Proceedings of the Slovenian KDD conference. 56 Usage of Incremental Learning in Land-Cover Classification Jože Peternelj Beno Šircelj Klemen Kenda Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Jamova 39, 1000 Ljubljana, Jožef Stefan International Slovenia Slovenia Postgraduate School joze.peternelj@ijs.si beno.sircelj@ijs.si Jamova 39, 1000 Ljubljana, Slovenia klemen.kenda@ijs.si ABSTRACT 2. DATA In this paper we present a comparison of a variety of incre- 2.1 EO data mental learning algorithms along with traditional (batch) The Earth observation data were provided by the Sentinel 2 learning algorithms in an earth observation scenario. The mission of the EU Copernicus programme, whose main ob- approach was evaluated with the earth observation data jectives are land monitoring, detection of land use and land set for land-cover classification from Europe Space Agency’s changes, support for land cover creation, disaster relief sup- Sentinel-2 mission, the digital elevation model and the ground port and monitoring of climate change [2]. The data com-truth data of land use and land cover from Slovenia. We prise 13 multi-spectral channels in the visible/near- infrared show that incremental algorithms can produce competitive (VNIR) and short wave infrared (SWIR) spectral range with results while using less time than batch methods. a temporal resolution of 5 days and spatial resolutions of 10m, 20m and 60m [8]. The Sentinel’s Level-2A products Keywords (surface reflections in cartographic geometry) were accessed remote sensing, earth observation, incremental learning, ma- via the services of SentinelHub1 and processed using eo-chine learning, classification learn2 library. Additionally, a digital elevation model for Slovenia (EU-DEM) with 30m resolution3 was used. 1. INTRODUCTION 2.2 LULC data Land cover classification is one of the common and well re- searched tasks of machine learning (ML) in the Earth Ob- LULC (Land Use Land Cover) data for Slovenia is collected servation (EO) community [1]. The challenge is to classify by the Ministry of Agriculture, Forestry and Food and is land into different types based on remote sensing data such publicly available [10]. The data is provided in shapefile for-as satellite images, radar data, information on weather [12] mat, with each polygon representing a patch of land marked and altitude. The most commonly used data are satellite with one of the LULC classes. Originally there were 25 images, which may vary in acquisition period, resolution or classes, but we introduced a more general dataset by group- wavelength. 
A plethora of algorithms have explored the potential of using a single-date image [3] and even time series of images for the task [11, 13]. Extensive work with state-of-the-art accuracy was performed using methods of deep learning [14]. The latter report a high computational effort in the learning and forecasting phase, which reduces their potential for continuous tasks requiring a timely response. There have also been efforts to reduce learning and prediction times using intelligent feature selection [6, 7]. To the best of our knowledge, no cases have been reported where stream models have been used in an EO scenario. The primary purpose of incremental learning would be to reduce the computational cost of classification, regression, or clustering techniques, which, when dealing with the large data provided by Sentinel 2 and other sources, can be a significant cost to organizations trying to extract knowledge from that data. One of the advantages of incremental learning is that it is not necessary to load all the data into memory at once when creating a model. We only need to store the model and the part of the data we are processing. This could be especially useful in various EO scenarios, as the data from Copernicus services is estimated to exceed 150 PB.
ing similar classes together. The frequencies of the 8 newly grouped classes are shown in Figure 1.
1https://www.sentinel-hub.com/
2https://github.com/sentinel-hub/eo-learn
3https://www.eea.europa.eu/data-and-maps/data/eu-dem#tab-original-data
57
Figure 1: Frequencies of grouped classes for LULC data from 2017 show that the new simplified classification preserves the most common classes separated and merges the less common classes. Classes with the lowest frequencies were selected for oversampling.
2.3 Feature Engineering
The EO data were collected for the whole year. 4 raw band measurements (red, green, blue - RGB and near-infrared - NIR) and 6 relevant vegetation-related derived indices (normalized differential vegetation index - NDVI, normalized differential water index - NDWI, enhanced vegetation index - EVI, soil-adjusted vegetation index - SAVI, structure insensitive pigment index - SIPI and atmospherically resistant vegetation index - ARVI) were considered. The derived indices are based on extensive domain knowledge and are used for assessing vegetation properties. One example is the NDVI index, which is an indicator of vegetation health and biomass. Its value changes during the growth period of the plants and differs significantly from other unplanted areas. The NDVI is calculated as: NDVI = (NIR − red) / (NIR + red).
Figure 2: Example of some of the timeless features. ARVI_max_mean_len shows the length of maximum mean value in a sliding temporal neighbourhood of ARVI index. BLUE_max_mean_surf shows the surface of the flat interval area containing the peak using the blue raw band. EVI_mean_val shows mean value of EVI index and SAVI_neg_sur shows the maximum surface of the first negative derivative interval of SAVI index.
Timeless features were extracted based on Valero et al. [11]. These features can describe the three most important crop stages: the beginning of greenness, the ripening period and the beginning of senescence [11, 13]. Annual time series have different shapes due to the phenological cycle of a crop and characterize the development of a crop. With timeless features, they can be represented in a condensed form. For each pixel, 18 features per each of 10 time series were generated.
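As a small illustration of the band-derived indices listed in Section 2.3: the paper computes them with eo-learn, and only the NDVI formula is stated in the text, so the NDWI form below (the common green/NIR definition) is an assumption.

import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - red) / (NIR + red), computed per pixel."""
    return (nir - red) / (nir + red + 1e-8)

def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Common NDWI definition (assumption): (green - NIR) / (green + NIR)."""
    return (green - nir) / (green + nir + 1e-8)

# toy 2x2 reflectance patches
red = np.array([[0.10, 0.12], [0.30, 0.05]])
nir = np.array([[0.45, 0.40], [0.32, 0.50]])
print(ndvi(nir, red))  # high values indicate healthy vegetation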
From elevation data, the raw value and maximum tilt for a given pixel were calculated as 2 additional features. In total 182 features were constructed. From these features only a Pareto-optimal subset of 9 features was selected [6].
3. METHODOLOGY
Classification accuracy (CA) and F1 score were calculated for 11 different ML methods, 6 batch learning methods and 5 incremental learning methods. All incremental learning methods are available in the ml-rapids (MLR)4 library, which has been developed in order to support the use of incremental learning techniques within the eo-learn [4] library.
Hoeffding Tree (incremental)
Hoeffding tree (HT) is an incremental decision tree that can learn from massive streams. It assumes that the distribution of generating examples does not change over time. The Hoeffding tree begins as an initially empty leaf. Each time a new example arrives, the algorithm sorts it down the tree (it updates the internal nodes' statistics) until it reaches a leaf. When it reaches the leaf, it updates the leaf statistics of all unused attributes. It then takes the best (A) and second-best (B) attributes based on standard deviation and calculates the ratio of their reductions. To find the best attribute to split a node, the Hoeffding bound is used. The algorithm first checks if the ratio is less than 1 − ε, where ε = √(log(1/δ) / (2n)) and 1 − δ is the desired confidence. If the ratio is small enough, meaning that attribute A is really better than attribute B, then the algorithm divides the node by that attribute.
Bagging of HT (incremental)
Given a standard training set D of size n, bagging generates m new training sets Di, each of size n′, by uniform sampling from D. Because the sampling is done with replacement, some observations can be repeated in each Di. If n′ = n, then for large n the set Di is expected to have the fraction (1 − 1/e) (≈ 63.2%) of the unique examples of D, the rest being duplicates. Then, m HT models are fitted using the above m samples and combined by voting. To include a new sample, a random subset of models is selected according to the Poisson distribution [9], and these models are updated with the sample in the same way as the HT model described above.
Naïve Bayes (incremental)
Naïve Bayes (NB) is a classification technique based on Bayes' Theorem. It lets us calculate the probability of data belonging to a given class, given prior knowledge. Bayes' Theorem is: P(class|data) = P(data|class) · P(class) / P(data), where P(class|data) is the probability of the class given the provided data. To add a new training instance, NB only needs to update the relevant entries in its probability table.
Logistic Regression (incremental)
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable. Consider a model with two predictors x1 and x2 and a binary variable Y, and let p = P(Y = 1) denote the probability of belonging to the positive class. The relationship between these terms can be modeled with the following equation: p = 1 / (1 + e^(−(β0 + β1x1 + β2x2))). The parameters β0, β1, β2 can be determined by stochastic gradient descent using the logistic loss function.
4https://github.com/JozefStefanInstitute/ml-rapids
58
Figure 4: F1 score vs. inference time of different models for predicting LULC classes. *Denotes incremental algorithms.
Perceptron (incremental)
Perceptron is very similar to Logistic regression. It models a binary variable with the same activation function.
The only difference is in the cost function that is used for gradient descend. Batch learning methods We can observe that ml-rapid’s Na¨ıve Bayes, Hoeffding Tree, Batch learning methods learn from the whole training set Bagging of HT, Decision Trees, LGBM and Random Forest and do not have to rely on heuristics (e.g. Hoeffding bound) belong to the Pareto optimal set of algorithms according to or incremental approaches (like SGD) for building the model. the training time and F1 score. Regarding inference times The following batch methods have been tested: decision Logistic Regression, Decision Trees and Random Forest are trees, gradient boosting (LGBM), random forest, percep- the only Pareto optimal algorithms. The choice of algo- tron, multi-layer perceptron, and logistic regression [5]. rithm depends on the available processing power and time. For a system that has a lot of time and resources available, 4. RESULTS it would be best to use Random Forest as it has the high- est F1 score. In practice, this is not always feasible. For Results of the experiments are summarised in Figures 3, example, if the algorithm were used for an on-board system 4 and Table 1. Figures depict dependency of algorithm- on the satellite, we could not afford to save all the data and specific F1 score vs. its training and inference times. An would prefer to load only the model. With an incremental ideal algorithm would be located in the top left corner, algorithm, the data could be collected, processed and dis- achieving full F1 score with a training and inference time of carded while the acquired knowledge would be stored in the 0. Any algorithm that has no other algorithm in its top-left model. Another preference for HT would be in a wrapper quadrant (no algorithm is both more accurate and faster) feature selection algorithm [6]. This type of algorithms do belongs to a Pareto front, which means that this algorithm a lot of evaluations of the selected method. The main re- is optimal for a certain set of use-cases. sult is a subset of features that can later be used with other algorithms. The acquired set of features might be biased towards the method used, but the results would be obtained much faster. From the confusion matrix of the HT algorithm shown in Figure 5, we can see that shrubland is often wrongly classified as forest, bareland or grassland and vice versa. This is mainly due to the unclear distinction between these classes (e.g. shrubland can be anything between bareland and for- est) and poor ground truth data due to infrequent updates, low accuracy, and lack of detail (e.g. patch of land labeled as shrubland can also grassland and trees). The unclear dis- tinction between certain classes may also explain confusion between wetlands and shrubland or wetlands and grassland, as wetlands may be covered with grass or shrubs. The lack of detail also contributes to misclassification between grass- land and artificial surface, as not every small grassy area, such as park or lawn, is included in ground truth data. Fi- Figure 3: F1 score vs. training time of different nally, grass cultures, unused land overgrown by grass and models for predicting LULC classes. *Denotes in- rotation of crops are likely some of the reasons for confusion cremental algorithms. between cultivated land and grassland. 59 7. REFERENCES [1] D4.7 stream-learning validation report, May 2020. Perceptive Sentinel. 
[2] Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., et al. Sentinel-2: Esa’s optical high-resolution mission for gmes operational services. Remote sensing of Environment 120 (2012), 25–36. [3] Gómez, C., White, J. C., and Wulder, M. A. Optical remotely sensed time series data for land cover classification: A review. ISPRS Journal of Photogrammetry and Remote Sensing 116 (2016). [4] H2020 PereptiveSentinel Project. Eo-learn library. https://github.com/sentinel-hub/eo-learn. Accessed: 2019-09-06. [5] Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009. [6] Koprivec, F., Kenda, K., and Šircelj, B. Fastener feature selection for inference from earth observation Figure 5: Confusion matrix of HT based model for data. Entropy (Sep 2020). predicting LULC classes. [7] Koprivec, F., Peternelj, J., and Kenda, K. Feature Selection in Land-Cover Classification using Training Inference EO-learn. In Proc. 22th International Multiconference CA F1 time time (Ljubljana, Slovenia, 2019), vol. C, Institut ”Jožef LGBM 4.87 0.38 0.86 0.86 Stefan”, Ljubljana, pp. 37–40. Decision Tree 4.18 0.02 0.82 0.82 [8] Koprivec, F., Čerin, M., and Kenda, K. Crop Random Forest 7.53 0.14 0.87 0.87 Classification using Perceptive Sentinel. In Proc. 21th MLP 264.67 0.07 0.81 0.81 International Multiconference (Ljubljana, Slovenia, Logistic Regression 63.50 0.01 0.67 0.65 2018), vol. C, Institut ”Jožef Stefan”, Ljubljana, Perceptron 24.05 0.01 0.45 0.38 pp. 37–40. Hoeffding Tree* 0.44 0.06 0.79 0.79 [9] Oza, N. C. Online bagging and boosting. In 2005 Bagging of HT* 3.07 0.46 0.83 0.83 IEEE international conference on systems, man and Na¨ıve Bayes* 0.18 0.15 0.64 0.62 cybernetics (2005), vol. 3, Ieee, pp. 2340–2345. Logistic Regression* 0.31 0.08 0.15 0.07 [10] Slovenian ministry of agriculture. Mkgp - Perceptron* 0.33 0.07 0.14 0.04 portal. http://rkg.gov.si/. Accessed: 2020-08-11. [11] Valero, S., Morin, D., Inglada, J., Sepulcre, G., Table 1: Comparison of models for predicting LULC Arias, M., Hagolle, O., Dedieu, G., Bontemps, classes. *Denotes incremental algorithms. S., Defourny, P., and Koetz, B. Production of a dynamic cropland mask by processing remote sensing image series at high temporal and spatial resolutions. 5. CONCLUSIONS Remote Sensing 8(1) (2016), 55. In our approach we have concentrated on effective process- [12] Čerin, M., Koprivec, F., and Kenda, K. Early ing. Our goal was to provide methods and workflows which land cover classification with Sentinel 2 satellite can reduce the need for extensive hardware and processing images and temperature data. In Proc. 22th power. Our goal was focused on use cases where a near state- International Multiconference (Ljubljana, Slovenia, of-the-art accuracy can be achieved with only a fraction of 2019), vol. C, Institut ”Jožef Stefan”, Ljubljana, the processing power required by the state-of-the-art. We pp. 45–48. have researched stream mining algorithms. We have shown [13] that these algorithms, even if they are not the most accurate Waldner, F., Canto, G. S., and Defourny, P. Automated annual cropland mapping using or the fastest, take their place at the Pareto front in a multi- knowledge-based temporal features. ISPRS Journal of target environment, which means that some users might find Photogrammetry and Remote Sensing 110 (2015). 
them suitable for their needs and that they provide the best [14] results for particular computational demand. Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., and Fraundorfer, F. Deep learning in 6. ACKNOWLEDGMENTS remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing This work was supported by the Slovenian Research Agency Magazine 5, 4 (2017), 8–36. and the ICT program of the EC under project PerceptiveSen- tinel (H2020-EO-776115) and project EnviroLENS (H2020- DT-SPACE-821918). 60 Predicting bitcoin trend change using tweets Jakob Jelencic Artificial Intelligence Laboratory Jozef Stefan Institute and Jozef International Postgraduate School Ljubljana, Slovenia jakob.jelencic@ijs.si ABSTRACT by people’s trust in it. Which means that possible up or Predicting future is hard and challenging task. Predict- down trends could be predicted by understanding sentiment ing financial derivative that one can benefit from is even of people tweets related to Bitcoin and other cryptocurrencies. more challenging. The idea of this work is to use informa- Tweets data-set is combined with classical Open-High-Low- tion contained in tweets data-set combined with standard Close [OHLC] data-set for 5 minute time periods. OHLC Open-High-Low-Close [OHLC] data-set for trend prediction data-set contain information about opening and closing price of crypto-currency Bitcoin [XBT] in time period from 2019- of given time period, its maximum and minimum price during 10-01 to 2020-05-01. A lot of emphasis is put on text prepro- observed time period and sum of volume and number of cessing, which is then followed by deep learning models and transactions made [4]. This present additional information concluded with analysis of underlying embedding. Results how the market is behaving at any given point. were not as promising as one might hope for, but they present a good starting point for future work. In financial mathematics derivatives are usually modeled with some kind of stochastic process. Most commonly some 1. INTRODUCTION form of Brownian motion is used. In theory increment in Twitter is an American microblogging and social network- Brownian motion is distributed as N (µ, Σ) independent from ing service on which users post and interact with messages previous increment. This implies that prediction of a real known as ”tweets”. Registered users can post, like, and time price change of a derivative is not possible, so the target retweet tweets, but unregistered users can only read them. goal should be changed accordingly. Instead of predicting the Users access Twitter through its website interface, through impossible, the goal of this work is to predict a change in a Short Message Service (SMS) or its mobile-device application trend. Trend is calculated with exponential moving average, software. Tweets were originally restricted to 140 characters, application of it can be observed in Figure 1. but was doubled to 280 for non-CJK languages in Novem- ber 2017. People might post a message for a wide range of Definition: Exponential moving average: reasons, such as to state someone’s mood in a moment, to n−1 advertise one’s business, to comment on current events, or X EMA(TS , n) = α · ( (1 − α)iTSn−i ), to report an accident or disaster [5]. i=0 Bitcoin is a cryptocurrency. It is a decentralized digital 2 currency without a central bank or single administrator that α = . 
Figure 1: Example of exponential moving average.

Figure 2: Example of working dataset.

2. DATA DESCRIPTION
The collected tweets range from 2019-10-01 to 2020-05-01. We filtered the tweets by crypto-related hashtags. Originally the tweets contained multilingual data, but only the English ones were extracted. The data-set still resulted in more than 5,000,000 tweets over a little more than half a year. Dealing with such a big data-set has proven to be too difficult a task, but since a lot of tweets are just pure noise, the data-set can be reduced. The idea is to extract the tweets with the largest target audience. Since the data-set contains the number of each tweet author's friends and followers, we extracted the tweets with the maximum sum of both in each 5-minute period. Unfortunately, the crypto world is relatively anonymous, so there is no Warren Buffett-like personality to whom we could give extra weight.

Then we concatenated the reduced tweets with the 5-minute OHLC data-set. A snapshot can be observed in Figure 2. The column names should be pretty self-explanatory, except for "tw1", "tw2", "tw3", which stand for metadata about the tweets, and "ama", which stands for the current movement of the trend. Continuous features are then normalized, and "ama" is shifted one step into the future so that it forms the target variable. The regression task had the most success with predictions.

3. TWEETS PROCESSING
The aim of this chapter is to focus on processing tweets. Tweets differ from regular text data, since many of them contain hyperlinks, hashtags, abbreviations, grammar mistakes and so on. This excludes any pre-built preprocessing tools, like the one available in the deep learning library Tensorflow [1], which is used for building the deep learning models. In Figure 2 we can see an example of some tweets. The cleaning process was executed in the order stated below. For each tweet the following steps were executed:

• Escape characters were removed.
• The tweet was split by " ".
• All non-alphanumeric characters were removed, including "#".
• All characters were converted to lower case.
• Usual stop-words were removed.

At this point the data-set contained over 200,000 different tokens, which is far too sparse for such a limited data-set. The empirical cumulative distribution function was therefore calculated, and all tokens with fewer than 50 appearances were removed. The dictionary size is now 2150.

Another thing to consider is how to process numbers that appear in the text. Obviously a separate token for each number is not acceptable, since it would negate all the work done so far. The following mapping was applied to process numbers: 5 more tokens were created and numbers from a given interval were assigned the corresponding token.

• Small number: X < 1000.
• Medium number: X ∈ [1000, 10000).
• Semi big number: X ∈ [10000, 100000).
• Big number: X ∈ [100000, 1000000).
• Huge number: X ≥ 1000000.

An additional masking token was assigned for missing data. This wraps up the dictionary; its final length is 2156.
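A minimal sketch of the cleaning and number-bucketing steps described above; the stop-word list and the bucket token names are placeholders, not the exact ones used by the author.

```python
# Minimal sketch of the tweet cleaning and number-bucketing steps.
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "in"}   # placeholder list

def bucket_number(value):
    if value < 1_000: return "<small_number>"
    if value < 10_000: return "<medium_number>"
    if value < 100_000: return "<semi_big_number>"
    if value < 1_000_000: return "<big_number>"
    return "<huge_number>"

def clean_tweet(text):
    text = text.replace("\n", " ").replace("\t", " ")        # drop escape characters
    tokens = []
    for raw in text.split(" "):                               # split by " "
        tok = re.sub(r"[^0-9a-zA-Z]", "", raw).lower()        # keep alphanumerics, lower-case
        if not tok or tok in STOP_WORDS:                      # drop empties and stop-words
            continue
        tokens.append(bucket_number(float(tok)) if tok.isdigit() else tok)
    return tokens

print(clean_tweet("Bitcoin just broke $10000! #BTC to the moon"))
# ['bitcoin', 'just', 'broke', '<semi_big_number>', 'btc', 'moon']
```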
The last thing in processing tweets is to handle their length, since not all tweets are equally long. One idea is to take the maximum length of all tweets and mask the others so they all have the same length. Unfortunately this would take a lot of unnecessary space, which is a problem; a long tweet also does not necessarily mean an informative tweet. Figure 3 plots the empirical cumulative distribution function of the tweets' length.

Figure 3: Histogram of tweets' length.

No additional manipulation of tokens was done. It is known that the tokens "bitcoin" and "btc" mean the same and could be joined into one token, but they are left intact and the deep learning model will decide whether they are the same or not.

4. DEEP LEARNING MODELS
The obvious choice for text models are recurrent neural networks, more specifically Long Short-Term Memory [LSTM] recurrent networks [2]. They are usually combined with embedding layers, which transform a single token into a vector of arbitrary size [6]. Since the task at hand is predicting the future, there is no good benchmark metric or model which could serve as a threshold for our model's performance. So, in order to see whether the tweets contribute anything, we decided to build a shallow neural network on just the OHLC data, which serves as a benchmark model. 80% of the data-set was taken as the training set and the remainder was left out for validation; the split was the same for both models. Both times we used the Adam optimizer [3] and mean-squared error [MSE] as the loss function. Training was stopped as soon as the validation loss did not improve for 10 epochs. The batch size was 256.

Structure of the benchmark model:

• Input dense layer with 32 neurons.
• Stacked dense layer with 32 neurons.
• Stacked dense layer with 32 neurons.
• Output dense layer with 1 neuron.

Structure of the tweets model (a code sketch of this architecture is given at the end of this section):

• Input embedding layer of size 64 (tweets).
• Stacked LSTM layer with 128 neurons.
• Stacked LSTM layer with 128 neurons.
• Second input layer with 64 neurons (OHLC).
• Concatenation.
• Stacked dense layer with 64 neurons.
• Output dense layer with 1 neuron.

The loss curve of the benchmark model can be observed in Figure 4, while that of the tweets model can be observed in Figure 5. Orange represents the training set and blue the validation set. It is clear that the tweets model behaved a lot worse on the training set than the benchmark model, but on the test set it has a slightly lower MSE (benchmark: 13.78, tweets: 13.74). This implies that there is a lot of reserve in the fitting of the tweets model, since the difference between the training and validation loss is so big. That is good, since otherwise it would seem that the tweets do not contribute much to the prediction. It is also worth noting that the tweets model took much longer to learn, around 380 epochs compared to the benchmark model's 40.

Figure 4: Loss process of benchmark model.

Figure 5: Loss process of tweets model.
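A minimal Keras sketch of the two-input tweets model listed above, under the stated vocabulary size of 2156; the sequence length, the number of OHLC features and the activations are assumptions, since the paper does not specify them.

```python
# Minimal Keras sketch of the two-input tweets model described above.
from tensorflow.keras import layers, Model

vocab_size, seq_len, n_ohlc = 2156, 60, 7    # seq_len and n_ohlc are placeholders

tweet_in = layers.Input(shape=(seq_len,), name="tweets")
x = layers.Embedding(vocab_size, 64, mask_zero=True)(tweet_in)   # embedding of size 64
x = layers.LSTM(128, return_sequences=True)(x)                   # stacked LSTM, 128 units
x = layers.LSTM(128)(x)                                          # second LSTM, 128 units

ohlc_in = layers.Input(shape=(n_ohlc,), name="ohlc")
o = layers.Dense(64, activation="relu")(ohlc_in)                 # second input branch

h = layers.concatenate([x, o])                                   # concatenation
h = layers.Dense(64, activation="relu")(h)
out = layers.Dense(1)(h)                                         # regression output

model = Model([tweet_in, ohlc_in], out)
model.compile(optimizer="adam", loss="mse")                      # Adam + MSE, as in the text
model.summary()
```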
5. ANALYSIS OF UNDERLYING EMBEDDING MATRIX
We extracted the underlying embedding matrix from the tweets model. Since the model tried to minimize the mean-squared error [MSE] between the predicted and the actual trend, the embedding matrix was adjusted according to the derivative of the MSE. For the analysis we use cosine similarity as a metric. If two words are close in the embedding matrix, this does not mean that they are semantically similar in the sense of everyday language; it means that they are similar in the context of Bitcoin trend prediction. For example, if the model converged perfectly and the tokens "bitcoin" and "eth" had a cosine similarity near 1, that would mean that they both have a similar impact on the Bitcoin trend, which is not so hard to believe, since it is known that all crypto-currencies are heavily correlated with one another. Table 1 shows the cosine similarity of some of the most common tokens in the dictionary.

Token pair                     Similarity
bitcoin, crypto                0.472
blockchain, entrepreneur       0.561
crypto, cryptocurrency         0.519
cryptocurrency, blockchain     0.560
volume, social media           0.508
ethereum, blockchain           0.557

Table 1: Cosine similarity pairs of most common tokens.

We cannot be completely satisfied with the results, but for such a limited data-set they are not that bad. As with any embedding evaluation, there is a certain amount of subjectivity in judging what is good and what is not.

In order to gain a better perspective of the obtained embedding, we computed a T-distributed stochastic neighbor embedding projection to 2 dimensions and plotted the 100 nearest pairs. The projection can be observed in Figure 6.

Figure 6: TSNE projection of embedding matrix.
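A minimal scikit-learn sketch of this analysis, using a random placeholder matrix and a toy vocabulary in place of the learned embedding and dictionary.

```python
# Minimal sketch of the embedding analysis: cosine similarities between token
# vectors and a 2-D t-SNE projection. The embedding and vocabulary are placeholders.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
vocab = ["bitcoin", "crypto", "blockchain", "eth", "volume"]      # placeholder tokens
embedding = rng.normal(size=(len(vocab), 64))                     # placeholder 64-d vectors

sim = cosine_similarity(embedding)                                # pairwise cosine similarities
i, j = vocab.index("bitcoin"), vocab.index("crypto")
print(f"sim(bitcoin, crypto) = {sim[i, j]:.3f}")

proj = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embedding)
print(proj.shape)                                                 # (len(vocab), 2)
```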
6. CONCLUSION
While the obtained model cannot serve as a production model for automatic trading, it presents a nice opportunity for future work. We will continue to collect tweets and hopefully, with time, build a more accurate data-set and, with some hyper-parameter tuning of the tweets models, achieve improved prediction.

7. ACKNOWLEDGMENTS
This work was financially supported by the Slovenian Research Agency.

8. REFERENCES
[1] TensorFlow. https://www.tensorflow.org/.
[2] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[3] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. 2014. https://arxiv.org/abs/1412.6980.
[4] J. J. Murphy. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications. New York Institute of Finance Series. New York Institute of Finance, 1999.
[5] R. Nugroho, C. Paris, S. Nepal, J. Yang, and W. Zhao. A survey of recent methods on deriving topics from Twitter: algorithm to evaluation. Knowledge and Information Systems, pages 1–35, 2020.
[6] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, NJ, third edition, 2010.


Large-Scale Cargo Distribution

Luka Stopar, PhD, Researcher, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija, luka.stopar@ijs.si
Luka Bradesko, PhD, Researcher, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija, luka.bradesko@ijs.si
Tobias Jacobs, PhD, Senior Researcher, NEC Laboratories Europe GmbH, Kurfürsten-Anlage 36, 69115 Heidelberg, tobias.jacobs@neclab.eu
Azur Kurbašić, Researcher, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija, azurkurbasic@gmail.com
Miha Cimperman, PhD, Researcher, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija, miha.cimperman@ijs.si

ABSTRACT
This study focuses on the design and development of methods for generating cargo distribution plans for large-scale logistics networks. It uses data from three large logistics operators while focusing on cross-border logistics operations using one large graph. The approach uses a three-step methodology to first represent the logistics infrastructure as a graph, then partition the graph into smaller regions, and finally generate cargo distribution plans for each individual region. Regional graphs are extracted from the initial graph representation by spectral clustering and are then further used for computing the distribution plan.

The approach introduces methods for each of the modelling steps. The proposed regionalization of a large logistics infrastructure for generating partial plans enables scaling to thousands of drop-off locations. Results also show that the proposed approach scales better than the state of the art, while preserving the quality of the solution.

Our methodology is suited to address the main challenge in transforming rigid, large logistics infrastructures into dynamic, just-in-time, and point-to-point delivery-oriented logistics operations.

Keywords
Logistics, graph construction, vehicle routing problem, spectral clustering, optimization heuristics, discrete optimization.

1. INTRODUCTION
The complexity of operations in the logistics sector is growing, and so is the level of digitalization of the industry. With data-driven logistics, dynamic optimization of basic logistics processes is at the forefront of the next generation of logistics services.

Finding optimal routes for vehicles is a problem which has been studied for many decades from a theoretical and practical point of view; see [2] for a survey. The most prominent case is the Traveling Salesperson Problem (TSP), where the shortest route for visiting n locations using a single vehicle has to be determined. The Vehicle Routing Problem (VRP) is typically associated with a generalization of TSP where multiple vehicles are available. This class of routing problems is notoriously hard; it not only falls into the class of NP-complete problems, but in practice it cannot be solved optimally even for moderate instance sizes.

Nevertheless, due to its practical importance, many heuristics and approximation algorithms for the vehicle routing problem have been proposed. Bertsimas et al. [3] propose an integer-programming-based formulation of the taxi routing problem and present a heuristic based on a max-flow formulation, applied in a framework which allows serving 25,000 customers per hour. A heuristic based on neighborhood search has been presented by Kytöjoki et al. in [4] and evaluated on instances with up to 20,000 customers. A large number of nature-inspired optimization methods have been applied to VRP, including genetic algorithms [7], particle swarm optimization [8], and honey bees mating optimization [9].

The particular approach of partitioning the input graph for VRP has been proposed by Ruhan et al. [5]. Here k-means clustering is combined with a re-balancing algorithm to obtain areas with a balanced number of customers. Bent et al. study the benefits and limitations of vehicle- and customer-based decomposition schemes [6], demonstrating better performance with the latter.

In this paper, we present a methodology for large-scale parcel distribution that combines optimization methods with large-graph clustering. The paper is structured as follows. In Section 2, we present the technical details of the proposed methodology; we explain the algorithms and data structures used in each of the steps and discuss the interfaces required to link the steps into a working system. In Section 3, we demonstrate the performance of our methodology on two real-world use cases and compare it to the state of the art on synthetic datasets. Finally, in Section 4 we summarize the key findings, including the strengths and limitations of the proposed approach.
2. METHODOLOGY

2.1 Overview
In this section, we present the details of the proposed methodology for large-scale cargo distribution planning. The methodology, illustrated in Figure 1, uses a three-step, divide-and-conquer approach to cargo distribution, where we reduce the size of the optimization problem by (i) abstracting the physical infrastructure into a sparse graph representation, (ii) partitioning the graph into smaller chunks (i.e. regions) and (iii) planning the distribution in each region independently. This allows us to run the optimization on large graphs while producing better local results.

Figure 1: Three-step methodology for logistics optimization.

Initially, we create a representation of the physical infrastructure as an abstract graph, representing each pickup and drop-off location as a node, with edges being the shortest road connections between them. Next, we partition the abstract graph with a spectral partitioning approach. The method is an adaptation of [10] to graphs, where we use the first k eigenvalues and eigenvectors of the graph's Laplacian to construct the partitions. In each partition, we construct a distribution plan using an iterative search algorithm. From an initial solution, the algorithm constructs a linear search path by changing the position of a node in the distribution plan. To avoid local minima, it uses design-time blacklist rules which prevent the algorithm from oscillating in a local neighborhood. Each step is described in more detail in the following sections.

2.2 Graph Construction
For graph construction, the Dijkstra SPF algorithm [11] was applied to identify neighbor relationships between the nodes in the OpenStreetMaps (OSM) dataset and construct the graph representation. By mapping post offices to the closest node on OSM, we tag the post office nodes for the SPF search.

The search frontier is a baseline for the SPF procedure and represents the list of nodes whose graph neighbors are to be searched. The final graph is built by iterating with the SPF procedure through the list of all post offices in the physical infrastructure (graph nodes) and consolidating the results into the final sparse matrix; each iteration computes one row of the matrix.

2.3 Graph Partitioning
The partitioning step first represents the graph as a transition rate matrix (Q)_ij = q_ij, where q_ij is the rate of going from node i to node j and is computed as the inverse of the minimal travel time (obtained from step 1) between the two nodes. With this approach, the rate of going from i to j is expressed as the number of possible trips that a driver can make between the two locations in one hour.

The algorithm works by approximating the minimal k-cut of the graph, removing its edges and thus reducing the graph to k disconnected components. We adapt a spectral partitioning algorithm introduced in [10] to graphs. The algorithm first symmetrizes the transition rate matrix as Q_s = (Q + Q^T) / 2, to ensure real-valued eigenvalues, and computes its Laplacian

    L = I − diag(Q_s · 1)^(−1) · Q_s,

where 1 denotes the all-ones vector. Next, it computes the k eigenvectors of L corresponding to the smallest k eigenvalues. It then discards the eigenvector corresponding to λ1 = 0 and assembles the eigenvectors v2, v3, …, vk, corresponding to eigenvalues λ2 ≤ λ3 ≤ … ≤ λk, as columns of a matrix V. The rows of V are then normalized and used as input to the k-means clustering algorithm, which constructs the final partitions.
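A minimal numpy/scikit-learn sketch of the partitioning step as described in this subsection; the transition-rate matrix Q below is a random placeholder rather than one derived from real travel times.

```python
# Minimal sketch of the spectral partitioning step from Section 2.3.
import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(Q, k):
    Qs = 0.5 * (Q + Q.T)                                       # Qs = (Q + Q^T) / 2
    L = np.eye(len(Qs)) - np.diag(1.0 / Qs.sum(axis=1)) @ Qs   # L = I - diag(Qs*1)^-1 Qs
    eigvals, eigvecs = np.linalg.eig(L)
    order = np.argsort(eigvals.real)                           # ascending eigenvalues
    V = eigvecs[:, order[1:k]].real                            # drop v1 (lambda_1 = 0), keep v2..vk
    V = V / np.linalg.norm(V, axis=1, keepdims=True)           # row-normalise
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)      # final partitions

rng = np.random.default_rng(0)
Q = rng.random((12, 12))                                       # toy rate matrix
np.fill_diagonal(Q, 0.0)
print(spectral_partition(Q, k=3))
```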
2.4 Vehicle Routing
The vehicle routing step uses Tabu search [12] to construct the distribution plan. Starting with an initial solution, Tabu search constructs a linear search path by iteratively improving the solution in a greedy fashion until a stopping criterion is met. To avoid converging to local minima, Tabu search blacklists recent moves and/or solutions for one or more iterations using design-time rules.

In each iteration, the search process generates new candidate solutions by removing a node from its current route and placing it after one of the other nodes in the graph, possibly on a different route. To mitigate the scaling problems associated with generating O(n^2) possible moves in each step, the algorithm only considers a handful of moves. Specifically, the probability of considering placing node i after node j is proportional to the inverse of the Euclidean distance d(i, j) between the nodes.

Like other local search algorithms, Tabu search starts from an initial feasible solution, which is constructed using a construction-based heuristic algorithm. The heuristic procedure iteratively selects a node and places it after one of the other nodes in a way that minimizes the travel distance. The procedure iterates until all values are initialized.
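A minimal sketch of the relocation neighbourhood with inverse-distance sampling and a short tabu list, as described above; the distance matrix, initial routes and stopping rule are placeholders, and the acceptance rule is simplified with respect to a full Tabu search.

```python
# Minimal sketch of the relocation move and tabu list from Section 2.4.
import numpy as np

rng = np.random.default_rng(0)
n = 30
pts = rng.random((n, 2)) * 100
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)    # Euclidean distances

def plan_cost(routes):
    return sum(dist[a, b] for r in routes for a, b in zip(r, r[1:]))

def propose_move(routes, n_candidates=20):
    """Relocate node i after node j, with P(j) proportional to 1/d(i, j)."""
    best, nodes = None, [v for r in routes for v in r]
    for _ in range(n_candidates):
        i = nodes[rng.integers(len(nodes))]
        others = [v for v in nodes if v != i]
        w = 1.0 / dist[i, others]
        j = others[rng.choice(len(others), p=w / w.sum())]
        new = [[v for v in r if v != i] for r in routes]       # remove i ...
        for r in new:
            if j in r:
                r.insert(r.index(j) + 1, i)                    # ... and re-insert after j
                break
        cand = (plan_cost(new), (i, j), new)
        if best is None or cand[0] < best[0]:
            best = cand
    return best

routes = [list(range(0, 15)), list(range(15, 30))]             # placeholder initial plan
tabu, tabu_len = [], max(1, int(0.05 * n))                     # tabu list: 5% of locations
for _ in range(200):                                           # placeholder stopping rule
    cost, move, new_routes = propose_move(routes)
    if move in tabu:
        continue                                               # skip blacklisted moves
    routes, tabu = new_routes, (tabu + [move])[-tabu_len:]
print(round(plan_cost(routes), 1))
```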
3. DEMONSTRATION AND RESULTS
In this section, we demonstrate the effectiveness of the proposed methodology on two real-world use cases and compare the methodology to the state of the art in vehicle routing. The first pilot included two national logistics operators, namely Hrvatska Pošta (Croatia) and Pošta Slovenije (Slovenia). As the main focus of future logistics in Europe is to operate as one large homogeneous logistics infrastructure, the two infrastructures were considered as one logistics graph. The second pilot included the Hellenic Post (Greece) graph representation and data.

In initial testing, simulated data were used for modelling parcel flow with graph abstraction, graph processing, and optimization responses. The final instances were constructed from real infrastructure data to test the functionalities. The results are presented in the following subsections.

3.1 Evaluation on Large Synthetic Graphs
We now demonstrate the scalability of the proposed methodology by comparing its performance to the performance of the baseline Tabu search algorithm on synthetic graphs of various sizes, comparing both algorithms' running time and the total travel time in the generated cargo distribution plan. Our results show that the proposed methodology enables fast generation of distribution plans on graphs of up to 10,000 nodes, while also improving the quality of the generated result.

We simulate the logistics infrastructure by generating random planar graphs representing the road network and drop-off locations. First, we generate a cluster of n drop-off locations by sampling a Gaussian distribution around k randomly chosen locations. Next, we connect the locations with a Delaunay triangulation [13], resulting in a planar graph. We compute the distance between two locations using the Euclidean metric and assign a 50 km/h speed limit to intra-city edges and a 90 km/h speed limit to inter-city edges. Part of a synthetic graph with 10,000 nodes is shown in Figure 2.

Figure 2: Representation of simulated graph with 10,000 nodes.

Table 1 summarizes the computation times of the proposed method along with the quality of the generated distribution plan and compares the results to Tabu search without prior clustering. We measure the quality of the generated distribution plan as the distance travelled by all vehicles according to the plan. In each row, we show the average of 10 trials on 10 different graphs.

Graph size   Proposed methodology                Tabu search
             Running time   Travel dist. [km]    Running time   Travel dist. [km]
1000         6.07 min       64.7k                0.76 min       85.5k
2000         10.07 min      122.9k               2.98 min       160.8k
5000         30.14 min      259.2k               60.04 min      428.2k
7000         39.29 min      377.9k               166.79 min     577.1k
10000        55.64 min      552.2k               10.78 h        845.1k

Table 1: Comparison of efficiency of Tabu search and proposed methodology.

For the experiments we used a Tabu list with a length of 5% of the entities (locations) that the algorithm must check, and terminated the algorithm when there was no improvement in the solution for more than 10 seconds.

On large graphs, we see that the proposed methodology significantly reduces the computation time while preserving the quality of the result. The proposed methodology reduces the computation time on graphs larger than 5k nodes, providing a substantial saving of 91% on graphs with 10k nodes. We also observe that the quality of the output slightly improved when applying our divide-and-conquer methodology over Tabu search. The improvement ranges between 23% and 40% and is largely attributed to the significantly reduced search space in the partitions as compared to the entire graph.

3.2 Testing the instances on pilot use cases
The methods presented and tested on synthetic graphs were also tested on data from two pilot scenarios, namely Slovenian-Croatian post (Pošta Slovenije & Hrvatska Pošta) and Hellenic Post (Greece). In the pilot use cases, the analytical pipeline is used to process ad-hoc events in the logistics infrastructure. The ad-hoc events were structured into three categories: a new parcel request (ad-hoc order), an event on distribution objects (vehicle breakdown) and events related to changes at border crossings – border closed (cross-border event).

The instances built on simulated data were loaded with OpenStreetMaps data for abstraction of the real infrastructure description into the graph representation, as illustrated in Figure 4.

Figure 4: A region of Pošta Slovenije graph representation, using OpenStreetMap.

A similar approach was used for the case of Hellenic Post, where the OSM data for the region of Greece were loaded into the graph abstraction instance. For traffic modelling of the vehicles, the SUMO simulator [14] was used with the regional map. For graph manipulations, the SIoT infrastructure was used to generate the social graph when an ad-hoc event was triggered. The social graph represented all entities (vehicles, etc.) in the infrastructure that are in scope to be included in event processing. In this way, distribution objects were mapped to the physical infrastructure for loading into the graph representation for further optimization and distribution plan estimation.
Figure 4: Processing an ad-hoc order in a pilot scenario, using the SUMO simulator.

An example of the social graph generation and ad-hoc event processing is presented in Figure 4, where a new ad-hoc request is processed by SIoT and the analytical pipeline.

The results show that abstracting the logistics infrastructure and clustering the graph into regional structures enabled real-time processing of complex events in the logistics infrastructure. The response time for processing an ad-hoc event in regions of between 50 and 100 nodes was between 20 and 30 seconds, which is relatively fast compared to alternatively processing 1000 nodes or more.

4. CONCLUSION
In this paper, we presented an approach for generating cargo distribution plans on large logistics infrastructures. Our results show that the proposed approach can scale to graphs of up to 10,000 nodes in practical time while preserving and even slightly improving the quality of the result.

Since the main use case of logistics is point-to-point regional delivery and just-in-time delivery, these new services are oriented exactly to regional logistics optimization. More importantly, the approach makes it possible to process ad-hoc events, such as new parcel delivery requests, events related to distribution vehicles, or events related to the infrastructure. The ad-hoc event processing includes manipulating the graph representation and running the optimization methods in real time. Since our method clusters and regionalizes large graphs, such an approach can enable real-time processing of events on large graphs by limiting the changes to the affected regional parts of the infrastructure.

However, while our approach can be combined with several state-of-the-art methods, its main drawback remains the inability to generate inter-region routes, making it suitable only for local and last-mile distribution plans. Future work will focus on investigating the generation of inter-region plans and connecting multiple regions into one distribution plan. Some of the options include introducing border checkpoints where cargo can be handed over to vehicles of neighboring regions, using dedicated inter-region "highway" channels, and using dedicated vehicles for cross-region deliveries.
5. ACKNOWLEDGEMENTS
This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 769141, project COG-LO (COGnitive Logistics Operations through secure, dynamic and ad-hoc collaborative networks).

6. REFERENCES
[1] European Commission. (2015). Fact-finding studies in support of the development of an EU strategy for freight transport logistics. Lot 1: Analysis of the EU logistics sector.
[2] Kumar, Suresh Nanda, and Ramasamy Panneerselvam. "A survey on the vehicle routing problem and its variants." (2012).
[3] Bertsimas, Dimitris, Patrick Jaillet, and Sébastien Martin. "Online vehicle routing: The edge of optimization in large-scale applications." Operations Research 67.1 (2019): 143-162.
[4] Kytöjoki, Jari, et al. "An efficient variable neighborhood search heuristic for very large scale vehicle routing problems." Computers & Operations Research 34.9 (2007): 2743-2757.
[5] He, Ruhan, et al. "Balanced k-means algorithm for partitioning areas in large-scale vehicle routing problem." 2009 Third International Symposium on Intelligent Information Technology Application. Vol. 3. IEEE, 2009.
[6] Bent, Russell, and Pascal Van Hentenryck. "Spatial, temporal, and hybrid decompositions for large-scale vehicle routing with time windows." International Conference on Principles and Practice of Constraint Programming. Springer, Berlin, Heidelberg, 2010.
[7] Razali, Noraini Mohd. "An efficient genetic algorithm for large scale vehicle routing problem subject to precedence constraints." Procedia - Social and Behavioral Sciences 195 (2015): 1922-1931.
[8] Marinakis, Yannis, Magdalene Marinaki, and Georgios Dounias. "A hybrid particle swarm optimization algorithm for the vehicle routing problem." Engineering Applications of Artificial Intelligence 23.4 (2010): 463-472.
[9] Marinakis, Yannis, Magdalene Marinaki, and Georgios Dounias. "Honey bees mating optimization algorithm for the vehicle routing problem." Nature Inspired Cooperative Strategies for Optimization (NICSO 2007). Springer, Berlin, Heidelberg, 2008. 139-148.
[10] Ng, A., Jordan, M., and Weiss, Y. "On Spectral Clustering: Analysis and an algorithm". Advances in Neural Information Processing Systems. MIT Press, 2001. 849-856.
[11] Dijkstra, E. W. A note on two problems in connexion with graphs. Numerische Mathematik 1(1), 269–271, 1959.
[12] Fred Glover and Manuel Laguna. Handbook of Combinatorial Optimization, Vol. 3, 1998.
[13] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications, Third Edition, 2008.
[14] http://sumo.sourceforge.net


Amazon forest fire detection with an active learning approach

Matej Čerin, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia, matej.cerin@ijs.si
Klemen Kenda, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia, klemen.kenda@ijs.si

ABSTRACT
Wildfires are a growing problem in the world. With climate change, fires have a larger range and are harder to put down. It is therefore important to find a way to detect and monitor fires in real time. In this paper, we explain how satellite images can be combined with active learning to obtain an accurate classifier for forest fires. To build the classifier we used an active-learning-like approach: we trained the classifier on one labeled image, then used the classifier to classify a set of images, manually inspected the images, relabeled wrongly classified examples and built a new classifier. In the paper we show that in a few iteration steps we can get a classifier that identifies wildfires with good accuracy.

Keywords
remote sensing, earth observation, active learning, rain forest, wildfires, machine learning, feature selection, classification

1. INTRODUCTION
In recent years wildfires have become a growing problem for the world. Each year the number of forest fires around the world grows; recently we have seen a growing number of fires in the Amazon, Australia, Africa and Siberia. Because of global warming and high temperatures, wildfires have a bigger range and are also harder to put out. Forest fires are partially responsible for air pollution [12] and for the loss of habitat for animals. The Amazon rain forest is also called the lungs of the world because of the oxygen production by its trees; the loss of forest is also connected to a higher chance of floods and landslides [6]. Therefore the classification and monitoring of wildfires is an important task. It is important to know the time series of the spread of a fire: with that knowledge we can create models for future fire events and plan measures in case of wildfire.

Satellite images are a good source for the observation of land type [5], and they could therefore be used for monitoring forest fires. Fires can be detected on satellite images, but the area of the Amazon is big and it would take a lot of time to manually label areas burned by forest fires. Therefore we should develop an algorithm that can detect fires.

There are already existing algorithms for fire detection using satellite images [6, 11]; they inspect changes on satellite images to detect fires. Our solution to the problem is to use machine learning. Because we do not have a prepared labeled data-set, an active-learning-like approach is our next candidate.

Active learning is the approach used when labeled data are unavailable and labeling data is too expensive or time-consuming. The algorithm starts with a small labeled data set and then uses its predictions to train itself again; that way the algorithm can teach itself. Algorithms usually need additional input for some data points. In these cases, a human should label those data, and the algorithm can then correct its predictions. The active learning approach is used in many use cases (speech recognition, information extraction, classification, ...) and over the years it has proved to work relatively well [8].

In this paper we use an active-learning-like approach to classify wildfires. Following the principle of active learning, we label a small subset of the data and then train a classifier. We then manually check the classification results and correct the wrongly classified examples, use the new, bigger data-set to train a new classifier, and continue with iterations until we are satisfied with the results. That way we can iteratively get a good classifier without labeling huge amounts of data.
2. DATA

2.1 Data Acquisition
In this article we use data from the ESA Sentinel-2 mission [3]. The Sentinel-2 mission produces satellite images in 13 different spectral bands, with wavelengths of observed light from approximately 440 nm to 2200 nm. The spatial resolution is between 10 and 60 m. The mission consists of two satellites that circle the Earth with a 180° phase; one point on the Earth's surface is visited at least once every five days. In the future we could also use some other satellite data sources, such as those available at www.planet.com [1]. Those data have a revisit time of 1 day and might be an even better candidate for accurate monitoring of wildfires.

To download the data we use the eo-learn library [9], which integrates the sentinelhub [10] library used to access satellite data. Data were downloaded for the year 2019, with a spatial resolution of 30 m. The 30 m resolution was chosen because burned areas usually extend over a much bigger area than 30 m, so a higher resolution would not help us identify forest fires, while the processing of each image would take significantly more time.

2.2 Data Preprocessing
ESA already performs most of the preprocessing steps, such as atmospheric reflectance correction and projection [4], so the data are already clean and ready for use. For our experimentation purposes we filtered out clouds; for that purpose we used models available in the eo-learn library.

In our experiments we used all spectral bands, but the earth observation community has developed many different indices that can be calculated from the raw spectral bands and used as features in machine learning experiments. The indices that we used are NDVI, SAVI, EVI, NDWI, and NBR, defined in papers [7, 2]. As our feature vector we used all 13 raw bands and the mentioned indices.
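As an illustration of this feature construction, the following minimal numpy sketch computes two of the listed indices (NDVI and NBR) from Sentinel-2 reflectance bands; the band arrays below are random placeholders.

```python
# Minimal sketch of two spectral indices used as features (NDVI and NBR).
import numpy as np

def ndvi(nir, red):
    return (nir - red) / (nir + red + 1e-8)           # NDVI from B8 (NIR) and B4 (red)

def nbr(nir, swir):
    return (nir - swir) / (nir + swir + 1e-8)         # NBR from B8 (NIR) and B12 (SWIR)

rng = np.random.default_rng(0)
bands = {name: rng.random((100, 100)) for name in ("B04", "B08", "B12")}  # placeholders

features = np.stack([ndvi(bands["B08"], bands["B04"]),
                     nbr(bands["B08"], bands["B12"])], axis=-1)
print(features.shape)                                  # (100, 100, 2)
```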
The other figure that we checked was image color images of the same area before, during and with RGB colors plotted Sentinel-2 bands 12, 11, and 3 (false after the fire. These kinds of images can be used to color). Here most of the image is usually in shades of green. manually determine burned areas. The burned area is dark gray color and the area currently burning is yellow or orange (Figure 2). With those two images, we have no problem checking if the area is burned or only images, where the classifier classified fire. That is be- not. cause we noticed that the classifier already, in the beginning, finds fire, but it picked up some other areas and objects as We experimented with two different approaches. In the first fire as well. Therefore we need to find those images and label approach, we evaluated the results of classification for each them as not fire. pixel and in the second experiment, we evaluated the aver- age result for a bigger area determined with the clustering 4. We used a false-positive set to add to data-set the pix- algorithm. els that the classifier classified wrongly and true positive examples to keep the data-set balanced. We chose in each The classifier used in our experiment was logistic regression. iteration the two values for the probability of prediction in We used it because it is quite an accurate classifier for earth logistic regression. The first value was used to determine in observation and it can assess how strong the prediction is. false-positive images to find pixels that were classified with a probability above that value to add those pixels in the data set. And the second value was used to find pixels that 3.1 Experiment 1 contained forest fire. We changed those values because the First, we manually searched the area of the Amazon forest to algorithm is unreliable in the first iterations and low value in find the first satellite image with a forest fire. Then we used the images with fire would pick up a lot of noise in the data that satellite image and labeled 270 pixels as fire area and set. But with each iteration the algorithm became more 270 pixels as not fire area. We trained the logistic regression reliable, therefore we could pick lower probability without classifier and used it as our initial classifier in our iteration. much noise. The values are shown in the Table 1. The iteration steps in our experiment were: 1. Use a classifier and classify pixels of a random images of 3.2 Experiment 2 the Amazon rain forest. The formation of the initial classifier and the first three steps in that experiment were the same as in the first experiment. 2. We took images that the classifier would classify with a forest fire. The images were classified as containing a burned Additional steps in the experiment are: area if at least 3 % of pixels on the image were classified as 4. For the evaluation of the classifier, we first made cluster- fire. ing with the K-Means algorithm to group similar pixels on each image. The idea of that step is to use a homogeneous 3. We checked those images and manually assigned them group of pixels that probably represent the same ground into two sets (true-positive and false-positive). We checked cower. 
3.2 Experiment 2
The formation of the initial classifier and the first three steps in this experiment were the same as in the first experiment. The additional steps in the experiment, sketched in code after Table 2 below, are:

4. For the evaluation of the classifier, we first cluster the pixels of each image with the K-Means algorithm to group similar pixels. The idea of this step is to use homogeneous groups of pixels that probably represent the same ground cover. These steps are useful because we noticed that K-Means usually grouped fire areas into one or two clusters. We clustered the pixels into 6 clusters; that number was chosen because on most images it split the area in such a way that clusters with fire were separated from the unburned area, while at the same time it did not split the same ground types into too many clusters.

Figure 2: The figure shows how clustering groups different pixels. The burned area is all in one cluster.

5. Calculate the average probability of a pixel representing forest fire for each cluster.

6. To choose which pixels to add to the data-set, we once again determined two values. They define the minimum average pixel probability a cluster must have for its pixels to be added to the data set. The values used in each iteration are presented in Table 2.

Iteration   FP     TP
1           0.0    0.80
2           0.4    0.70
3           0.4    0.70
4           0.5    0.60
5           0.5    0.60
6           0.5    0.50

Table 1: The table shows the values of the minimum average probability of a pixel being burned area for false-positive images (FP) and true-positive images (TP).

Iteration   FP     TP
1           -      0.75
2           0.5    0.75
3           0.5    0.60
4           0.5    0.60
5           0.5    0.60
6           0.5    0.50

Table 2: The table shows the values of the minimum average probability in the cluster for false-positive images (FP) and true-positive images (TP).
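A minimal scikit-learn sketch of the clustering-based selection in steps 4 to 6, using placeholder pixel features and probabilities in place of a real image and classifier output.

```python
# Minimal sketch of the cluster-based selection in Experiment 2 (steps 4-6).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pixels = rng.normal(size=(64 * 64, 18))                 # one image, flattened (placeholder)
proba = rng.random(64 * 64)                             # fire probability per pixel (placeholder)

labels = KMeans(n_clusters=6, n_init=10).fit_predict(pixels)              # step 4
cluster_proba = np.array([proba[labels == c].mean() for c in range(6)])   # step 5

tp_thr = 0.75                                           # step 6: threshold from Table 2
selected = np.isin(labels, np.where(cluster_proba > tp_thr)[0])
print("pixels added:", int(selected.sum()))
```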
https://www.esa.int/Our_Activities/Observing_ that the field might be on the place that was previously the _ Earth / Copernicus / Sentinel - 2 / Satellite _ burned and the algorithm still pick that up even though it constellation. Accessed 13 August 2018. was not visible from the imagery to us. [4] ESA. https : / / sentinel . esa . int / web / sentinel / 5. CONCLUSIONS user-guides/sentinel-2-msi/processing-levels/ level-2. Accessed 13 August 2018. The approach with active learning seems promising and we can get relatively good classifiers in a short time. That way [5] Filip Koprivec, Matej Čerin, and Klemen Kenda. “Crop we could train a classifier for any classification task of satel- classification using PerceptiveSentinel”. In: (Oct. 2018). lite images. With that approach we do not need to check all [6] Rosa Lasaponara, Biagio Tucci, and Luciana Gher- images as we would if we would like to label all the data by mandi. “On the Use of Satellite Sentinel 2 Data for hand. In the end, we get a relatively good classifier. Automatic Mapping of Burnt Areas and Burn Sever- ity”. In: Sustainability 10 (Oct. 2018), p. 3889. doi: In this paper, we showed that it is possible in a relatively 10.3390/su10113889. small number of iterations to get a good and reliable clas- [7] David Roy, Luigi Boschetti, and S.N. Trigg. “Remote sifier of forest fires. Because satellite images are more ac- Sensing of Fire Severity: Assessing the Performance cessible in last years than previously it could give us almost of the Normalized Burn Ratio”. In: Geoscience and real-time insight in the Amazon rain forest. Remote Sensing Letters, IEEE 3 (Feb. 2006), pp. 112– 116. doi: 10.1109/LGRS.2005.858485. In the feature one could use other satellite sources with bet- ter time-resolution to monitor wildfires. That way we could [8] Burr Settles. “Active Learning Literature Survey”. In: get more accurate view on the spread of fires. (July 2010). [9] Sinergise. https://github.com/sentinel- hub/eo- 6. ACKNOWLEDGMENTS learn. Accessed 23 August 2019. This work was supported by the Slovenian Research Agency [10] Sinergise. https://github.com/sentinel-hub/sentinelhub-and the ICT program of the EC under projects enviroLENS py. Accessed 14 August 2018. (H2020-DT-SPACE-821918) and PerceptiveSentinel (H2020- [11] Mihai Tanase et al. “Burned Area Detection and Map- EO-776115). The authors would like to thank Sinergise for ping: Intercomparison of Sentinel-1 and Sentinel-2 Based their contribution to EO-learn library along with all help Algorithms over Tropical Africa”. In: Remote Sensing with data analysis. 12 (Jan. 2020), p. 334. doi: 10.3390/rs12020334. References [12] G. R. van der Werf et al. “Global fire emissions es- timates during 1997–2016”. In: Earth System Science [1] https : / / www . planet . com/. Accessed 1 September Data 9.2 (2017), pp. 697–720. 2020 . doi: 10 . 5194 / essd - 9- 697- 2017. url: https://essd.copernicus.org/ articles/9/697/2017/. 72 Indeks avtorjev / Author index Andrej Bauer ................................................................................................................................................................................ 53 Bradeško Luka ............................................................................................................................................................................. 
Brank Janez ..... 53
Čerin Matej ..... 69
Cimperman Miha ..... 65
Eftimov Tome ..... 21
Erjavec Tomaž ..... 5, 17
Evkoski Bojan ..... 41
Grobelnik Marko ..... 37, 53
Jacobs Tobias ..... 65
Jelenčič Jakob ..... 61
Jovanovska Lidija ..... 45
Kenda Klemen ..... 57, 69
Koroušič Seljak Barbara ..... 21
Kralj Novak Petra ..... 41
Kurbašić Azur ..... 65
Lavrač Nada ..... 13
Ljubešić Nikola ..... 41
Luka Stopar ..... 53
Massri M.Besher ..... 25, 53
Mileva Boshkoska Mileva ..... 49
Mladenić Dunja ..... 5, 9, 17, 21, 25, 33, 37
Mladenić Grobelnik Adrian ..... 37
Mozetič Igor ..... 41
Novak Erik ..... 29
Panov Panče ..... 45, 49
Peternelj Jože ..... 57
Petrželková Nela ..... 13
Pita Costa Joao ..... 53
Popovski Gorjan ..... 21
Šircelj Beno ..... 57
Sittar Abdul ..... 5
Škrlj Blaž ..... 13
Stopar Luka ..... 65
Swati ..... 17, 33
Zajec Patrik ..... 9
Žunič Gregor ..... 29
Zupančič Peter ..... 49