Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA Zvezek C Proceedings of the 23rd International Multiconference INFORMATION SOCIETY Volume C http://is.ijs.si Odkrivanje znanja in podatkovna skladišča • SiKDD 2020 Data Mining and Data Warehouses • SiKDD 2020 Uredili / Edited by Dunja Mladenić, Marko Grobelnik 5. oktober 2020 / 5 October 2020 Ljubljana, Slovenia Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2020 Zvezek C Proceedings of the 23rd International Multiconference INFORMATION SOCIETY – IS 2020 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 5. oktober 2020 / 5 October 2020 Ljubljana, Slovenia Urednika: Dunja Mladenić Department for Artificial Intelligence Jožef Stefan Institute, Ljubljana Marko Grobelnik Department for Artificial Intelligence Jožef Stefan Institute, Ljubljana Založnik: Institut »Jožef Stefan«, Ljubljana Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak Oblikovanje naslovnice: Vesna Lasič Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety Ljubljana, oktober 2020 Informacijska družba ISSN 2630-371X Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani COBISS.SI-ID=33077251 ISBN 978-961-264-192-4 (epub) ISBN 978-961-264-193-1 (pdf) PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2020 Triindvajseta multikonferenca Informacijska družba (http://is.ijs.si) je doživela polovično zmanjšanje zaradi korone. Zahvala za preživetje gre tistim predsednikom konferenc, ki so se kljub prvi pandemiji modernega sveta pogumno odločili, da bodo izpeljali konferenco na svojem področju. Korona pa skoraj v ničemer ni omejila neverjetne rasti IKT-ja, informacijske družbe, umetne inteligence in znanosti nasploh, ampak nasprotno – kar naenkrat je bilo večino aktivnosti potrebno opraviti elektronsko in IKT so dokazale, da je elektronsko marsikdaj celo bolje kot fizično. Po drugi strani pa se je pospešil razpad družbenih vrednot, zaupanje v znanost in razvoj. Celo Flynnov učinek – merjenje IQ na svetovni populaciji – kaže, da ljudje ne postajajo čedalje bolj pametni. Nasprotno - čedalje več ljudi verjame, da je Zemlja ploščata, da bo cepivo za korono škodljivo, ali da je korona škodljiva kot navadna gripa (v resnici je desetkrat bolj). Razkorak med rastočim znanjem in vraževerjem se povečuje. Letos smo v multikonferenco povezali osem odličnih neodvisnih konferenc. Zajema okoli 160 večinoma spletnih predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic in 300 obiskovalcev. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad – seveda večinoma preko spleta. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 44-letno tradicijo odlične znanstvene revije.
Multikonferenco Informacijska družba 2020 sestavljajo naslednje samostojne konference: • Etika in stroka • Interakcija človek računalnik v informacijski družbi • Izkopavanje znanja in podatkovna skladišča • Kognitivna znanost • Ljudje in okolje • Mednarodna konferenca o prenosu tehnologij • Slovenska konferenca o umetni inteligenci • Vzgoja in izobraževanje v informacijski družbi Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju. V 2020 bomo petnajstič podelili nagrado za življenjske dosežke v čast Donalda Michieja in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejela prof. dr. Lidija Zadnik Stirn. Priznanje za dosežek leta pripada Programskemu svetu tekmovanja ACM Bober. Podeljujemo tudi nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono je prejela »Neodzivnost pri razvoju elektronskega zdravstvenega kartona«, jagodo pa Laboratorij za bioinformatiko, Fakulteta za računalništvo in informatiko, Univerza v Ljubljani. Čestitke nagrajencem! Mojca Ciglarič, predsednik programskega odbora Matjaž Gams, predsednik organizacijskega odbora FOREWORD INFORMATION SOCIETY 2020 The 23rd Information Society Multiconference (http://is.ijs.si) was halved due to COVID-19. The multiconference survived thanks to the conference presidents who bravely decided to continue with their conferences despite the first pandemic of the modern era. The COVID-19 pandemic did not slow the remarkable growth of ICT, the information society, artificial intelligence and science overall; quite the contrary – suddenly most activities had to be performed through ICT, and this often proved to be even more efficient than the old physical way. But COVID-19 did accelerate the decline of societal values and of trust in science and progress. Even the Flynn effect – measuring IQ all over the world – indicates that the average Earthling is becoming less smart and knowledgeable. Contrary to the general belief of scientists, the number of people believing that the Earth is flat is growing. A large number of people are wary of the COVID-19 vaccine and consider the consequences of COVID-19 to be similar to those of a common flu, despite it being empirically observed to be ten times worse. The Multiconference is running parallel sessions with around 160 presentations of scientific papers at eight conferences, many round tables, workshops and award ceremonies, and 300 attendees. Selected papers will be published in the Informatica journal with its 44-year tradition of excellent research publishing. The Information Society 2020 Multiconference consists of the following conferences: • Cognitive Science • Data Mining and Data Warehouses • Education in Information Society • Human-Computer Interaction in Information Society • International Technology Transfer Conference • People and Environment • Professional Ethics • Slovenian Conference on Artificial Intelligence The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e.
the Slovenian chapter of the ACM, SLAIS, DKZ and the second national engineering academy, the Slovenian Engineering Academy. In the name of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contribution and their interest in this event, and the reviewers for their thorough reviews. For the fifteenth year, the award for life-long outstanding contributions will be presented in memory of Donald Michie and Alan Turing. The Michie-Turing award was given to Prof. Dr. Lidija Zadnik Stirn for her life-long outstanding contribution to the development and promotion of information society in our country. In addition, a recognition for current achievements was awarded to the Program Council of the competition ACM Bober. The information lemon goes to the “Unresponsiveness in the development of the electronic health record”, and the information strawberry to the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana. Congratulations! Mojca Ciglarič, Programme Committee Chair Matjaž Gams, Organizing Committee Chair ii KONFERENČNI ODBORI CONFERENCE COMMITTEES International Programme Committee Organizing Committee Vladimir Bajic, South Africa Matjaž Gams, chair Heiner Benking, Germany Mitja Luštrek Se Woo Cheon, South Korea Lana Zemljak Howie Firth, UK Vesna Koricki Olga Fomichova, Russia Marjetka Šprah Vladimir Fomichov, Russia Mitja Lasič Vesna Hljuz Dobric, Croatia Blaž Mahnič Alfred Inselberg, Israel Jani Bizjak Jay Liebowitz, USA Tine Kolenik Huan Liu, Singapore Henz Martin, Germany Marcin Paprzycki, USA Claude Sammut, Australia Jiri Wiedermann, Czech Republic Xindong Wu, USA Yiming Ye, USA Ning Zhong, USA Wray Buntine, Australia Bezalel Gavish, USA Gal A. Kaminka, Israel Mike Bain, Australia Michela Milano, Italy Derong Liu, Chicago, USA prof. Toby Walsh, Australia Programme Committee Mojca Ciglarič, chair Andrej Gams Vladislav Rajkovič Bojan Orel, co-chair Matjaž Gams Grega Repovš Franc Solina, Mitja Luštrek Ivan Rozman Viljan Mahnič, Marko Grobelnik Niko Schlamberger Cene Bavec, Nikola Guid Špela Stres Tomaž Kalin, Marjan Heričko Stanko Strmčnik Jozsef Györkös, Borka Jerman Blažič Džonova Jurij Šilc Tadej Bajd Gorazd Kandus Jurij Tasič Jaroslav Berce Urban Kordeš Denis Trček Mojca Bernik Marjan Krisper Andrej Ule Marko Bohanec Andrej Kuščer Tanja Urbančič Ivan Bratko Jadran Lenarčič Boštjan Vilfan Andrej Brodnik Borut Likar Baldomir Zajc Dušan Caf Janez Malačič Blaž Zupan Saša Divjak Olga Markič Boris Žemva Tomaž Erjavec Dunja Mladenič Leon Žlajpah Bogdan Filipič Franc Novak iii iv KAZALO / TABLE OF CONTENTS Odkrivanje znanja in podatkovna skladišča (SiKDD) / Data Mining and Data Warehouses (SiKDD) ................ 1 PREDGOVOR / FOREWORD ................................................................................................................................. 3 PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ..................................................................................... 4 A Dataset for Information Spreading over the News / Sittar Abdul, Mladenić Dunja, Erjavec Tomaž ................... 5 Learning to fill the slots from multiple perspectives / Zajec Patrik, Mladenić Dunja ............................................... 9 Knowledge graph aware text classification / Petrželková Nela, Škrlj Blaž, Lavrač Nada .................................... 
13 EveOut: Reproducible Event Dataset for Studying and Analyzing the Complex Event-Outlet Relationship / Swati, Erjavec Tomaž, Mladenić Dunja ..... 17 Ontology alignment using Named-Entity Recognition methods in the domain of food / Popovski Gorjan, Eftimov Tome, Mladenić Dunja, Koroušič Seljak Barbara ..... 21 Extracting structured metadata from multilingual textual descriptions in the domain of silk heritage / Massri M.Besher, Mladenić Dunja ..... 25 Hierarchical classification of educational resources / Žunič Gregor, Novak Erik ..... 29 Are You Following the Right News-Outlet? A Machine Learning based approach to outlet prediction / Swati, Mladenić Dunja ..... 33 MultiCOMET – Multilingual Commonsense Description / Mladenić Grobelnik Adrian, Mladenić Dunja, Grobelnik Marko ..... 37 A Slovenian Retweet Network 2018-2020 / Evkoski Bojan, Mozetič Igor, Ljubešić Nikola, Kralj Novak Petra ..... 41 Toward improved semantic annotation of food and nutrition data / Jovanovska Lidija, Panov Panče ..... 45 Absenteeism prediction from timesheet data: A case study / Zupančič Peter, Mileva Boshkoska Biljana, Panov Panče ..... 49 Monitoring COVID-19 through text mining and visualization / Massri M.Besher, Pita Costa Joao, Andrej Bauer, Grobelnik Marko, Brank Janez, Luka Stopar ..... 53 Usage of Incremental Learning in Land-Cover Classification / Peternelj Jože, Šircelj Beno, Kenda Klemen ..... 57 Predicting bitcoin trend change using tweets / Jelenčič Jakob ..... 61 Large-Scale Cargo Distribution / Stopar Luka, Bradeško Luka, Jacobs Tobias, Kurbašić Azur, Cimperman Miha ..... 65 Amazon forest fire detection with an active learning approach / Čerin Matej, Kenda Klemen ..... 69 Indeks avtorjev / Author index ..... 73 Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2020 Zvezek C Proceedings of the 23rd International Multiconference INFORMATION SOCIETY – IS 2020 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 5. oktober 2020 / 5 October 2020 Ljubljana, Slovenia PREDGOVOR Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale.
Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskem procesiranju ampak tudi analitskim vpogledom v podatke – pojavilo se je t.i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line-Analytical-Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore in na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz. z drugimi besedami to, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve. FOREWORD Data driven technologies have significantly progressed after mid 90’s. The first phases were mainly focused on storing and efficiently accessing the data, resulted in the development of industry tools for managing large databases, related standards, supporting querying languages, etc. After the initial period, when the data storage was not a primary problem anymore, the development progressed towards analytical functionalities on how to extract added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line-Analytical-Processing entered as a usual part of a company’s information system portfolio, requiring from the user to set well defined questions about the aggregated views to the data. Data Mining is a technology developed after year 2000, offering automatic data analysis trying to obtain new discoveries from the existing data and enabling a user new insights in the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis, Data, Text and Multimedia Mining, Semantic Technologies, Link Detection and Link Analysis, Social Network Analysis, Data Warehouses. 3 PROGRAMSKI ODBOR / PROGRAMME COMMITTEE Janez Brank, Department of Artificial Intelligence, Jožef Stefan Institute, Ljubljana Marko Grobelnik, , Department of Artificial Intelligence, Jožef Stefan Institute, Ljubljana Branko Kavšek, University of Primorska, Koper Aljaž Košmerlj, Qlector, Ljubljana Dunja Mladenić, Department of Artificial Intelligence, Jožef Stefan Institute, Ljubljana Inna Novalija, Department of Artificial Intelligence, Jožef Stefan Institute, Ljubljana Luka Stopar, Sportradar, Ljubljana 4 A Dataset for Information Spreading over the News Abdul Sittar Dunja Mladenić Tomaž Erjavec Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia abdul.sittar@ijs.si dunja.mladenic@ijs.si tomaz.erjavec@ijs.si ABSTRACT Table 1: List of events Analysing the spread of information related to a specific event in Selected events Other events (ordered by popularity) the news has many potential applications. 
Consequently, various Football Basketball, Baseball, Boxing, Tennis, Cycling systems have been developed to facilitate the analysis of infor- Earthquake Floods, Tsunamis, Landslides, Hurricane, Volcanic eruptions mation spreading, such as detection of disease propagation and Global warming CO2 emissions, Chemical consumption identification of the spreading of fake news through social media. The paper proposes a method for tracking information spread over news articles. It works by comparing subsequent articles via limited availability of datasets containing news text and metadata cosine similarity and applying a threshold to classify into three including time, place, source and other relevant information. classes: “Information-Propagated”, “Unsure” and “Information- When a piece of information starts spreading, it implicitly not-Propagated”. There are several open challenges in the process raises questions such as: of discerning information propagation, among them the lack of (1) How far does the information in the form of news reach resources for training and evaluation. This paper describes the out to the public? process of compiling corpus from the Event Registry global me- (2) Does the content of news remain the same or changes to dia monitoring system. We focus on information spreading in a certain extent? three domains: sports (i.e. the FIFA World Cup), natural disas- (3) Do the cultural values impact the information especially ters (i.e. earthquakes), and climate change (i.e. global warming). when the same news will get translated in other languages? This corpus is a valuable addition to currently available dataset This paper presents a corpus that focuses on information to examine the spreading of information about various kind of spreading over news and that hopes to answer some of the above events. questions (This corpus is published as an online resource at ). We present the use of a news repository to produce a corpus KEYWORDS and then analyze information propagation. We present a novel Datasets, Information propagation, News articles methodology for automatically assembling the corpus for this problem and validate it in three different domains. We focused 1 INTRODUCTION on a combination of rich- and low resource European languages, Information spreading has received significant attention due to in particular English, Portuguese, German, Spanish, and Slovene. its various market applications such as advertisement. did the in- Three different types of events are targeted in the data collection formation about a specific product reach to the public of a specific procedure to potentially involve different information spreading region? This could be one of the significant research questions. behaviors in our society. These events are sports (FIFA World Research in this area considers influential factors in the process Cup, 2,695 articles), natural disasters (earthquakes, 3,194 articles), of information spreading such as the economic condition of a and climate change (global warming, 1,945 articles). The three specific area related to how textual or visual content is helping to types of events were chosen based on their popularity and diver- advertise a product. Information spreading analytics can also be sity. A list of sub-events was observed from top websites related used in shaping policies, e.g., in media companies to understand to the three events and we selected those which were the most if there is a need to improve the content before publishing it. 
popular in the countries with the selected national languages. For Health organizations may be interested to know the patterns of sports, a list of countries with their national sports was fetched spreading of a cure for a certain disease. Environmental scien- and then filtered for national language1, 2. Based on popularity, tists are perhaps attentive to see whether spread of news about we selected the FIFA world cup. Similarly, for natural disasters, climate changes inside the country is similar to what is being lists of natural disasters were collected by country taking the na- reported internationally. tional language into account, for instance, for Slovenia we looked Domain-specific gaps in information spreading are ubiquitous, for this country in the natural disaster category on Wikipedia3. and may exist due to economic conditions, political factors, or Earthquakes4 and global warming5 were found to be the most linguistic, geographical, time-zone, cultural and other barriers. prevalent, thus a dataset for each was collected. Table 1 shows the These factors potentially contribute to obstructing the flow of selected events and other related events ordered by prevalence. local as well as international news. We believe that there is a lack The paper makes the following contributions to science: of research studies which examine, identify and uncover the rea- (1) a novel methodology to collect a domain-specific corpus sons for barriers in information spreading. Additionally, there is from news repository; (2) semantic similarity between news articles; Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and 1http://www.quickgs.com/countries-and-their-national-sports/ the full citation on the first page. Copyrights for third-party components of this 2https://www.topendsports.com/ work must be honored. For all other uses, contact the owner/author(s). 3https://en.wikipedia.org/wiki/Category:Natural_disasters_in_Slovenia Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia 4https://en.wikipedia.org/wiki/List_of_earthquakes_in_2020 © 2020 Copyright held by the owner/author(s). 5https://www.theguardian.com/environment/2011/apr/21/countries-responsible- climate-change, 6 5 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Abdul Sittar, Dunja Mladenić, and Tomaž Erjavec (3) an annotated dataset encoding the level of information spreading from an article. The rest of the paper is organized as follows: in Section 2 we discuss prior work about information spreading; in Section 3 we describe the data collection methodology; Section 4 describes semantic similarity and dataset annotation; and Section 5 gives the conclusions. 2 RELATED WORK Information spreading is prevalent in our society. It plays a vi- tal part in tasks that encompass the spreading of innovations [9], effects in marketing [6], and opinion spreading [4]. News spreading provides information to consumers that can be used for decision making and potentially contribute to shaping na- tional and international policies. There are several types of media Figure 1: Data collection methodology involved, such as print media, broadcast, and internet media. In- ternet is considered as a building block for connecting individuals worldwide, while news reflects current significant events for peo- ple [7]. 
Apart from news, online social media proved to be a remarkable alternative to support information spreading in an emergency [8, 5]. Social connection plays a vital role in news spreading. Especially the structure of network reflecting who is connected to whom, crucially increases the proportion of in- formation spreading. Network structure analysis comes with a hypothesis related to the strength of the connections, namely that information will spread further in a situation where there exist many weak connections rather than clusters of strong [2]. While, in general, there are not many dataset that would help in modelling information spreading, there are some corpora for detecting the spreading of information about diseases [3] and fake news in social media [10]. There is currently no multilingual dataset of news articles for analysis of information propagation composed from a variety of event-centric information such as Figure 2: Articles with metadata sports, natural disasters, and climate changes. This provides ad- ditional motivation for our work. Table 2: Statistics about dataset 3 DATA COLLECTION METHODOLOGY Dataset Domain Event type Articles per Language Total Articles Eng Spa Ger Slv Por In order to collect news originating from different sources, in 1 Sports FIFA World Cup 983 762 711 10 216 2682 2 Natural Disaster Earthquake 941 999 937 19 251 3147 different languages, and targeting diverse events, we used Event 3 Climate Changes Global Warming 996 298 545 8 97 1944 Registry, a platform that identifies events by collecting related articles written in different languages from tens of thousands of news sources [9]. Using Event Registry APIs 7, we fetched a list This service uses a page-rank based method to identify a coherent of articles about each event in the following languages: English, set of relevant concepts from Wikipedia [1]. We retrieved a list Spanish, German, Portuguese, and Slovenian. Figure 1 shows the of Wikipedia concepts for each article. After representing each data collection process. article with a list of Wikipedia concepts, the tf-idf score was com- Each article was parsed from the JSON response and stored in puted using the popular machine learning library Scikit-Learn9. CSV files. Each article was connected with the available list of Using the same library, cosine similarity was calculated between relevant information such as the language of the article, event tf-idf representation of news articles across all five languages. type, publisher, title, date, and time. Figure 2 shows the metadata In the process of computing similarity between the articles, for of articles. each article we calculated its cosine similarity to all other articles The number of collected articles in each domain varies consid- and stored the results in a CSV file. The results were then sorted erably, and also varies across the languages within each domain. based on the publishing time of articles and we kept only the cal- Table 2 shows statistics about each dataset. culations of similarity to articles that are published later that the article in hands. Since we are interested in information propaga- 4 SEMANTIC SIMILARITY BETWEEN NEWS tion, we do not need to compare an article to those articles which ARTICLES have been published before it. As a result, we had a multiple similarity score for each article where each score show the simi- We have represented the cross-lingual news articles by monolin- larity with other articles. 
Cosine similarity varies between zero gual (English) Wikipedia concepts using the Wikifier service8. and one, zero meaning no similarity and one meaning maximum 7https://github.com/EventRegistry/event-registry-python/blob/master/ similarity, i.e., a duplicate article. eventregistry/examples/QueryArticlesExamples.py 8http://wikifier.org/info.html 9https://scikit-learn.org/stable/ 6 A Dataset for Information Spreading over the News Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Table 3: Selected articles for evaluation Domains Percentage of correctly labelled pairs Global Warming 100% Earthquake 93% FIFA World Cup 100 % for Portuguese, German, Slovene and Spanish to translate them into English. Evaluation results shown that the annotation was significantly related to information spreading. Articles in the "Information- Propagated" class show that most articles were an exact or para- phrased copy of each other, with some articles published within few hours after each other. Articles in the "Unsure" class were Figure 3: Class distribution for all domains typically also relevant to the event but involved extra and dif- ferent discussions. Lastly, in the third class "Information-Not- Propagated", articles involved only keywords related to event but discussion was about other topics. Moreover, here the gap in the 4.1 Dataset annotations publishing time was quite large. The results of the semantic similarity calculation were in the form of a table where rows shown the list of articles and columns shown the corresponding similarity score in the range 0..1 with 5 CONCLUSIONS all the other articles. This similarity score was calculated using This paper proposed a methodology and explained the process cosine between TF-IDF representation of news articles (See Sec- of data collection from a news repository to provide a corpus tion ??). First, we excluded those articles which had scored 1.0, for event-centric information propagation between news articles. as they were considered as a copy of the article. We then, for This corpus covers three domains and each dataset corresponds each article, chose an article which had the highest similarity to one event type (FIFA World Cup, Earthquake, and Global score to it from the list of all articles. After performing this step, Warming). The corpus is available to others for the evaluation we had one similarity score for each article which shows either of techniques for information spreading as it allows the analysis that the information spread to a certain extent (if >0) or not (if of cross-lingual news articles published by different publishers 0). To decide about the class label whether the information is located geographically in different places. spreading or not, we divided the scores into three intervals. The In the future, we plan to add more attributes to each dataset. first is Similarity ≥ 0.7, the second is 0.7 > Similarity ≥ 0.4, For instance, for now, we only know the publisher of a news and the third is Similarity < 0.4. Articles that have scores in article but in the future, we would like to include the publisher the first interval were labeled as "Information-Propagated". The profile and the economic condition of a country from where the second interval was considered as unclear whether the informa- information is published. Also, we plan to apply and evaluate tion from the article propagated or not such articles were labeled different techniques to analysis information propagation barriers. as "Unsure". 
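For illustration, the similarity-and-labelling step described in Section 4.1 can be sketched as follows. The sketch assumes each article has already been mapped to a list of English Wikipedia concept labels by the Wikifier service; the article records, timestamps and concept lists below are invented for the example, and the helper function is illustrative rather than part of the authors' released resource.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative input: one record per article, already wikified.
# "concepts" stands for the English Wikipedia concept labels returned by
# the Wikifier service, "time" for the publishing timestamp.
articles = [
    {"id": "a1", "time": "2018-06-14T16:00", "concepts": ["FIFA World Cup", "Russia", "Football"]},
    {"id": "a2", "time": "2018-06-14T18:30", "concepts": ["FIFA World Cup", "Football", "Moscow"]},
    {"id": "a3", "time": "2018-06-15T09:00", "concepts": ["Earthquake", "Indonesia"]},
]

# Represent every article as a tf-idf vector over its concept labels.
docs = [" ".join(a["concepts"]) for a in articles]
tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

def label_article(i):
    """Assign one of the three spreading labels to article i."""
    # Compare only with articles published *after* article i and drop
    # exact copies (similarity of 1.0), as described in Section 4.1.
    later = [j for j in range(len(articles))
             if articles[j]["time"] > articles[i]["time"] and sims[i, j] < 1.0]
    if not later:
        return None
    best = max(sims[i, j] for j in later)
    if best >= 0.7:
        return "Information-Propagated"
    if best >= 0.4:
        return "Unsure"
    return "Information-not-Propagated"

for i, article in enumerate(articles):
    print(article["id"], label_article(i))
```

Restricting the comparison to later-published articles keeps the maximum similarity score aligned with the direction in which information can actually propagate.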
The lowest interval was considered as a signal for no propagation and labeled "Information-not-Propagated". For 6 ACKNOWLEDGEMENTS instance, low similarity can be of an article about a sports ground which mentions the population of the city and another article This work was supported by the Slovenian Research Agency and that discusses the population itself. We have manually examined the project leading to this publication has received funding from concepts of articles in each class. Figure 3 shows the distribu-the European Union’s Horizon 2020 research and innovation tion of class labels in FIFA World Cup, Earthquake, and Global programme under the Marie Skłodowska-Curie grant agreement Warming dataset respectively. No 812997. REFERENCES 4.2 Evaluation of dataset [1] Janez Brank, Gregor Leban, and Marko Grobelnik. 2017. Each article was annotated with a label based upon the similarity Annotating documents with relevant wikipedia concepts. score threshold of each article with other articles (See Section In Proceedings of Slovenian KDD Conference on Data Mining 4.1). For evaluation of the dataset we have checked the content of and Data Warehouses (SiKDD). the corresponding articles which were responsible for a specific [2] Damon Centola. 2010. The spread of behavior in an online class label. We performed the evaluation of labelling by manually social network experiment. science, 329, 5996, 1194–1197. inspecting a subset of pairs of articles. If a pair, for instance, were [3] Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. labelled as "Information-Propagated" then two articles should Covid-19: the first public coronavirus twitter dataset. arXiv have text discussing more or less the same event, both in mono- preprint arXiv:2003.07372. and cross-lingual settings. [4] David Liben-Nowell and Jon Kleinberg. 2008. Tracing in- We have randomly chosen 10 articles with their corresponding formation flow on a global scale using internet chain-letter articles considering all languages in each class and in each dataset. data. Proceedings of the national academy of sciences, 105, In this way, we have manually checked 180 articles. Table 3 shows 12, 4633–4638. these pairs of articles for evaluation in each dataset. We scanned each article manually for all languages, using Google Translator 7 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Abdul Sittar, Dunja Mladenić, and Tomaž Erjavec [5] Kees Nieuwenhuis. 2007. Information systems for crisis crisis informatics: study of 2013 oklahoma tornado. Trans- response and management. In International Workshop on portation Research Record, 2459, 1, 110–118. Mobile Information Technology for Emergency Response. [9] Duncan J Watts and Peter Sheridan Dodds. 2007. Influen- Springer, 1–8. tials, networks, and public opinion formation. Journal of [6] Everett M Rogers. 2010. Diffusion of innovations. Simon consumer research, 34, 4, 441–458. and Schuster. [10] Zilong Zhao, Jichang Zhao, Yukie Sano, Orr Levy, Hideki [7] Sandeep Suntwal, Susan Brown, and Mark Patton. 2020. Takayasu, Misako Takayasu, Daqing Li, Junjie Wu, and How does information spread? an exploratory study of Shlomo Havlin. 2020. Fake news propagates differently true and fake news. In Proceedings of the 53rd Hawaii In- from real news even at early stages of spreading. EPJ Data ternational Conference on System Sciences. Science, 9, 1, 7. [8] Satish V Ukkusuri, Xianyuan Zhan, Arif Mohaimin Sadri, and Qing Ye. 2014. 
Use of social media data to explore 8 Learning to fill the slots from multiple perspectives Patrik Zajec Dunja Mladenič patrik.zajec@ijs.si dunja.mladenic@ijs.si Jožef Stefan Institute and Jožef Stefan International Jožef Stefan Institute and Jožef Stefan International Postgraduate School Postgraduate School Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT Furthermore, since the set of topics is not fixed and could expand We present an approach to train the slot-filling system in a fully over time, such a slot filling system should be able to adapt quickly automatic, semi-supervised setting on a limited domain of events to fill new slots and ideally should not be limited to the English from Wikipedia using the summaries in different languages. We language. use the multiple languages and the different topics of the events We believe that annotation work can be greatly minimized to provide several alternative views on the data. Our experiments if we rely on our limited domain to identify and annotate only show how such an approach can be used to train the multilingual informative examples and use the additional assumptions to prop- slot-filling system and increase the performance of a monolingual agate these labels. We also believe that simultaneous training of system. the system on multiple topics can be advantageous, as we can introduce additional supervision on the common slots and use KEYWORDS distinct slots as a source of negative examples. In this work we use Wikipedia and Wikidata [9] as the source information extraction, slot filling, machine learning, probabilis-of data. We treat the Wikidata entities that have the point-in-time tic soft logic property specified as events and summary sections of Wikipedia articles about the entity in different languages as news articles. 1 INTRODUCTION Each entity belongs to a single topic and we adopt the subset of This paper is addressing the slot filling task that aims to extract topic-specific properties as slot keys. An automatic exact match- the structured knowledge from a given set of documents using a ing of such values from Wikidata with named entities from model trained for a specific domain and the associated slots. For Wikipedia articles is rarely successful. We use the successful example, within a news article reporting on an earthquake, the and unambiguous matches as a set of labeled seed examples. task is to detect the earthquake’s magnitude, the number of peo- We formulate the task as a semi-supervised learning problem ple injured, the location of the epicentre and other information. [8] where the set of base learners is trained iteratively, starting We refer to those as a set of slot keys or slots, to their exact values with a small seed set of labeled examples and a larger set of unla-as a slot values and to the named entities from the documents beled examples. In each iteration, the most confident predictions corresponding to those values as target entities. on the examples from unlabeled set are used to increase the train- Slot filling is closely related to the task of relation extraction [1] ing set by assigning pseudo-labels. We introduce an additional and can be seen as a kind of unary relation extraction. Both tasks component which combines the confidences of multiple base can be formulated as classification and are usually approached learners for each example. 
by first training a classifier with a sentence and tagged entities at To the best of our knowledge, we are the first to use the limited the input and the prediction of relation or slot key as the output. domain of news events, which allows the additional assumptions, As there is a large number of relations between entities that such as the connection between slots of different topics and the we might be interested in detecting, there is also a large num- redundancy of reporting in multiple languages, to first train and ber of slot keys we seek the slot value for. In order to avoid the later boost the performance of a slot-filling system. resource-intensive process of annotating a large number of exam- The contributions of this paper are the following: ples for each possible slot/relation and to increase the flexibility • we combine the data from Wikidata and Wikipedia to of training procedures beyond the straight-forward supervised setup a learning and evaluation scenario that mimics the learning, many alternative approaches have been proposed, such learning on news events and articles, as bootstrapping [4], distant supervision [6] and self supervision • we demonstrate how simultaneous learning on multiple [5]. topics and languages can be used not only to train the As stated both tasks can be performed for different types of multilingual slot-filling system, but to also improve the documents. We limit our focus to news events on multiple topics performance of a monolingual system, (such as natural disasters and terrorist attacks), taking the articles • we show how an inference component can be used to com- reporting about events as the documents. Since the number of bine predictions from multiple base learners to improve news topics is large, and consequently so is the number of slots, the pseudo-labeling step of the semi-supervised learning we would like to minimize the need for manual annotations. process. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or 2 METHODOLOGY distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this 2.1 Problem Definition work must be honored. For all other uses, contact the owner/author(s). Given a collection of topics T (such as earthquakes, terrorist Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia attacks, etc.), where each topic 𝑡 has its own set of slot keys S , © 2020 Copyright held by the owner/author(s). 𝑡 the goal is to automatically extract values from the relevant texts 9 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Patrik Zajec and Dunja Mladenič to fill in the slots. For example, the members of S of the XLM Roberta model [3] using the implementation from 𝑒𝑎𝑟 𝑡 ℎ𝑞𝑢𝑎𝑘𝑒𝑠 are number of injured, magnitude and location. For each topic the Transformers 2 library. Note that the representation of each 𝑡 there is a set of events E , each of which took place at some entity remains fixed throughout the learning process because we 𝑡 point in time and was reported by several documents in different have found that the representation is expressive enough for our languages. purposes and it speeds up the training between iterations. 
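A minimal sketch of the masked-entity representation from Section 2.3 is given below, using the Hugging Face Transformers implementation of XLM-RoBERTa. The "base" checkpoint, the example sentence and the mean pooling over mask positions are assumptions for the sketch; the exact variant and pooling used by the authors are not specified in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Pre-trained multilingual encoder (the paper uses XLM-RoBERTa via
# Transformers; the "base" checkpoint here is an assumption).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def entity_representation(sentence: str, entity: str) -> torch.Tensor:
    """Replace the entity mention with the mask token and return the
    encoder's hidden state at the mask position(s)."""
    masked = sentence.replace(entity, tokenizer.mask_token, 1)
    enc = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    mask_positions = enc["input_ids"][0] == tokenizer.mask_token_id
    # Average over mask positions (the pooling choice is an assumption).
    return hidden[mask_positions].mean(dim=0)               # (dim,)

# Hypothetical sentence; the entity "370" is hidden from the encoder,
# so only its context is captured in the vector.
vec = entity_representation(
    "The earthquake injured 370 people in Mexico City.", "370")
print(vec.shape)   # e.g. torch.Size([768]) for the base model
```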
Also The values of all or at least most slot keys (or slots) from S are note that since the entity is masked, it is not directly captured in 𝑑 represented in each of the documents as named entities, which the representation. we also refer to as target entities. We say most of the slots, since it is possible that an earthquake caused no casualties. It is also 2.4 Selecting the topics possible that some of the documents do not report about the Our assumption is that training the system to detect the slots on number of casualties as it may be too early to know if there were multiple topics simultaneously can provide additional benefits. any. In addition, the documents might contain different values for For two topics ′ 𝑡 and 𝑡 there is potentially a set of common slots the same slot key, as for example, the reported number of people and a set of topic-specific slots. injured by an earthquake can increase over time. There may also For slot ′ 𝑠 which appears in both topics the base learner trained be several different mentions of the same slot in a particular on ′ 𝑡 can be used to make predictions for examples from 𝑡 . By document, as for example one magnitude might refer to an actual combining predictions from learners trained on ′ 𝑡 and 𝑡 , we could earthquake that the event is about, while the other magnitude get a better estimate of the true labels of the examples. might refer to an earthquake that struck the same region years For the slot 𝑠, which is specific to the topic 𝑡 , all examples from ago. the topic ′ 𝑡 can be used as negative examples. Selecting reliable Our task is actually a two step process. In the first step, the negative examples from the same topic is not easy, as we may goal is to train a system capable of identifying the target entities inadvertently mislabel some of the positive examples. for a set of slot keys from the context, which in our case is limited to a single sentence. Such a system is not yet able to recognise 2.5 Using multiple languages the true value for a given slot if there are multiple different candidates, such as selecting the actual magnitude from several Articles from different languages offer in some ways different reported magnitude values. The goal of the second step is to views on the same event. The slot values we are trying to detect assign a single correct value to each of the slot keys. We assume should appear in all the articles, as they are highly relevant to that inferring the correctness of a value is a document-level task, the event. since it requires a broader context. Solving the first step is a kind The values for slots such as location and time should be con- of prerequisite for the second step, so we focus on it in this paper. sistent across all articles, whereas this does not necessarily apply to other slots such as the number of injured or the number of 2.2 Overview of the proposed method casualties. Matching such values across the articles is therefore not a trivial task, and although a variant of soft matching can be The system is trained iteratively and starts with a noisy seed set, performed, we leave it for the future work and limit our focus which grows larger with pseudo-labeled positive and negative only on the values that can be matched unambiguously. examples. Each of the base learners is trained on the set of la- We can combine the predictions of several language-specific beled examples from the topic (or multiple topics) and language base learners into a single pseudo-label for entities that can be assigned to it. 
The prediction probabilities for each of the unla- matched across the articles. beled examples are determined by combining the probabilities of all base learners. This is done either by averaging or by feeding 2.6 Assigning pseudo labels the probabilities as approximations of the true labels into the component, which attempts to derive the true value for each ex- Each iteration starts with a set of labeled examples 𝑋 , a set of 𝑙 ample and the error rates for each learner [7]. The examples with unlabeled examples 𝑋 and a set of base learners trained on 𝑋 . 𝑢 𝑙 probabilities above or below the specific thresholds are given a Base learners are simple logistic regression classifiers that use pseudo-label and added to the training set. vector representations of entities as features and classify each The seed set is constructed by matching the slot values ob- example 𝑥 as a target entity for the slot key 𝑠 or not. 𝑠 tained from Wikidata with named entities found in Wikipedia Each base learner ¯ 𝑓 is a binary classifier trained on the la- 𝑡 ,𝑙 articles for each event. There are only a handful of unambigu- beled data for the slot key 𝑠 from the topic 𝑡 and the language ous matches for each slot key, which are labeled as a positive 𝑙 . Such base learners are topic-specific as they are trained on a examples, while the negative examples are all other named en- single topic 𝑠 𝑡 . Base learners ¯ 𝑓 are trained on the labeled data 𝑙 tities from the articles in which they appeared. Figure 1 shows for the slot key 𝑠 from the language 𝑙 and all the topics with the a high-level overview of the proposed methodology. The entire slot key 𝑠. Such base learners are shared across topics, as they workflow is repeated in each iteration until no new examples are consider the examples from all the topics as a single training set. selected for pseudo-labelling. We use the classification probability of the positive class instead of hard labels, ¯𝑠 ¯𝑠 𝑓 (𝑥 ), 𝑓 (𝑥 ) ∈ [0, 1]. 𝑡 ,𝑙 𝑙 2.3 Representing the entities For each entity 𝑥 from a news article with the language 𝑙 Each named entity together with its context forms a single ex- reporting on the event 𝑒 from the topic 𝑡 we obtain the following ample. We annotate each article and extract the named entities predictions: with Spacy 1. To capture the context, we compute the vector • ¯𝑠 ′ 𝑓 ( and all such that ′ , that ′ 𝑥 ) for each 𝑠 ∈ S 𝑡 𝑠 ∈ S 𝑡 𝑡 𝑡 ,𝑙 representation of each entity by replacing it with a mask token is the probability that 𝑥 is a target entity for the slot key and feeding the entire sentence through a pre-trained version 1https://spacy.io/ 2https://huggingface.co/transformers/ 10 Learning to fill the slots from multiple perspectives Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Figure 1: High-level overview of the proposed methodology. 𝑠 , where 𝑠 is a slot key from the topic 𝑡 , using the topic- We have collected the Wikipedia articles and Wikidata in- specific base learner trained on examples from the same formation of 913 earthquakes from 2000 to 2020 in 6 different language on the topic ′ 𝑡 that also has the slot key 𝑠, languages, namely English, Spanish, German, French, Italian and • ¯𝑠 𝑠 𝑓 ( ( and for each Dutch. 
We have manually annotated the entities of 85 English ′ 𝑥 ) which equals ¯ 𝑓 ′ 𝑦) for each 𝑠 ∈ S𝑡 𝑡 ,𝑙 𝑡 ,𝑙 language ′ 𝑙 such that there is an article reporting about articles using the slot keys number of deaths, (number of injured the same event 𝑒 in that language and contains an entity and magnitude, which serve as a labeled test set and are not in- 𝑦 which is matched to 𝑥 , cluded in the training process. In addition, we have collected the • ¯𝑠 𝑓 (𝑥 ) for each 𝑠 ∈ S , using the shared base learner, which data of 315 terrorist attacks from 2000 to 2020 with the articles 𝑡 𝑙 is on examples from all topics ′ from the same 6 languages. 𝑡 that have the slot key 𝑠. Predictions from multiple base learners for each 𝑥 and 𝑠 are 3.2 Evaluation Settings combined as a weighted average to obtain a single prediction 𝑠 The evaluation for each approach is performed on the labeled 𝑓 (𝑥 ). The weight of each base learner ¯ 𝑓 is determined by its error rate English dataset, where 76 entities are labeled as number of deaths, 𝑒 ( ¯ 𝑓 ) which is estimated using an approach from [7] using both unlabeled and labeled examples. This is done by introducing 45 as number of injured and 125 as magnitude. The threshold the following logical rules (referred to as ensemble rules in [7]) values for the pseudo-labeling are set to 𝑇 = 0.6 and 𝑇 = 0.05. 𝑝 𝑛 for each of the base learners ¯𝑠 The approaches differ by the subset of base learners used to form 𝑓 predicting for 𝑥: ¯ the combined prediction and by the weighting of the predictions. 𝑠 𝑠 𝑠 ¯𝑠 𝑠 𝑠 𝑓 (𝑥 ) ∧ ¬𝑒 ( ¯ 𝑓 ) → 𝑓 (𝑥 ), 𝑎𝑛𝑑 , 𝑓 (𝑥 ) ∧ 𝑒 ( ¯ 𝑓 ) → ¬𝑓 (𝑥 ), Single or multiple languages. In single language setting, only ¬ ¯𝑠 𝑠 𝑠 𝑠 𝑠 𝑠 𝑓 (𝑥 ) ∧ ¬𝑒 ( ¯ 𝑓 ) → ¬𝑓 (𝑥 ), 𝑎𝑛𝑑 , ¬ ¯ 𝑓 (𝑥 ) ∧ 𝑒 ( ¯ 𝑓 ) → 𝑓 (𝑥 ). English articles are used to extract the entities and train the base The truth values are not limited to Boolean values, but instead learners. In the multi-language setting, all available articles are represent the probability that the corresponding ground predicate used and the entities are matched across the articles from the or rule is true. For a detailed explanation of the method we refer same event. the reader to [7]. We introduce a prior belief that the predictions of base learners are correct via the following two rules: Single or multiple topics. In the single topic setting only the examples from the earthquake topic are used. In the multi-topic ¯𝑠 𝑠 𝑠 𝑠 𝑓 (𝑥 ) → 𝑓 (𝑥 ), 𝑎𝑛𝑑 , ¬ ¯ 𝑓 (𝑥 ) → ¬𝑓 (𝑥 ). setting, the examples from terrorist attacks are used as negative Since each examples for the slot key magnitude, the base learners for the 𝑥 can be target entity for at most one slot key, we introduce a mutual exclusion rule: slot keys number of deaths and number of injured are combined as described in the section 2.6. ¯ ′ 𝑠 𝑠 𝑠 𝑓 (𝑥 ) ∧ 𝑓 (𝑥 ) → 𝑒 ( ¯ 𝑓 ). Uniform or estimated weights. In the uniform setting all pre- The rules are written in the syntax of a Probabilistic soft logic dictions of the base learners contribute equally, while in the [2] program, where each rule is assigned a weight. We assign estimated setting the weights of the base learners are estimated a weight of 1 to all ensemble rules, a weight of 0.1 to all prior using the approach described in the section 2.6. belief rules and a weight of 1 to all mutual exclusion rules. The inference is performed using the PSL framework 3. 
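The full ensemble inference above is carried out with the PSL framework; the sketch below shows only the simpler variant in which the base learners' positive-class probabilities for one (entity, slot key) pair are combined as a uniformly or error-rate weighted average and then thresholded with T_p = 0.6 and T_n = 0.05. All learner outputs, error rates and the "1 - error rate" weighting are invented for the example and are not the authors' exact implementation.

```python
import numpy as np

T_P, T_N = 0.6, 0.05   # pseudo-labelling thresholds reported in the paper

def combine(probs, error_rates=None):
    """Combine P(target entity) from several base learners into one score."""
    probs = np.asarray(probs, dtype=float)
    if error_rates is None:                      # uniform weighting
        weights = np.ones_like(probs)
    else:                                        # down-weight unreliable learners
        weights = 1.0 - np.asarray(error_rates, dtype=float)
    return float(np.average(probs, weights=weights))

def pseudo_label(probs, error_rates=None):
    """Return +1 / -1 pseudo-label, or None when the combined score falls
    between the two thresholds and the example stays unlabeled."""
    score = combine(probs, error_rates)
    if score >= T_P:
        return 1
    if score <= T_N:
        return -1
    return None

# Three hypothetical base learners (e.g. topic-specific, cross-lingual and
# shared) voting on whether an entity fills the slot "number of injured".
print(pseudo_label([0.82, 0.71, 0.64], error_rates=[0.1, 0.2, 0.3]))  # -> 1
```

Examples receiving +1 or -1 are added to the labeled set for the next iteration, while the remaining examples stay in the unlabeled pool.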
As we obtain 3.3 Results and discussion the approximations for all 𝑥 ∈ 𝑋 , we extend the set of positive 𝑢 examples for each slot 𝑠 The results of all experiments are summarized in the table 1. Since 𝑠 with all 𝑥 such that 𝑓 (𝑥 ) >= 𝑇 and 𝑝 the set of negative examples with all 𝑠 the test set is limited to the topic earthquake and English, only a 𝑥 such that 𝑓 (𝑥 ) <= 𝑇 , 𝑛 for predefined thresholds subset of base learners was used to make the final predictions. We 𝑇 and 𝑇 . 𝑝 𝑛 report the average value of precision, recall and F1 across all slot 3 EXPERIMENTS keys. The threshold of 0.5 was used to round the classification probabilities. 3.1 Dataset Single iteration. Approaches in which base learners are trained To evaluate the proposed methodology, we have conducted ex- on the initial seed set for a single iteration achieve higher preci- periments on two topics: earthquakes and terrorist attacks. sion with the cost of a lower recall. We observe that they distin- 3https://psl.linqs.org/ guish almost perfectly between the slots from the seed set and 11 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Patrik Zajec and Dunja Mladenič Table 1: Results of all experiments. The column Single iteration reports the results of approaches where base learners were trained on the seed set only. Results where base learners were trained in the semi-supervised setting with different weightings of the predictions are reported in the columns Uniform weights and Estimated weights. The values of precision, recall and F1 are averaged over all slot keys. Single iteration Uniform weights Estimated weights Model P R F1 P R F1 P R F1 Single language, single topic 0.94 0.64 0.76 0.83 0.75 0.77 0.84 0.76 0.79 Multiple languages, single topic 0.94 0.64 0.76 0.82 0.74 0.76 0.83 0.75 0.77 Single language, multiple topics 0.91 0.76 0.83 0.83 0.83 0.83 0.86 0.83 0.84 Multiple languages, multiple topics 0.93 0.76 0.83 0.82 0.83 0.82 0.84 0.84 0.84 produce almost no false positives. Using one or more languages REFERENCES has almost no effect on the averaged scores when the number [1] Nguyen Bach and Sameer Badaskar. 2007. A Survey on Re- of topics is fixed. When using multiple topics, a higher recall is lation Extraction. Technical report. Language Technologies achieved without a significant decrease in precision. All incorrect Institute, Carnegie Mellon University. classifications of the slot number on injured are actually examples [2] Stephen H Bach, Matthias Broecheler, Bert Huang, and of the number of missing slot that is not included in our set and Lise Getoor. 2017. Hinge-loss markov random fields and likewise almost all incorrect classifications for the slot magnitude probabilistic soft logic. The Journal of Machine Learning are examples of the slot intensity on the Mercalli scale. This could Research, 18, 1, 3846–3912. easily be solved by expanding the set of slot keys and shows how [3] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav important it is to learn to classify multiple slots simultaneously. Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Semi-supervised. Approaches in which base learners are trained Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. iteratively trade precision in order to significantly improve recall. 2019. Unsupervised cross-lingual representation learning Most of the loss of precision is due to misclassification between at scale. arXiv preprint arXiv:1911.02116. 
slots number of deaths and number of injured, similar as the exam- [4] Tianyu Gao, Xu Han, Ruobing Xie, Zhiyuan Liu, Fen Lin, ple "370 people were killed by the earthquake and related building Leyu Lin, and Maosong Sun. 2020. Neural snowball for collapses, including 228 in Mexico City, and more than 6,000 were few-shot relation learning. In Proceedings of AAAI. injured." where 228 was incorrectly classified as number of injured [5] Xu ming Hu, Lijie Wen, Y. Xu, Chenwei Zhang, and Philip S. and not the number of deaths. The use of multiple topics reduces Yu. 2020. Selfore: self-supervised relational feature learning misclassification between these slots and further improves the for open relation extraction. ArXiv, abs/2004.02438. recall as new contexts are discovered by the base learners trained [6] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. on terrorist attacks. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Uniform and estimated weights. Using the estimated error rates Annual Meeting of the ACL and the 4th International Joint as weights for the predictions of base learners shows a slight Conference on Natural Language Processing of the AFNLP, improvement in performance. It may be advantageous to estimate 1003–1011. multiple error rates for topic-specific base learners, as they tend to [7] Emmanouil Platanios, Hoifung Poon, Tom M Mitchell, and be more reliable in predicting examples from the same topic. We Eric J Horvitz. 2017. Estimating accuracy from unlabeled believe that more data and experimentation is needed to properly data: a probabilistic logic approach. In Advances in Neural evaluate this component. A major advantage is its flexibility, Information Processing Systems, 4361–4370. since we can easily incorporate prior knowledge of the slots or [8] Jesper E Van Engelen and Holger H Hoos. 2020. A survey additional constraints on the predictions in the form of logical on semi-supervised learning. Machine Learning, 109, 2, 373– rules. 440. [9] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a 4 CONCLUSION AND FUTURE WORK free collaborative knowledgebase. Communications of the We presented an approach for training the slot-filling system ACM, 57, 10, 78–85. which can benefit from large amounts of data from Wikipedia. The experiments were performed on a relatively small dataset and show that the proposed direction seems promising. However, the right test of our approach would be to apply it to a much larger number of topics and events, which will be done in the immediate next step. Furthermore, the current approach needs to be evaluated in more detail. ACKNOWLEDGMENTS This work was supported by the Slovenian Research Agency and NAIADES European Unions project under grant agreement H2020-SC5-820985. 12 Knowledge graph aware text classification Nela Petrželková∗ Blaž Škrlj Nada Lavrač Jožef Stefan Institute Jožef Stefan Institute and Jožef Stefan Institute Ljubljana, Slovenia Jožef Stefan Int. Postgraduate School Ljubljana, Slovenia nela.petrzelkova@seznam.cz Ljubljana, Slovenia nada.lavrac@ijs.si blaz.skrlj@ijs.si ABSTRACT (2) The proposed method is extensively empirically evaluated, Knowledge graphs are becoming ubiquitous in many scientific indicating that the proposed semantic feature construc- and industrial domains, ranging from biology, industrial engi- tion aids the classification performance on many real-life neering to natural language processing. In this work we explore datasets. 
how one of the largest currently available knowledge graphs, the (3) The implemented method is freely available3 with a simple-Microsoft Concept Graph, can be used to construct interpretable to-use, scikit-learn API. features that are of potential use for the task of text classification. The paper is structured as follows. Section 2 presents the By exploiting graph-theoretic feature ranking, introduced as part background and related work. Section 3 presents the proposed of the existing tax2vec algorithm, we show that massive, real-life approach to semantic feature construction using the information knowledge graphs can be used for the construction of features, from a given knowledge graph. Section 4 describes the experi-derived from the relational structure of the knowledge graph mental setting and the results, followed by a summary and further itself. To our knowledge, this is one of the first approaches that work in Section 5. explores how interpretable features can be constructed from the Microsoft Concept graph with more than five million concepts 2 BACKGROUND AND RELATED WORK and more than 80 million IsA relations for the task of text classi- In text classification tasks, characterized by short documents fication. The proposed solution was evaluated on eight real-life or small amounts of documents, deep learning methods are fre- text classification data sets. quently outperformed by more standard approaches, including SVMs [4]. In such settings, it was shown that approaches capa-KEYWORDS ble of using semantic context may outperform the naïve learn- knowledge graphs, text classification, feature construction, se- ing approaches, the examples are among other based on Latent mantic enrichment Dirichlet Allocation [5], Latent Semantic Analysis [6] or word embeddings [7], which is referred to as first-level context. 1 INTRODUCTION Second-level context can be introduced by adding background Text classification is the process of assigning labels to text accord- knowledge into a learning process, which may help to increase ing to its content. It is one of the fundamental tasks in Natural performance and improve interpretability. Usage of knowledge Language Processing (NLP) with various applications such as graphs also helped in classification with extending neural net- spam detection, topic labeling, sentiment analysis, news catego- work based lexical word embedding objective function [8]. El-rization and many more [1]. In recent years, knowledge graphs— hadad et al. [9] present an ontology-based web document, while real-life graph-structured sources of knowledge—are becoming Kaur et al. [10] propose a clustering-based algorithm for docu-an interesting source of background knowledge, potentially use- ment classification that also benefits from knowledge stored in ful in contemporary machine learning [2]. Knowledge graphs, the underlying ontologies. Use of hypernym-based features was such as DBPedia1 or the Microsoft Concept Graph2 span tens of performed already in e.g., the Ripper rule learning algorithm [11]. millions of triplets of the form subject-predicate-object, and in- Wang and Domeniconi [12] used the derived background knowl-clude many potentially interesting relations, from which a given edge from Wikipedia for text enriching. In short document clas- machine learning algorithm can potentially benefit. 
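As a tiny, self-contained illustration of the subject-predicate-object structure mentioned above (the triplets and the helper function below are made-up examples, not taken from DBpedia or the Microsoft Concept Graph):

```python
# Knowledge-graph triplets of the form (subject, predicate, object); values are illustrative.
triplets = [
    ("strawberry", "IsA", "fruit"),
    ("fruit", "IsA", "food"),
    ("tennis", "IsA", "sport"),
]

def hypernyms_of(term, triples):
    """Direct hypernyms of a term, i.e. objects reached through IsA edges."""
    return [obj for subj, pred, obj in triples if subj == term and pred == "IsA"]

print(hypernyms_of("strawberry", triplets))   # ['fruit']
```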
sification, it was shown that the tax2vec algorithm (described In this work we propose an approach to scalable feature con- below) can help those classifiers gain better results by adding struction from one of the largest freely available knowledge extra semantic knowledge to the feature vectors. graphs, and demonstrate its utility on multiple real life data sets. The tax2vec [3] is an algorithm for semantic feature construc-The main contributions of this work are as follows: tion that can be used to enrich the feature vectors constructed by the established text processing methods such as the tf-idf. It (1) We propose an extension to the tax2vec [3] algorithm for takes as input a labeled or unlabeled corpus of documents and a semantic feature construction, adapting it to operate with word taxonomy, i.e. a directed graph to which parts of a given real-life knowledge graphs comprised of tens of millions document map to. It outputs a matrix of semantic feature vectors of triplets. where each row represents a semantics-based vector representa- 1https://wiki.dbpedia.org/ tion of one input document. It makes it by mapping the words 2https://concept.research.microsoft.com/Home/Introduction from the document to a given taxonomy, WordNet or in this work Microsoft Concept Graph, by which it creates the collection of Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or terms for each document and from it, a corpus taxonomy—a rela-distributed for profit or commercial advantage and that copies bear this notice and tional structure specific to the considered document space. The the full citation on the first page. Copyrights for third-party components of this terms presented in the corpus taxonomy represent the potential work must be honored. For all other uses, contact the owner/author(s). Information society ’20, October 5–9, 2020, Ljubljana, Slovenia features. © 2020 Copyright held by the owner/author(s). 3https://github.com/SkBlaz/tax2vec 13 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Petrželková et al. 3 KNOWLEDGE GRAPH-BASED SEMANTIC Table 1: Part of the Microsoft Concept Graph. The row is FEATURE CONSTRUCTION in form of hypernym - hyponym - frequency of relation Semantic features are constructed as follows. With the help of social network facebook 4987 spaCy library [13], we first find nouns in each document in the symptom fever 4966 corpus and for every noun we find all hypernyms in the associ- sport tennis 4964 ated knowledge graph. Next, we add the most frequent 𝑛 such fruit strawberry 4824 hypernyms to the document-based taxonomy (the number in activity fishing 4789 the third column in Table 1). We identified this step as critical, feature construction, how the text is being processed prior to as the crawl-based knowledge graphs are commonly noisy, and that and how are semantic features used after that. prunning out uncertain relations is of high relevance. After per- forming this for all documents in the corpus, document-based 3.2 Microsoft Concept Graph taxonomies are concatenated into corpus-based taxonomy. Next, we perform feature selection, discussed next. We are using Microsoft Concept Graph4 [15] [16] for obtaining the extra semantic information. This large relational graph con-3.1 Feature selection sists of more than 5.4 million concepts that are a part of more than 80 million triplets. 
The Microsoft Concept Graph was created by harnessing billions of web pages, so it is very general and varied, offering a lot of knowledge to add to the text we want to classify. It contains mostly IsA relations, which is the part we use to obtain hypernyms for nouns in the input text and to enrich the feature vectors with some of them. A part of the downloaded knowledge graph is shown in Table 1. The number in the third column is the count of times the relation was found when creating the knowledge graph, i.e. the frequency of the relation's occurrence. We removed relations that had a frequency of one, which immediately reduced the graph to approximately half its size and removed mostly noisy relations. Later we used the NetworkX library [17] to transform the Microsoft Concept Graph from bare text into a directed graph. This step makes the subsequent exploitation of the knowledge graph easier.

During feature selection we choose a predefined number of features from the constructed feature set, with the goal of selecting the most useful or important ones. Hence, from the set of hypernyms constructed from the knowledge graph, we choose only the top d features (d being the dimension of the semantic space) based on one of the heuristics described below. Closeness centrality of a node is a measure of centrality in a network, calculated as

C(x) = \frac{1}{\sum_{y} d(y, x)},

where d(y, x) is the distance (path length) between vertices x and y. The bigger the closeness centrality value of a given node, the closer it is to all other nodes. The rarest terms are the most document-specific and are more likely to provide more information than the frequently occurring ones. Hence this heuristic simply takes the overall counts of all the hypernyms, sorts them in ascending order by their frequency of occurrence and takes the top d. The mutual information between two discrete random variables represented as vectors X_i (the i-th hypernym feature) and Y (the target binary class) is defined as

MI(X_i, Y) = \sum_{x, y \in \{0, 1\}} p(X_i = x, Y = y) \log_2 \frac{p(X_i = x, Y = y)}{p(X_i = x)\, p(Y = y)},

where p(X_i = x) and p(Y = y) correspond to the marginal distributions of the joint probability distribution of X_i and Y. Tax2vec computes the mutual information (MI) between all hypernym features and a given class, so for each target class a vector of mutual information scores is obtained, corresponding to the MI between individual hypernym features and that class. The MI scores for each target class are then summed up to obtain the final scoring vector.

3.3 Proposed approach extending tax2vec
Firstly, we tokenize each document and assign part-of-speech tags to the tokens with the help of the spaCy library [13]. Then, for each noun in the text, we find its hypernyms in the knowledge graph. The number of hypernyms kept per noun is a parameter chosen by the user; we choose the hypernyms with the highest frequencies of the relation between the current noun and the hypernym. As shown later in the paper, a bigger number of hypernyms does not help much but increases execution time significantly, so it is more sensible to choose a smaller number. We then create a document-based taxonomy, a directed graph with a hypernym-noun edge for each such hypernym and noun. Finally, we merge the document-based taxonomies into one corpus-based taxonomy, maintaining unique nodes.
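A minimal sketch of the construction and ranking steps just described, using spaCy and NetworkX as in the text; the toy IsA relations, function names and the particular NetworkX calls are illustrative assumptions rather than the released tax2vec code, and a small English spaCy model is assumed to be installed.

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")      # small English model, assumed installed

# Toy IsA knowledge graph: noun -> list of (hypernym, relation frequency).
ISA = {"strawberry": [("fruit", 4824)], "tennis": [("sport", 4964)],
       "fever": [("symptom", 4966)], "facebook": [("social network", 4987)]}

def document_taxonomy(text, max_hypernyms=10):
    """Directed graph with a hypernym -> noun edge for each kept IsA relation."""
    g = nx.DiGraph()
    for token in nlp(text):
        if token.pos_ == "NOUN":
            candidates = sorted(ISA.get(token.lemma_.lower(), []),
                                key=lambda pair: pair[1], reverse=True)
            for hypernym, _ in candidates[:max_hypernyms]:
                g.add_edge(hypernym, token.lemma_.lower())
    return g

corpus = ["She plays tennis and eats a strawberry.",
          "The child had a fever after the tennis match."]
# Corpus-based taxonomy: union of the document taxonomies with unique nodes.
corpus_taxonomy = nx.compose_all([document_taxonomy(d) for d in corpus])

# Two unsupervised ranking heuristics over the corpus taxonomy.
closeness = nx.closeness_centrality(corpus_taxonomy)
pagerank = nx.pagerank(corpus_taxonomy, alpha=0.85)   # a personalization dict can bias the walk towards document terms
top_d = sorted(closeness, key=closeness.get, reverse=True)[:3]
print(top_d)
```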
The features are sorted by MI scores in Graph method in the pseudocode) and on it we perform one of descending order and the first 𝑑 features are chosen as the final the above mentioned heuristics to choose the best 𝑑 hypernyms. semantic space. The personalized PageRank algorithm takes Those steps are outlined in Algorithm 1. as an input a network and a set of starting nodes in the network and returns a vector assigning a score to each node. The scores 4 EXPERIMENTS AND RESULTS are calculated as the stationary distribution of the positions of a random walker that starts its walk on one of the starting nodes This section presents the setting of the experiments and the data and, in each step, either randomly jumps from a node to one of sets on which the experiments were conducted. We also describe its neighbors (with probability the metrics used to estimate classification performance. 𝑝 ) or jumps back to one of the starting nodes (with probability 1-𝑝). In our experiments prob- ability 4.1 Data sets 𝑝 was set to 0.85. The tax2vec exploits the idea initially introduced in [14], where personalized PageRank scores are com-We conducted the experiments on eight different data sets, which puted w.r.t. the terms, present throughout the document space. are described below. They were chosen intentionally from differ- This way, a graph-based, completely unsupervised ranking is ent domains and the basic information about them can be seen obtained, and is used in similar manner to other feature selection in Table 2. heuristics discussed in the previous paragraphs. In this section we introduce how the knowledge graph is used for semantic 4https://concept.research.microsoft.com/ 14 Knowledge graph aware text classification Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Data: corpus, knowledgeGraph, maxHypernyms some cases. We compare those results to the classification without corpusTaxonomy = [ ]; any semantic features which is plotted as a grey horizontal line. foreach 𝑑𝑜𝑐 ∈ 𝑐𝑜𝑟𝑝𝑢𝑠 do On the other hand, on the datasets CNN News, Medical Relation documentTaxonomy = [ ]; and SMS Spam we didn’t see any improvement with the addition 𝑡 𝑜𝑘𝑒𝑛𝑠 = tokenize(𝑑𝑜𝑐 ); of semantic features. Figure 2 shows the relation between feature foreach 𝑡𝑜𝑘𝑒𝑛 ∈ 𝑡𝑜𝑘𝑒𝑛𝑠 do space size and the execution times. if 𝑡𝑜𝑘𝑒𝑛 is 𝑛𝑜𝑢𝑛 then edges = knowledgeGraph.edgesFrom(𝑡𝑜𝑘𝑒𝑛); foreach 𝑒𝑑𝑔𝑒 ∈ 𝑒𝑑𝑔𝑒𝑠 do if 𝑙𝑒𝑛(𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝑇 𝑎𝑥𝑜𝑛𝑜𝑚𝑦) >= 𝑚𝑎𝑥 𝐻 𝑦𝑝𝑒𝑟 𝑛𝑦𝑚𝑠 then break; documentTaxonomy.add(𝑒𝑑𝑔𝑒 ∈ 𝑒𝑑𝑔𝑒𝑠) corpusTaxonomy.mergeGraph(documentTaxonomy) featureSelection(corpusTaxonomy) Result: Selected semantic features Algorithm 1: Semantic feature construction. Table 2: Data sets used for evaluation of knowledge graph’s extra features impact on learning. Data set Classes Words Unique w. Documents PAN 2017 Gender 2 5169966 607474 3600 PAN 2017 Age 5 992742 185713 402 SMSSpam 2 86910 15691 5571 CNN-news 7 1685642 159463 2107 MedicalRelation 18 1136326 66235 22176 Articles 20 5524333 178443 19990 SemEval2019 2 295354 39319 13240 Yelp 5 1298353 88539 10000 PAN 2017 (Gender) Given a set of tweets per user, the task is to predict the user’s gender [18]. PAN 2017 (Age) Given a set of tweets per user, the task is to predict the user’s age group [19]. CNN News Given a news article (composed of a number of paragraphs), the task is to assign to it a topic from a list of topic categories. [20]. SMS Spam Given a SMS message, the task is to predict whether it is a spam or not. [21]. 
Medical Relations Given an article with biomedical topic, the task is to predict the relationship between the medical terms annotated. [22]. SemEval 2019 Given a tweet, the task is to predict whether it contains offensive content [23]. Articles Given an web article, the goal is to assign to it a topic. [24]. Yelp Given an review of a restaurant, the goal is to predict the ranking from one to five stars. Settings. In all the datasets the stop words were removed. Stop words are for example "the", "is", "are" etc. There is no uni-Figure 1: Results of text classification on data sets Yelp, versal list of stop words in NLP research, however we used NLTK pan-2017-age, pan-2017-gender, CNN News, SMSSpam, Se- (Natural Language Toolkit) [25] for filtering stop words. The doc-mEval 2019, Medical Relation and Articles with execution uments were tokenized with the help of spaCy’s NLP tool. The times as the numbers in the plot. data sets were divided into 90% training data and 10% test data by using random splits. Number of hypernyms for each noun was 10. We used linear SVM classifier for classification and 𝐹1 5 CONCLUSION measure for performance. We showed that information from a large, real-life knowledge graph can improve text classification. Our approach aims at short 4.2 Results texts like tweets, shorter articles, messages and similar. We firstly Figure 1 shows that on some datasets (namely Yelp, PAN 2017 Age, process the document with spaCy, find nouns with their corre-PAN 2017 Gender and on SemEval 2019 and Articles) the extra sponding hypernyms, from which we create a taxonomy and semantic features constructed from the knowledge graph help in from that we later choose the most helpful features with one 15 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Petrželková et al. [6] T. K. Landauer. 2006. Latent semantic analysis. [7] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. [n. d.] Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26. [8] A. Celikyilmaz, D. Hakkani-Tür, P. Pasupat, and R. Sarikaya. 2015. Enriching word embeddings using knowledge graph for semantic tagging in conversational dialog systems. In. [9] M. K. Elhadad, K. M. Badran, and G. I. Salama. 2018. A novel approach for ontology-based feature vector genera- tion for web text document classification. [10] R. Kaur and M. Kumar. 2018. Domain Ontology Graph Approach Using Markov Clustering Algorithm for Text Classification. Advances in Intelligent Systems and Com- puting, 632. [11] S. Scott and S. Matwin. 1998. Text classification using WordNet hypernyms. In Usage of WordNet in Natural Lan- guage Processing Systems. [12] P. Wang and C. Domeniconi. 2008. Building semantic ker- Figure 2: Results of text classification on data sets SMSS- nels for text classification using wikipedia. In (August pam and SemEval 2019 with execution times as the num- 2008). bers in the plot. [13] M. Honnibal and I. Montani. spaCy 2: natural language un- derstanding with Bloom embeddings, convolutional neu- of the heuristics. The result remains interpretable, which is an ral networks and incremental parsing. To appear, (2017). advantage of this approach. This approach could be potentially [14] J. Kralj, M. Robnik-Sikonja, and N. Lavrac. 2019. Netsdm: improved by performing some type of word sense disambigua- semantic data mining with network analysis. 
Journal of tion and by finding objects in texts, which consists of more than Machine Learning Research, 20, 32, 1–50. one word. Further, other knowledge graphs can be used for the [15] J. Cheng, Z. Wang, J.-R. Wen, J. Yan, and Z. Chen. 2015. hypernym search. Also, because the hypernym search in each Contextual text understanding in distributional semantic document is independent, the documents can be processed in par- space. In ACM International Conference on Information and allel; however, such processing can be memory-intensive, which Knowledge Management (CIKM). is to be addressed. [16] W. Wu, H. Li, H. Wang, and K. Q. Zhu. 2012. Probase: a probabilistic taxonomy for text understanding. In ACM In- ACKNOWLEDGMENTS ternational Conference on Management of Data (SIGMOD). The work of BŠ was financed via a junior research grant (ARRS). [17] A. A. Hagberg, D. A. Schult, and P. J. Swart. 2008. Ex- This paper is supported by European Union’s Horizon 2020 re- ploring network structure, dynamics, and function using search and innovation programme under grant agreement No. networkx. In Proceedings of the 7th Python in Science Con- 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less- ference, 11 –15. Represented Languages in European News Media). The authors [18] F. Rangel, P. Rosso, M. Potthast, and B. Stein. [n. d.] Overview acknowledge also the financial support from the Slovenian Re- of the 5th author profiling task at pan 2017: gender and search Agency for research core funding for the programme language variety identification in twitter. Knowledge Technologies (No. P2-0103), the project TermFrame [19] F. Rangel, P. Rosso, B. Verhoeven, W. Daelemans, M. Pot- - Terminology and Knowledge Frames across Languages (No. thast, and B. Stein. 2016. Overview of the 4th author pro- J6-9372) and the ARRS ERC complementary grant SDM-Open. filing task at pan 2016: cross-genre evaluations. [20] M. Qian and C. Zhai. 2014. Unsupervised feature selection REFERENCES for multi-view clustering on text-image web news data, 1963–1966. [1] K. Kowsari, K. J. Meimandi, M. Heidarysafa, S. Mendu, [21] T. A. Almeida and J. M. G. Hidalgo. 2011. Sms spam col- L. E. Barnes, and D. E. Brown. 2019. Text classification lection v. 1. http : / / www . dt . fee . unicamp . br / ~tiago / algorithms: A survey. CoRR, abs/1904.08067. smsspamcollection/. (2011). [2] Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge [22] 2015. Medical information extraction. https://appen.com/ graph embedding: a survey of approaches and applications. datasets / medical - sentence - summary - and - relation - IEEE Transactions on Knowledge and Data Engineering. extraction/. (2015). [3] 2020. Tax2vec: constructing interpretable features from [23] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, taxonomies for short text classification. Computer Speech and R. Kumar. 2019. Predicting the Type and Target of & Language. Offensive Posts in Social Media. In Proceedings of NAACL. [4] F. Rangel, P. Rosso, M. Potthast, and B. Stein. 2017. Overview [24] 2019. Text classification 20. https : / / www. kaggle. com / of the 5th author profiling task at pan 2017: gender and guiyihan/text-classification-20. (2019). language variety identification in twitter. Working Notes [25] S. Bird, E. Klein, and E. Loper. 2009. Natural Language Papers of the CLEF. Processing with Python. O’Reilly Media. [5] D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent dirichlet allocation. 
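Returning briefly to the mutual-information heuristic of Section 3.1, the scoring over several target classes can be sketched in a few lines; the matrices are made-up examples, and scikit-learn's mutual_info_score uses natural logarithms rather than base 2, which rescales the scores but leaves their ranking unchanged.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Binary document-by-hypernym feature matrix and one 0/1 label column per target class (toy values).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1]])
Y = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

# MI between every hypernym feature i and every class c, then summed over the classes.
mi = np.array([[mutual_info_score(X[:, i], Y[:, c]) for c in range(Y.shape[1])]
               for i in range(X.shape[1])])
scores = mi.sum(axis=1)

d = 2
top_d = np.argsort(scores)[::-1][:d]    # indices of the d best hypernym features
print(scores, top_d)
```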
16 EveOut: Reproducible Event Dataset for Studying and Analyzing the Complex Event-Outlet Relationship Swati Tomaž Erjavec Dunja Mladenić swati@ijs.si tomaz.erjavec@ijs.si dunja.mladenic@ijs.si Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan International Jožef Stefan International Jožef Stefan International Postgraduate School Postgraduate School Postgraduate School Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT relationship and impact of different features on the selection of events by the outlets. We present a dataset consisting of 77, 545 news events collected between January 2019 and May 2020. We selected the top five 1.1 Contributions news outlets based on Alexa Global Rankings and retrieved all the events reported in English by these outlets using the Event The paper makes the following three contributions to science: Registry API. Our dataset can be used as a resource to analyze • The dataset generation scripts, which provide a structured and learn the relationship between events and their selection reproducible approach to building a publicly available by the outlets. It is primarily intended to be used by researchers dataset of news events with varied features. This will not studying bias in event selection. However, it may also be used to only speed up the development of future versions of Eve- study the geographical, temporal, categorical and several other Out, but will also help to create custom datasets with the aspects of the events. We demonstrate the value of the resource desired outlets and features. in developing novel applications in the digital humanities with • The compilation of EveOut, a novel dataset with a rich motivating use cases. Website with additional details is available range of event features and spanning multiple news cate- at http:// cleopatra.ijs.si/ EveOut/ . gories. • Identification of possible use cases intended to facilitate KEYWORDS the creation of tools to improve digital journalism and to Dataset, News Event Analysis, Event selection bias, News cover- help researchers study the complex relationship between age events and news outlets. 1 INTRODUCTION 2 DATASET News outlets are constantly faced with the task of selecting events Several news outlets may cover a single world event as a story in they will report on, dependent on the perceived interest of the a variety of different ways. A collection of one or more stories, all event to their readership. This can be driven by various factors, of which describe the same world event, is referred to as an ‘event’ such as the geographical origin of the event, involvement of in the entire paper. In the following subsections, we define our well-known persons, etc. Such selection requires monitoring of data generation process and provide statistics on the resulting current affairs to determine their news value for the outlet. dataset. Machine learning tools may help outlets to deal with the large numbers of events, help them explore strategies for selecting 2.1 Data Source publishable events, and build dedicated decision support systems We use Event Registry1[4] as the data source which monitors, for this task. The effectiveness of these systems depends on the collects, and provides news articles from news outlets around the availability of news event collections complemented by relevant world in over 30 languages. 
It also identifies the major incidents reported in the articles and aggregates them into clusters known as events. For example, "missiles launched by Iran at US forces in Iraq" is an event reported across the globe in over 3,200 news articles.

To construct an event, Event Registry follows a series of steps. News aggregation is the first step, in which RSS feeds are constantly monitored for new articles. The next major step is semantic event information extraction, which retrieves information from the articles in a structured way to be used in subsequent steps. Clustering algorithms are then used to group articles that describe the same event. In the last step, the article clusters are marked as events and are annotated with rich metadata such as a unique id to track the event coverage, the categories to which it may belong, geographical location, sentiment, etc. As a result, its extensive temporal coverage can be used effectively to study the complex correlation between events and news outlets.

1 https://eventregistry.org

Such collections should be complemented by relevant event details such as date, category, country of occurrence, a brief description, etc. In this paper we introduce EveOut, the first large publicly available data set of 77,545 English news events with a variety of features collected between January 2019 and May 2020. It includes events in eight different categories of news, i.e. business, politics, technology, environment, health, science, sports, and arts-and-entertainment. We hope that EveOut will encourage publishers and others involved in the news production process to develop tools to enhance digital journalism. The data set would also allow researchers from digital humanities to study and analyze the relationship and the impact of different features on the selection of events by the outlets.

[Figure 1: EveOut dataset generation process. Pipeline: Select Outlets (e.g. top 5 global newspapers) -> Set Time Constraint (e.g. 2019-01-01 to 2020-05-31) -> Generate Event List (e.g. eng-4500343) -> Extract Event Info (id, date, title, summary, ...) -> Generate Outlet Label (0 = not covered, 1 = covered) -> EveOut event-outlet dataset.]

Table 1: Description of the dataset attributes.
uri: a unique event identifier
title: title of the event in English
event_date: date in yyyy-mm-dd format
sentiment: event sentiment
categories: event categories
loc_country: country where the event occurred
loc_continent: continent where the event occurred
total_article_count: total number of articles published
article_count: total number of articles published in English
summary: summary of the event
outlet_list: list of outlets that reported the event

2.2 Data Generation Process
To generate the dataset we adopted an automated approach, depicted in Figure 1. We use the Event Registry API to collect the event-related information listed in Table 1. The script is designed to simplify the release of future versions and to make it possible to replicate the process when generating custom datasets. The outlined process is the result of the resource's core requirement to best address the potential use cases referred to in Section 4.

For data generation, we first selected the top five news outlets based on the Alexa Global Rankings2. We then used an explicit temporal query Q_t to retrieve all events in all news categories from the Event Registry API. Q_t = {Q_text, Q_time} consists of the text component Q_text and the time component Q_time. Next, we set the time limit Q_time = [Q_sd, Q_ed] for extracting events that occurred within the specified time, where Q_sd = '2019-01-01' and Q_ed = '2020-05-31' signify the event's start and end dates. Since an outlet's event selection policy may change over time, we selected this time frame because recent data tends to be more reliable in predicting event coverage patterns. We then set Q_text = {Q_out, Q_lang, Q_cat}, where Q_out = {'nytimes', 'indiatimes', 'washingtonpost', 'usatoday', 'chinadaily'}, Q_lang = {'eng'}, and Q_cat = {'politics', 'business', 'sports', 'arts and entertainment', 'science', 'technology', 'health', 'environment'} represent the outlets, languages and news categories, respectively.

From the extracted event list, we first excluded events that were not covered by any of the selected outlets. We then extracted the individual outlets from each event's outlet list and created a column in the dataset to represent each of them. We use a binary scalar value to indicate whether the outlet covered the event or not. The event coverage by the outlets is not uniform, as can be visualized in Figure 2.

[Figure 2: Distribution of event coverage by the outlets (nytimes, chinadaily, indiatimes, usatoday, washingtonpost).]

2 https://www.alexa.com/topsites/category/Top/News/Newspapers

3 AVAILABILITY
The GitHub repository containing the scripts is available at https://github.com/Swati17293/EveOut. To facilitate discoverability and preservation, the full data set is archived as an online resource at https://doi.org/10.5281/zenodo.3953878. EveOut is available in three common formats (JSON, XML, and CSV) for direct download and use. The documentation meets the requirements of the FAIR Data principles3, with all necessary metadata defined. Under the Creative Commons Attribution 4.0 International license, it is freely available to make it reusable for almost any purpose. A separate web page with detailed statistics and illustrations, intended for in-depth analysis, can be found at http://cleopatra.ijs.si/EveOut/.

3 http://www.nature.com/articles/sdata201618/

3.1 Reusability
The resource is currently being used for individual projects and as a contribution to the project deliverables of the Marie Skłodowska-Curie CLEOPATRA Innovative Training Network4. A major part of this project aims to provide a temporal, cross-lingual analysis of concepts around different events, exploring how language impacts the mediatic narratives built by the media. It also aims to analyse news reporting bias and multiple media narratives, which would make it possible to filter out the appropriate information that will then be used to build information representation tools.

4 http://cleopatra-project.eu/
Since EveOut serves as the basis for the study and analysis Figure 4 reveals that instead of favoring events with neutral of events and their attributes, it is ideally suited to the project sentiment, outlets tend to favor events with positive sentiment. needs. In addition, event coverage by ‘usatoday’ and ‘washingtonpost’ is quite diverse with respect to sentiments. 4 POTENTIAL USE CASES 4.1 Examine Event-Selection Bias It is important for a journalist to know which event is worthy enough to be published. Even readers would be interested to know the factors that affect this selection. An automated solution can be devised using EveOut to provide an overview of the event and to visualize differences in coverage. 4.2 Outlet Prediction EveOut is designed to predict the likelihood of an event being covered by the outlet. It would enable the publishers of the outlets to assess the significance of the event. In addition, it may also be used by independent editors who prefer to report on events Figure 4: Distribution of event coverage by the outlets with covered by mainstream outlets. respect to sentiments. 5 STATISTICS AND ANALYSIS In this section we provide further information about the data In terms of the sentiments used in each category as plotted in contained in EveOut, focusing explicitly on the distribution of Figure 5, it is worth noting that ‘technology’ and ‘sports’ events events between the outlets. are mostly positive. With regard to the distribution of event categories covered by the outlets, as shown in Figure 3, ‘politics’ is the most common category, while ‘environment’ is the least common category. It is also worth noting that each outlet focuses on the different categories of events aside from ‘politics’. For instance, ‘india- times’ focuses more on events related to ‘arts and entertainment’, whereas ‘chinadaily’ tends to cover more ‘business’ related events. As far as the coverage of the event over time is concerned, it is also inconsistent as depicted in Figure 6. Furthermore, the event-coverage of ‘usatoday’ and ‘washingtonpost’ is slightly inconsistent. It is also interesting to note the sharp decline in coverage by ‘usatoday’ in ‘Aug 2019’ and by ‘washingtonpost’ in ‘May 2020’. The drop in the graph for washingtonpost in ‘May 2020 is due to its event preference. It is evident from washingtonpost’s radial graph in Figure 3 that its coverage is biased towards politics and sports. These two categories alone represent around 50% of events in the dataset. However, this percentage dropped to 40% in ‘May 2020 and, as a result, the coverage of washingtonpost dropped significantly. Increase of event coverage in ‘Mar 2019 is also attributed to the fact that about 56% of events were from Figure 5: Distribution of category over sentiments. these two categories. In nutshell, if the outlet favors a certain category of events and, in a specific time frame, and events of 19 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Swati, Tomaž Erjavec, and Dunja Mladenić Figure 6: Distribution of the event coverage by the outlets over time. 6 RELATED WORK ACKNOWLEDGMENTS There are a number of datasets that focus on news articles [7]. As This work was supported by the Slovenian Research Agency and far as the availability of event-centric datasets is concerned, there the European Union’s Horizon 2020 research and innovation is a scarcity of publicly available datasets. 
There are few related program under the Marie Skłodowska-Curie grant agreement No research on the event data [3, 1], but the extracted/generated 812997. datasets for the experiments is also not publicly accessible. GDELT [5] is the most popular, very large and publicly avail-REFERENCES able event-oriented news dataset. It contains data in multiple [1] Dylan Bourgeois, Jérémie Rappaz, and Karl Aberer. 2018. languages from a wide range of online publications. It’s collection Selection bias in news coverage: learning it, fighting it. In of world events is centered on location, network and temporal Companion Proceedings of the The Web Conference 2018, 535– attributes. There is no attribute defining the outlet list for the 543. event in the dataset. As a result, there is a lack of knowledge [2] Cindy Cheng, Joan Barceló, Allison Spencer Hartnett, Robert essential to the analysis of the event-outlet relationship that is Kubinec, and Luca Messerschmidt. 2020. Covid-19 govern- the foundation of our dataset. ment response event dataset (coronanet v. 1.0). Nature Hu- In addition, the existing event datasets [6, 2] are category-man Behaviour, 1–13. dependent (politics/healthcare/disaster etc.) which renders them [3] Felix Hamborg, Norman Meuschke, and Bela Gipp. 2018. useful for specific research purposes only. Therefore, by providing Bias-aware news analysis using matrix-based news aggre- a generalized event-centric news dataset, EveOut addresses the gation. International Journal on Digital Libraries, 1–19. stated dataset bottleneck. [4] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Gro- belnik. 2014. Event registry: learning about world events 7 CONCLUSIONS AND FUTURE WORK from news. In Proceedings of the 23rd International Confer- In this paper, we introduced the EveOut dataset, which covers ence on World Wide Web, 107–110. events reported by the top five global news outlets for over 17 [5] Kalev Leetaru and Philip A Schrodt. 2013. Gdelt: global data months. We have ensured that the dataset complies with the on events, location, and tone, 1979–2012. In ISA annual FAIR principles. In conjunction with the data set, we provide the convention. Volume 2, 1–49. source code for reproducing the dataset with varied features. [6] Clionadh Raleigh, Andrew Linke, Håvard Hegre, and Joakim For instance, it is possible to generate a reduced version of Eve- Karlsen. 2010. Introducing acled: an armed conflict location Out, focused on just one category, say ‘politics’. Specific outlets, and event dataset: special data feature. Journal of peace dates, and languages can also be specified in accordance with research, 47, 651–660. the requirements. We illustrate potential use cases to show how [7] Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, the dataset could be used to study the pattern of event coverage Tao Qi, Jianxun Lian, Danyang Liu, X. Xie, Jianfeng Gao, of an individual outlet and to predict whether or not the outlet Winnie Wu, and M. Zhou. 2020. Mind: a large-scale dataset will cover a specific event. Researchers from digital humanities for news recommendation. In Proceedings of the 58th Annual can also use it for an in-depth analysis of complex event-outlet Meeting of the Association for Computational Linguistics, relationships. In the future , we intend to extend the dataset to 3597–3606. doi: 10 . 18653 / v1 / 2020 . acl - main . 331. https : include events described in different languages. //www.aclweb.org/anthology/2020.acl- main.331. 
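To make the generation process of Section 2.2 concrete (and to show how a reduced, custom version of the dataset could be assembled, as mentioned in the conclusions), the following sketch builds the query constraints and the binary coverage columns; the dictionary layout, toy events and column names are illustrative assumptions, and the actual retrieval through the Event Registry API is not shown.

```python
import pandas as pd

# Query constraints mirroring Q_t = {Q_text, Q_time} from Section 2.2 (illustrative structure only).
OUTLETS = ["nytimes", "indiatimes", "washingtonpost", "usatoday", "chinadaily"]
query = {
    "time": {"start": "2019-01-01", "end": "2020-05-31"},
    "text": {
        "outlets": OUTLETS,
        "languages": ["eng"],
        "categories": ["politics", "business", "sports", "arts and entertainment",
                       "science", "technology", "health", "environment"],
    },
}

# Toy extracted event list; in the real pipeline this comes from the Event Registry API.
events = pd.DataFrame({
    "uri": ["eng-4500343", "eng-4500999"],
    "categories": ["politics", "sports"],
    "outlet_list": [["nytimes", "usatoday"], ["chinadaily"]],
})

# One binary column per outlet: 1 if the outlet covered the event, 0 otherwise.
for outlet in OUTLETS:
    events[outlet] = events["outlet_list"].apply(lambda lst: int(outlet in lst))

# Keep only events covered by at least one selected outlet; an additional category
# filter would produce a reduced version of the dataset (e.g. politics only).
events = events[events[OUTLETS].sum(axis=1) > 0]
print(events)
```

The same binary columns are what an outlet-prediction model (Section 4.2) would use as its targets.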
20 Ontology alignment using Named-Entity Recognition methods in the domain of food Gorjan Popovski1,2∗ , Tome Eftimov1 , Dunja Mladenić1,2 and Barbara Koroušić Seljak1,2 1Jožef Stefan Institute, 1000 Ljubljana, Slovenia 2Jožef Stefan International Postgraduate School, 1000 Ljubljana, Slovenia {gorjan.popovski, tome.eftimov, dunja.mladenic, barbara.korousic}@ijs.si Abstract Terminology-driven NER methods, also called dictionary- based NER methods [Zhou et al., 2006], match text phrases In recent years, a great amount of research has against concept synonyms that exist in the terminological re- been done in predictive modeling in the domain sources (dictionaries). The main disadvantage of these meth- of healthcare. Such research is facilitated by the ods is that only the entity mentions that exist in the resources existence of various biomedical vocabularies and will be recognized, but the benefit of using them is related to standards which play a crucial role in understand- the frequent updates of the terminological resources with new ing healthcare information. In addition, the Unified concepts and synonyms. Medical Language System (UMLS) links together Rule-based NER methods [Hanisch et al., 2005] use regu-biomedical vocabularies to enable interoperability. lar expressions that combine information from terminological However, in the food domain such resources are resources and characteristics of the entities of interest. The scarce. To address this issue, this paper explores a main disadvantage of these methods is the manual construc- methodology for ontology alignment in the domain tion of the rules, which is a time-consuming task and depends of food by leveraging Named-Entity-Recognition on the domain. (NER) methods based on different semantic re- Corpus-based NER methods [Alnazzawi et al., 2015; Lea- sources. It is based on a recently published rule- man et al., 2015] are based on an annotated corpus provided based NER method named FoodIE, whose seman-by subject-matter experts as well as the use of ML tech- tic annotations are based on the Hansard corpus, niques to predict the entities’ labels. These methods are less as well as a NER tool called Wikifier, from which affected by terminological resources and manually created DBpedia URIs are extracted. To perform the align- rules. However, their limitation is their dependence on an ex- ment we use the FoodBase corpus, which consists istence of an annotated corpus for the domain of interest. The of recipes annotated with food entities and includes construction of the annotated corpus for a new domain is a a ground truth version which is additionally used time consuming task and requires effort by the subject-matter for evaluation. experts to produce it. To exploit unlabelled data in constructing NER methods, 1 Introduction AL can be used [Settles, 2010; Tran et al., 2017]. This represents semi-supervised learning in which an algorithm is Information Extraction (IE) is the task of automatically ex- able to interactively query the user to obtain the desired la- tracting information from unstructured data and, in most bels/outputs at new data points. Which examples are sent cases, is concerned with the processing of human language to the user for labelling is chosen by the algorithm and their text by means of natural language processing (NLP) [Aggar- number is often much lower than the number of examples re- wal and Zhai, 2012]. The main idea behind IE is to provide quired for supervised learning. 
It usually consists of three a structure to the information extracted from the unstructured components: (1) the annotation interface, (2) the corpus- data. based NER, and (3) component for querying samples. One of the core IE tasks is named-entity recognition (NER), which addresses the problem of identification and classification of predefined concepts [Nadeau and Sekine, 2 Related work 2007]. It aims to determine and identify words or phrases in text into predefined labels (classes) that describe concepts 2.1 Hansard corpus of interest in a given domain. Various NER methods ex- ist: terminology-driven, rule-based, corpus-based, methods The Hansard corpus is a collection of text and concepts cre- based on active learning (AL), and methods based on deep ated as a part of the SAMUELS project [Alexander and An- neural networks (DNNs). derson, 2012; Rayson et al., 2004]. It contains 37 higher level semantic groups, one of which is our topic of interest — Food ∗Contact Author and Drink. 21 2.2 FoodIE Having annotated the recipes with both methods, we can FoodIE is a rule-based food Named-Entity Recognition perform the ontology alignment by using the location infor- method [Popovski et al., 2019a]. As it is rule-based, it con-mation for each annotation in each recipe. Each unique con- sists of a rule-engine in which the rules are based on compu- cept from both methods (semantic resources) is assigned its tational linguistics and semantic information that describe the unique ID, and then a table is constructed for each concept food entities. mapping containing the IDs. 2.3 Wikifier 5 Evaluation and experimental setup Wikifier is a tool that uses an efficient approach for annotating 5.1 Match types documents with relevant concepts from Wikipedia [Brank et • al., 2017]. It is based on a pagerank method to identify a set of True Positives (TP) — these are matches where the relevant concepts. As it provides the location in the document whole food concept is correctly annotated; where the annotation occurs, it is effectively a Named-Entity • False Positives (FP) — these are matches where a non- Recognition method. It provides Wikipedia concepts as anno- food concept is annotated as a food concept; tations, additionally assigning DBpedia concepts if they exist. • False Negatives (FN) — these are matches where a food entity is not properly annotated; 3 Data • Partial match — these are matches where only some to- A recent publication provides one of the first annotated cor- kens from a food concepts are properly annotated. pora, named FoodBase [Popovski et al., 2019b], containing food entities. It consists of two version, a ground truth set 5.2 Evaluation metrics referred to as “curated” (containing 1,000 annotated recipes), Using the concept of True Positives, False Positives and False as well an “un-curated” version, consisting of around 22,000 Negatives, we compute the widely used evaluation metrics: recipes. The recipe categories that are included are: Appe- Precision (P), Recall (R) and F1 Score (F1). They are defined tizers and snacks, Breakfast and Lunch, Dessert, Dinner, and as: Drinks. In this paper, we use the curated version to perform • the ontology alignment as well as evaluate the methodology. P = T P T P +F P This version was manually checked by subject-matter ex- • R = T P perts, so the false positive food entities were removed, while T P +F N the false negative entities were manually added in the corpus. • F 1 = 2 P ·R P +R An example of a recipe can be found on Figure 1. 
6 Results and discussion 4 Ontology alignment After running the evaluation, we obtain the following results. Using FoodIE and the Wikifier tool, we obtain annotations The matches for both methods are presented in Table 1, while for all 1,000 recipes from the FoodBase. the evaluation metrics are presented in Table 2. FoodIE extracts and annotates each recipe with semantic tags from the Hansard corpus. Each annotation contains the Table 1: Match types. location of the extracted entity, i.e. where in the raw text the surface form representing the concept occurs, and its corre- FoodIE Wikifier sponding semantic tags from the Hansard corpus. TPs 11461 6380 The Wikifier tool is used to annotate the recipes with DB- FNs 684 4121 pedia URIs. As these are general DBpedia concepts, ad- FPs 258 5861 ditional information to filter out food concepts from non- Partial 359 3297 food concepts is required. Webscraping the pages for the URIs provides useful information that can be used to dis- tinguish food from non-food concepts, such as the broader Table 2: Evaluation metrics. concept/class to which the concept of interest belongs. The post-processing of the DBpedia URIs checks the entity type FoodIE Wikifier of the concept and checks if it is one of: “FOOD”, “FOODS”, F1 Score 0.9605 0.5611 “DISH”, “INGREDIENT”, “FOOD AND DRINK”, “BEV- Precision 0.9780 0.5212 ERAGE”, “PLANT”, “ANIMAL”, or “FUNGUS”. If it does Recall 0.9437 0.6076 not belong to one of the above entity types, the page is checked for mentions of other URIs which are semantically From the results in the tables it is evident that FoodIE pro- related to food: “FOOD”, “PLANT”, “ANIMAL”, or “FUN- vides more promising results. However, this was expected as GUS”. These URI mentions can occur anywhere in the page this NER method was specifically constructed to only cater and if one of these matches is satisfied, the entity is assumed to the domain of food. Of especial interest are the matches of to be a food entity. type partial, since they represent a match where only a subset A post-processed example of such an annotation can be of the tokens in a food entity are correctly recognized. For found on Figure 2. example, looking at Figure 1, the first extracted food entity 22 Figure 1: Example recipe from the “curated” part of FoodBase. Figure 2: Wikifier annotation example on a single recipe 23 should be “dry ranch salad dressing”, which is correctly ex- [Alnazzawi et al., 2015] Noha Alnazzawi, Paul Thompson, tracted by FoodIE. Looking at Figure 2, the same food entity Riza Batista-Navarro, and Sophia Ananiadou. Using text is only extracted as “salad”. Such match types do not factor mining techniques to extract phenotypic information from in the calculation of the evaluation metrics, as it is debatable the phenochf corpus. BMC medical informatics and deci- whether to count them as TPs or FNs. Nevertheless, they sion making, 15(2):1, 2015. are interesting to compare, since even partial matches con- [Brank et al., 2017] Janez Brank, Gregor Leban, and Marko vey at least some semantic meaning regarding the food entity. Grobelnik. Annotating documents with relevant wikipedia Moreover, FP annotations on the same figure are “bowl” and concepts. Proceedings of SiKDD, 2017. “shape” which are not food entities. 
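The evaluation metrics in Table 2 follow directly from the match counts in Table 1, so they can be reproduced in a few lines (the counts below are copied from Table 1):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Match counts from Table 1.
for name, (tp, fp, fn) in {"FoodIE": (11461, 258, 684),
                           "Wikifier": (6380, 5861, 4121)}.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
# FoodIE:   P=0.9780 R=0.9437 F1=0.9605
# Wikifier: P=0.5212 R=0.6076 F1=0.5611
```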
Additionally, a recent comparison of existing food NER methods can be found in [Hanisch et al., 2005] Daniel Hanisch, Katrin Fundel, [Popovski et al., 2020], where the authors compare the per-Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane formance of FoodIE with NER methods using other food on- Fluck. Prominer: rule-based protein and gene entity tologies available in the BioPortal. recognition. BMC bioinformatics, 6(1):S14, 2005. Regarding the mapping of the concepts, a total of 348 ex- [Leaman et al., 2015] Robert Leaman, Chih-Hsuan Wei, plicit concept mappings were discovered by the methodology. Cherry Zou, and Zhiyong Lu. Mining patents with tm- An example mapping for the concept “garlic” would be: chem, gnormplus and an ensemble of open systems. In • A000016: ‘garlic’, AG.01.h.02.e [Onion/leek/garlic]. Proce. The fifth BioCreative challenge evaluation work- shop, pages 140–146, 2015. • E000029: ‘garlic’, http://dbpedia.org/resource/Garlic [Nadeau and Sekine, 2007] David Nadeau and Satoshi 7 Conclusion and future work Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, In this work we propose a methodology for ontology align- 2007. ment by using Named-Entity Recognition methods in the do- main of food. It utilizes the newly proposed FoodIE NER [Popovski et al., 2019a] Gorjan Popovski, Stefan Kochev, method and the Wikifier text annotation tool. Our experimen- Barbara Koroušić Seljak, and Tome Eftimov. Foodie: A tal results show that FoodIE provides more promising results rule-based named-entity recognition method for food in- than Wikifier, achieving an F 1 score of 0.9605, compared formation extraction. In Proceedings of the 8th Inter- to 0.5611. This is expected since FoodIE is specifically de- national Conference on Pattern Recognition Applications signed for the food domain, while Wikifier uses general vo- and Methods, (ICPRAM 2019), pages 915–922, 2019. cabulary and annotates text with Wikipedia concepts. [Popovski et al., 2019b] Gorjan Popovski, Barbara Koroušić For future work, recursive webscraping can be performed Seljak, and Tome Eftimov. FoodBase corpus: a new re- to more accurately distinguish between food and non-food source of annotated food entities. Database, 2019, 11 annotated concepts from the Wikifier tool. Specifically, this 2019. baz121. would mean repeating the steps to check if the entity is a [Popovski et al., 2020] G. Popovski, B. K. Seljak, and T. Ef- food entity or not on the parent nodes in DBpedia. Addition- timov. A survey of named-entity recognition methods ally, more food semantic resources can be included to provide for food information extraction. IEEE Access, 8:31586– mapping between multiple ontologies. Doing this is depen- 31594, 2020. dent on the existence of a NER method that works with con- cepts from the desired food semantic resource. [Rayson et al., 2004] Paul Rayson, Dawn Archer, Scott Piao, and AM McEnery. The ucrel semantic analysis system. Acknowledgements 2004. This research was supported by the Slovenian Research [Settles, 2010] Burr Settles. Active learning literature sur- Agency (research core grant number P2-0098), and the Eu- vey. University of Wisconsin, Madison, 52(55-66):11, ropean Union’s Horizon 2020 research and innovation pro- 2010. gramme (FNS-Cloud, Food Nutrition Security) (grant agree- [Tran et al., 2017] Van Cuong Tran, Ngoc Thanh Nguyen, ment 863059). The information and the views set out in this Hamido Fujita, Dinh Tuyen Hoang, and Dosam Hwang. 
A publication are those of the authors and do not necessarily re- combination of active learning and self-learning for named flect the official opinion of the European Union. Neither the entity recognition on twitter using conditional random European Union institutions and bodies nor any person acting fields. Knowledge-Based Systems, 132:179–187, 2017. on their behalf may be held responsible for the use that may [Zhou et al., 2006] Xiaohua Zhou, Xiaodan Zhang, and Xi- be made of the information contained herein. aohua Hu. Maxmatcher: Biological concept extraction us- ing approximate dictionary lookup. In Pacific Rim Interna- References tional Conference on Artificial Intelligence, pages 1145– [Aggarwal and Zhai, 2012] Charu C Aggarwal and ChengX- 1149. Springer, 2006. iang Zhai. Mining text data. Springer Science & Business Media, 2012. [Alexander and Anderson, 2012] Marc Alexander and J An- derson. The hansard corpus, 1803-2003. 2012. 24 Extracting structured metadata from multilingual textual descriptions in the domain of silk heritage M.Besher Massri Dunja Mladenić Jožef Stefan Institute, Slovenia Jožef Stefan Institute besher.massri@ijs.si Jožef Stefan International Postgraduate School Ljubljana, Slovenia dunja.mladenic@ijs.si ABSTRACT processing and annotation, we generated 24 binary datasets and 19 multi-class datasets (four for English, two for Spanish, and In this paper, we present a methodology for extracting structured one for French). Using machine learning techniques we trained metadata from museum artifacts in the field of silk heritage. The classifiers on the labeled data examples to predict the labels (slot main challenge was to train on a relatively small and noisy data values) based on the textual descriptions. Despite relatively small corpus with highly imbalanced class distribution by utilizing a and unbalanced data corpora, using sampling techniques and variety of machine learning techniques. We have evaluated the weighted loss function helped mitigate the issue. In an experi- proposed approach on real-world data from five museums, two mental evaluation, we observed that on our data using traditional English, two Spanish, and one French. The experimental results methods might be as good as using deep learning models when show that in our setting using traditional machine learning al- the data is scarce. However, using deep learning allows for build- gorithms such as Support Vector Machines gives comparable ing multilingual models that scale across different languages. and in some cases better results than multilingual deep learning The main contribution of this paper is in proposing an ap- algorithms. The study presents an effective approach for catego- proach to adding metadata to historical artifacts based on ap- rization of text described artifacts in a niche domain with scarce plying machine learning on multilingual textual descriptions of data resources. the artifacts. Moreover, we have defined the learning problem in KEYWORDS collaboration with domain experts and performed evaluations on real-world data in English, Spanish, and French. The rest of this Information extraction, Text classification, Silk heritage, Trans- paper is structured as follows. Section 2 provides a description of formers, Support Vector Machines. the data, Section 3 describes the proposed methodology, Section 4 gives the results of the evaluation and Section 5 concludes the 1 INTRODUCTION paper summarizing the approach and the findings. 
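A small sketch of how a record's categorical attributes can be turned into a binary label for one slot value, following the string-matching rule detailed in Section 3.1 below; the slot values and alternatives shown here are an illustrative subset for the weave slot, not the experts' full list.

```python
# Slot values for the 'weave' slot with a few alternatives (illustrative subset).
WEAVE_VALUES = {
    "satin": ["satin"],
    "twill": ["twill"],
    "tabby": ["tabby", "plain weave"],
}

def label_for(categorical_text, target, slot_values=WEAVE_VALUES):
    """True / False / None (indeterminate, example removed) for one slot value."""
    text = categorical_text.lower()
    mentioned = {value for value, alts in slot_values.items()
                 if any(alt in text for alt in alts)}
    if target not in mentioned:
        return False              # no mention of the target value
    if mentioned == {target}:
        return True               # only the target value (or its alternatives) is mentioned
    return None                   # several slot values mentioned -> indeterminate

print(label_for("Silk, satin weave", "satin"))            # True
print(label_for("Satin and twill fragments", "satin"))    # None
print(label_for("Plain weave linen", "satin"))             # False
```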
When looking to improve the understanding of silk heritage we find that the data available in the museums often lack seman- tic information on the artifacts or have them to some extent 2 DESCRIPTION OF DATA included in textual descriptions. To facilitate automatic analysis We used the SilkNow knowledge graph [8] as our source of data. of silk heritage data and support digital modeling of the weaving The source consists of records of different museums in different techniques, we propose multilingual metadata extraction from languages as shown in Table 1. The largest are MET with8364 textual descriptions provided by the museums. artifacts in English, VAM with 7231 artifacts in English, and Ima- We propose the usage of machine learning techniques to model tex with 6799 artifacts in Spanish. We have used a subset of the the target variables, referred here as slots to align with the ter- data that contain artifacts with provided metadata and textual minology of information extraction. Using machine learning descriptions in related fields that were pointed out as relevant by methods we build a model for each of the target variables in the domain experts. Each record consists of the basic information order to annotate the text. This enabled us to add metadata to about the object, such as the title and the museum it belongs to, the silk heritage artifacts of the museums. The domain experts along with two other sets of attributes, textual attributes, and collaborating on Silknow project [9] have identified four kinds categorical attributes. Textual attributes hold a textual descrip-of metadata information that would be useful and are contained tion of the object in several fields, such as physical description in texts of at least some of the targeted museums. We treat these and a technical description. The categorical description holds as four slots for information extraction, where the list of possible metadata information, such as technique or materials used. How- slot values for each of the four was defined by the domain experts. ever, the data quality varies across the museums and records. Based on that we formed a multi-class dataset for each slot. Some museums are rich in both textual and categorical attributes, The corpora of text included were in three different languages like the VAM museum, and others have short/low-quality textual (English, Spanish, and French) from five different museums, with attributes like Imatex. Also, some records have a text description a total of 500 museum records used in the study. After the data in their categorical attributes instead of a single category value. The metadata fields that we have considered are weaving Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or technique, weave, motifs, and style. The list of labels or slot distributed for profit or commercial advantage and that copies bear this notice and values for each of the metadata field (i.e. slot for information the full citation on the first page. Copyrights for components of this work owned extraction) were compiled by the domain experts. These values by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior describe the silk artifacts’ nature and structure. Each of those specific permission and /or a fee. Request permissions from permissions@acm.org. 
slot values is represented by a term and a list of alternatives, up Information society ’20, October 5–9, 2020, Ljubljana, Slovenia to four alternatives per term. Examples of slot values are satin, © 2020 Association for Computing Machinery. twill, and tabby, representing possible values of the weave slot. 25 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Museum Language Count The features were generated from sequences of words, referred CER Spanish 1296 to as n-grams, of length 1, 2, and 3. The remaining parameters Garin Spanish 3101 were left unchanged from their default values. We used nltk [1] Imatex Spanish 6799 library for tokenization, SpaCy [4] for lemmatization, and Snow Joconde French 376 Ball Stemmer [6] for stemming. MAD French 763 Due to the methodology of data labeling, we sometimes ended MET English 8364 up with a highly imbalanced datasets having a lot more negatives MFA English 3297 than positives. Therefore, in the binary dataset, we took a random MTMAD French 663 subset from the negative examples to match the positive count. In RISD English 3338 addition, some examples were generated from the same records, by having more than one textual record with mentions of the VAM English 7231 Table 1: Museums from the Silknow knowledge graph same class’s term/alternatives, therefore, corrections have been showing the language of the artifacts and the number of applied to the dataset by putting all examples of the same record artifacts included in the knowledge graph. in either train or test but not in both. This process was done to ensure no leakage occurs by potentially having highly similar textual text in train and text. 3.3 Multi-class Classification Tasks 3 METHODOLOGY For multi-class classification, we used a deep learning approach. The architecture consists of a pre-trained transformer, an LSTM 3.1 Annotating datasets with slot values layer, a dropout layer, a dense (linear) layer, and finally a soft-max Based on the data and target variables, two types of datasets activation layer. For the transformer we used BERT [3], multi-were formed for two types of text classification tasks. The first lingual BERT, and XLM-ROBERTA [2]. The loss function used type is binary classification dataset, in which the target class was a cross-entropy loss with Adam as the optimizer. We used is one of the slot values. The other is multi-class classification PyTorch framework [7] and hugging-face transformers library dataset, in which a dataset is formed for each of the four slots in [10]. each museum, where the target classes are the slot values that fall Considering that some of the datasets have a large class imbal- under the selected slot in addition to extra "other" class indicating ance, which can be a couple of thousand examples of the majority that the example doesn’t fall under any of them. class and only a few examples of the minority classes, we exper- For forming the binary classification dataset we used a simple imented with several class-weighting schemas. First, we tried string matching approach. For each target class in each museum, assigning weights to the classes in the loss function is inversely examples were formed out of textual attributes of the museum proportional to the number of examples of each class. In addi- records that contain a mention of either one of the possible value tion, when we used weighted sampling with return for loading terms or its alternatives. Categorical attributes of the same record the examples into batches. 
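As an illustration of the two weighting schemas just mentioned, the following sketch (in PyTorch, which the authors report using) shows inverse-frequency class weights passed to the cross-entropy loss and a weighted sampler that draws examples with replacement when loading batches; the function and variable names are ours and only indicate the idea, not the project's actual code.

import torch
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_weighted_objects(labels, train_dataset, batch_size=16):
    # `labels` is assumed to be a list of integer class ids (0 .. K-1), one per
    # training example in `train_dataset`.
    counts = Counter(labels)
    num_classes = len(counts)

    # (1) Loss weights inversely proportional to the class frequencies.
    class_weights = torch.tensor([1.0 / counts[c] for c in range(num_classes)])
    loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

    # (2) Sampling with replacement: every example is drawn with probability
    # proportional to the inverse frequency of its class, which over-samples
    # minority classes and under-samples the majority class within a batch.
    example_weights = [1.0 / counts[y] for y in labels]
    sampler = WeightedRandomSampler(example_weights, num_samples=len(labels),
                                    replacement=True)
    loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
    return loss_fn, loader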
This had the effect of over-sampling were used to determine the label of the example. The task is to the minority classes and under-sampling the majority classes to classify whether the example has the slot value against the other achieve as balanced batch representation as possible. Finally, we slot values of the same slot. Each item is classified as True if tried a derivable version of F1 Macro as a loss function where the the categorical attributes contain only the target value or one prediction matrix is taken as a probability rather than a binary of its alternatives but not any of the other slot values’ terms value. or their alternatives. If there is no mention of the slot value term or alternatives, then it’s classified as false. If it contains 4 RESULTS this slot value’ term along with other slot values’ terms then it’s 4.1 Experimental Datasets considered as indeterminate and the example is removed. To form the multi-class datasets, we merged the datasets of The dataset collection methodology was applied to 10 museums the same museum with target classes representing slot values and 4 categories holding more than 150 class values overall. How- that fall under the same slot. The true items of each slot value ever, most of the datasets have no positive items. In this research, dataset formed the set of the examples with that slot value as the we have selected datasets with at least 10 positive examples for labels. The items that are false in each slot value dataset formed binary classification tasks and at least 10 non-other in multi- the "Other" class in the multi-class dataset. class tasks. This final list consists of 24 binary datasets and 19 multi-class datasets. These datasets are used for training machine 3.2 Binary Classification Tasks learning classifiers. For binary classification, we used TFIDF word-vector represen- 4.2 Binary Classification Tasks tation for generating the feature vectors and trained a Linear For binary Classification, we applied the described methodology Support Vector Machines (SVM) as the classifier using scikit- on all the datasets with at least 10 positive examples. The results learn library [5]. All dataset were split into train and test using of binary classification are consolidated in Table 2. 80-20 stratified split. We performed a grid search with 5-fold The graph in figure 1 displaying the correlation between the cross validation on the training part using the following options: number of examples and the F1 score reveals a weak correlation • stemming, lemmatisation, or none of 0.19. We can see that when having more than 600 examples, we • max document frequency: [0.95.1.0] achieve F1 over 0.8. Upon closer inspection on the museum level, • min document frequency: [0,0.05] we found that the best results are achieved in the MFA museum on • SVM tolerance: [1e-4,1e-5] motifs and weaving technique and Joconde museums on weave. 
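To make the binary-classification set-up of Section 3.2 concrete, a minimal scikit-learn sketch is given below: TF-IDF features over word 1-, 2- and 3-grams, a linear SVM, an 80-20 stratified split and a 5-fold grid search over the listed document-frequency and tolerance values. The stemming/lemmatisation choice is assumed to have been applied to the input texts beforehand, and all names are illustrative rather than taken from the project code.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train_binary_slot_classifier(texts, labels):
    # `texts` holds the (already stemmed or lemmatised) textual attributes and
    # `labels` the 0/1 annotations produced by the string-matching step.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0)

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),  # word 1-, 2-, 3-grams
        ("svm", LinearSVC()),
    ])
    grid = {
        "tfidf__max_df": [0.95, 1.0],
        "tfidf__min_df": [0.0, 0.05],
        "svm__tol": [1e-4, 1e-5],
    }
    search = GridSearchCV(pipeline, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_estimator_.score(X_test, y_test)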
26 Extracting structured metadata from multilingual textual descriptions in the domain of silk heritage Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Museum Slot value Slot Language #Exs Accuracy Precision Recall F1 cer bordado weaving technique Spanish 278 0.89 0.87 0.93 0.9 cer motivo vegetal motifs Spanish 146 0.57 0.56 0.6 0.58 cer tafetán weave Spanish 581 0.77 0.9 0.6 0.72 cer terciopelo weaving technique Spanish 118 0.67 0.67 0.67 0.67 garin brocatel weaving technique Spanish 932 0.88 0.85 0.92 0.89 garin damasco weaving technique Spanish 1748 0.9 0.92 0.87 0.89 garin espolÃn weaving technique Spanish 972 0.88 0.89 0.88 0.88 joconde Satin weave French 159 0.91 0.9 0.95 0.93 joconde Taffetas weave French 110 0.95 0.92 1 0.96 mfa Lace motifs English 190 0.92 0.9 0.95 0.92 mfa plain weaving technique English 130 1.00 1.00 1.00 1.00 vam brocade weaving technique English 634 0.87 0.87 0.87 0.87 vam damask weaving technique English 480 0.84 0.85 0.83 0.84 vam Ear motifs English 262 0.83 0.84 0.81 0.82 vam Edge motifs English 178 0.81 0.87 0.72 0.79 vam embroidery weaving technique English 1614 0.85 0.86 0.83 0.84 Table 2: Results for the binary classification task. Overall the best results are achieved by MFA and Joconde with because of the large fluctuation in F1 macro value across training an average F1 of .96 and .95 respectively followed by Garin, VAM, epochs caused by having minority classes with few examples. and CER with the average F1 of .89, .81, and .72 respectively. Model configuration Accuracy F1 Base model 84.6 43.1 Weighted loss 82.1 47.2 Weighted sampling 82.6 52.2 F1 loss function 77.5 59.1 weighted sampling and f1 loss 52 22.8 Weighted loss and weighted sampling 84.8 54.7 + Learning rate 1e-4 − → 5𝑒 − 6 86.1 57.9 Multi-Lingual BERT 85.3 55.2 XLM-ROBERTA 87.5 53.6 Table 3: Comparison between different model configura- tion on the Weave Slot detection in VAM Dataset Figure 1: F1 score vs #Examples showing good perfor- mance on the largest datasets, when the number of exam- ples is at least 600. Comparing the learning curves of BERT and multi-lingual BERT in figure 2 reveals that despite the comparable results, the multi-lingual BERT took double the number of epochs to 4.3 Multi Class Classification Class stabilize and finish training compared to its BERT counterpart. 4.3.1 Use Case: Detecting Weave Slot from VAM museum. We This can be due to the fact that Multi-lingual BERT is trained in selected the VAM Weave slot as a use case dataset to perform many languages and it needs more fine-tuning to adapt to any hyperparameter tuning and select the best configurations for certain language, whereas the BERT transformer was trained in weighting. The dataset contains 2760 items with a baseline of English-only documents. 52.9% distributed across 4 classes: Satin, Tabby, Twill, and Other. The dataset was split into train, test, and validation in the form 4.3.2 Generalizing towards all datasets. After we experimented of 60-20-20 split. The results in Table 3 show that using class with different parameter settings, we decided to use the follow-weighting in both loss function and sampling provides the best ing parameters on all the datasets: Weighted Loss function and −6 results w.r.t both classification accuracy and F1. Using F1 as a loss Weighted Sampling for batches; learning rate of 5 ∗ 10 ; batch function sometimes provided good results but was discarded as size of 16 for BERT and 12 for multi-lingual BERT and XLM- it was not stable across different datasets. 
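For reference, the deep-learning architecture of Section 3.3 (pre-trained transformer, LSTM layer, dropout, linear layer and soft-max) can be sketched roughly as follows. This is a simplified reconstruction using the reported hyper-parameter values (1024 LSTM units, dropout of 0.5), not the authors' actual implementation; the soft-max is left to the cross-entropy loss, and the weighted loss and sampler sketched earlier, together with Adam at a learning rate of 5e-6, complete the training loop.

import torch.nn as nn
from transformers import AutoModel

class TransformerLSTMClassifier(nn.Module):
    # Pre-trained transformer -> LSTM -> dropout -> linear; class probabilities
    # follow from applying soft-max to the returned logits.
    def __init__(self, num_classes, model_name="bert-base-multilingual-cased",
                 lstm_units=1024, dropout=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_units,
                            batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(lstm_units, num_classes)

    def forward(self, input_ids, attention_mask):
        token_states = self.encoder(input_ids=input_ids,
                                    attention_mask=attention_mask).last_hidden_state
        lstm_states, _ = self.lstm(token_states)
        summary = self.dropout(lstm_states[:, -1, :])   # last LSTM state
        return self.out(summary)                        # class logits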
In addition, decreasing ROBERTA, due to memory limits; 1024 Units for LSTM Layer; the learning rate improved results and stabilized the training dropout layer of 0.5. curve. Finally, using the XLM-ROBERTA transformer showed an Moreover, the datasets were tested against three types of trans- improvement in accuracy. The number of epochs was determined former: Language-Specific BERT, Multilingual BERT, and XLM- based on the accuracy performance of the validation dataset. The ROBERTA, as well as the SVM classifier. The accuracy results in training would stop when the accuracy did not improve for the Table 4 show that on most of the datasets SVM performs better last 15 epochs. The accuracy (F1 micro) was chosen over F1 macro or comparable to the deep learning models. 27 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Museum Lang Slot Baseline # Cls # Exs SVM BERT Multi BERT XLM-ROBERTA VAM English Weave 52.9 4 2760 82.8 86 85.3 87.5 VAM English Weaving Technique 35.9 14 3525 77.6 80.1 78 78 VAM English Motifs 84.8 9 5500 91 90.6 87.4 87 CER Spanish Weave 59.3 5 945 75.1 75.1 64 72 CER Spanish Weaving Technique 61.1 11 720 74.3 74.1 71.5 66 Joconde French Weave 55.6 4 180 66.7 30.6 86.1 91.7 Joconde French Weaving Technique 60 5 150 97.2 70 76.7 63.3 Table 4: Results for the multi-class classification task. ACKNOWLEDGMENTS This work was supported by the Slovenian Research Agency and SilkNow European Unions Horizon 2020 project under grant agreement No 769504. REFERENCES [1] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media. [2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Figure 2: Comparison of a learning curve between BERT Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard and Multi-Lingual BERT as a transformer in the deep Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. learning model trained on the VAM museum Weave Slot 2020. Unsupervised cross-lingual representation learning dataset. at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, (July 2020), 8440–8451. doi: 10.18653/v1/2020.acl- main.747. https://www.aclweb. org/anthology/2020.acl- main.747. 5 CONCLUSION AND FUTURE WORK [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: pre-training of deep bidirectional We propose an approach to extracting metadata from a multilin- transformers for language understanding. arXiv preprint gual text description of silk heritage domain museum artifacts. arXiv:1810.04805. The datasets had several specifics that made the model devel- [4] Matthew Honnibal and Ines Montani. spaCy 2: natural opment a non-trivial task. First, the size of the dataset some- language understanding with Bloom embeddings, con- times was too small to train a model. Second, some class values volutional neural networks and incremental parsing. To have considerably more examples than others, which caused appear, (2017). the datasets to be imbalanced. Finally, in the preparation phase, [5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. the datasets were labeled to accommodate the described issues, Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, which in itself is an approximation and carries an inherent error V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. rate. We have improved the performance of the model by over- Brucher, M. Perrot, and E. Duchesnay. 2011. 
Scikit-learn: sampling minority classes, under-sampling majority classes, and machine learning in Python. Journal of Machine Learning using a class-weighted loss function. In addition, by perform- Research, 12, 2825–2830. ing cross-validation in the binary classification case or adding a [6] Martin F. Porter. 2001. Snowball: a language for stemming dropout layer and validating based on a validation dataset, we algorithms. Published online. Accessed 11.03.2008, 15.00h. managed to mitigate some of the over-fitting behavior caused by (2001). http : / / snowball . tartarus . org / texts / introduction . having a little amount of data. We believe that the over-fitting html. could be mitigated further by using regularization on the LSTM [7] [n. d.] Pytorch: an imperative style, high-performance layer, as well as using weight-decaying in the optimizer. deep learning library. In. The experimental results show that with low data quality and [8] 2020. Silknow knowledge graph data. https://github.com/ having not enough data, traditional methods such as SVM in silknow/converter/tree/master/output. (2020). some cases outperform deep neural network models. We expect [9] 2020. SilkNow project. https://silknow.eu/. (2020). that the results could be improved by having an assembly of [10] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. those models instead of using one of them only, which is a part Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, of the future work. Furthermore, one can fine-tune each model S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, independently to achieve better performance. T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. In future work, we plan to test cross-museum learning by Rush. [n. d.] Huggingface’s transformers: state-of-the-art training on one museum and predicting other museums both in natural language processing. the same language and in different languages using multi-lingual transformers. This has practical value for labeling the data in the museums that do not contain metadata information but do have suitable textual descriptions of the artifacts. 28 Hierarchical classification of educational resources Gregor Žunič Erik Novak Jožef Stefan Institute Jožef Stefan Institute Ljubljana, Slovenia Jožef Stefan International Postgraduate School gregor.zunic@ijs.si Ljubljana, Slovenia erik.novak@ijs.si ABSTRACT 2 RELATED WORK This paper describes an approach to automate the process of la- There are two approaches to hierarchically classify the data: (1) the belling hierarchically structured data. We propose a top-down level- Big-bang, and (2) the Top-down level-based approach [4, 8, 9]. based approach with SVMs to classify the data with scientific do- The big-bang approach works by training (complex) global main labels. The model was trained on labeled open education classifiers which consider the entire class hierarchy as a whole. lectures and returns high accuracy predictions for lectures in the Each global classifier is binary and decides if the material fits the English language. We found that our model performs better with entire hierarchy (entire hierarchy is for example “Computer Sci- the traditional text extraction method TF-IDF than with pre-trained ence/Machine Learning/Support Vector Machine”). The advantage language model XLM-RoBERTa. of this approach is that it avoids class-prediction inconsistencies across multiple levels. 
The major drawback of this approach is the KEYWORDS high complexity due to the enforcing the model to correctly predict hierarchical classification, support vector machine, multi-class clas- the whole hierarchy branch, which can be difficult to achieve. sification, machine learning, open educational resources The top-down level-based approach works by training local classifiers at each level to distinguish between its child nodes. An ACM Reference Format: example will first, at the root level, be classified into a second- Gregor Žunič and Erik Novak. 2020. Hierarchical classification of educa-level category. It will then be further classified at the lower level tional resources. In Proceedings of Slovenian KDD Conference (SiKDD’20). category until it reaches one or more final categories where it can ACM, New York, NY, USA, Article 4, 4 pages. https://doi.org/10.475/123_4 not be classified any further. The main advantage of this model is its simplicity. The disadvantage is the difficulty to detect an error 1 INTRODUCTION in the parent category which could lead to false classification. Manually labeling data can be tedious work; one must have suf- The most common implementation of a local classifier [3] is the ficient background knowledge about the data and have clear in-support vector machine [7, 11]. In the later papers they propose to structions in the labeling process. This becomes even more difficult train separate SVMs for every level of a branch in the hierarchy. when the data needs to be annotated with hierarchically structured labels. 3 DATA SET In this paper we present a top-down level-based approach us- ing support vector machines (SVMs) for labeling open education The data set used in the experiment consists of 28,769 OER lec- resources (OERs). The labels are in a hierarchical structure and tures available at Videolectures.NET [10], an award winning video represent different scientific domains. We compare different lecture OER repository. For each lecture we collected the following meta- representations using TF-IDF and XLM-RoBERTa and find that the data: title, description, labels, language, authors, date published and TF-IDF representations yield better results. Even though the paper the length of the lecture. The description is present in 58% of the focuses on OERs the method can be generalized to any textual data lectures. The data set contains 24532 lectures in English, 3930 in set. Slovene and 307 lectures in other 16 languages. The remainder of the paper is structured as follows. Section 2 Preprocessing. For our methodology we used only the lecture’s describes the related work done on the topic of hierarchical classifi- title, description, language and categories. Each lecture is labeled cation. Next, we present the data used in the evaluation in Section 3. with one or more scientific (sub-)domains most relevant for the The methodology is described in Section 4. The evaluation setting lecture (e.g. “Computer Science”, "Computer Science/Crowd Sourc-and its results are described in Section 5 followed by a discussion ing"). Figure 1 shows the distribution of lectures per number of in Section 6. We present the future work in Section 7 and conclude labels. the paper in Section 8. Almost half of the lectures have more than one label. Lectures with no labels are placed under the “No Labels” category. 
These lectures are mostly introductory speakers’ presentations in confer- Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed ences. We focus on predicting a single label with high accuracy. We for profit or commercial advantage and that copies bear this notice and the full citation prescribed to only have one label per lecture. We achieve this by on the first page. Copyrights for third-party components of this work must be honored. duplicating a lecture For all other uses, contact the owner/author(s). 𝑛 times, where 𝑛 is the number of labels of SiKDD’20, October 2020, Ljubljana, Slovenia the lecture and assign a distinct label to each duplicate. Although © 2020 Copyright held by the owner/author(s). the duplicates may reduce the performance of the models we do ACM ISBN 123-4567-24-567/08/06. not reduce the already small number of lectures used during the https://doi.org/10.475/123_4 29 SiKDD’20, October 2020, Ljubljana, Slovenia Gregor Žunič and Erik Novak XLM-RoBERTa. The model is based on the RoBERTa model released in 2019. It is a large language model trained on 2.5 TB of CommonCrawl data [2]. The model achieves state-of-the-art performance on cross-lingual classification, sequence labeling and question answering. The most useful feature of the model is that it does not require the sentence language as an input. In theory, it extracts the same vectors for similar words in 100 languages. The length of the vector that the model outputs is 768. To ex- tract the features a CUDA-enabled GPU is required and the model training is very slow. 4.2 Multi-class SVM Classifier Figure 1: Distribution of lectures per number of correspond- We chose the top-down level-based approach for our classifier. The ing labels. Most of the lectures have only one label. raw text input is firstly vectorized following one of the two feature extraction approaches described in Section 4.1. The vector is then training process. Figure 2 shows the top scientific domain labels in input to the main SVM which determines the first category. Then the data set. the input is handled by the second SVM, trained specifically for sub- labels of first classified category. If a sub-label tops the threshold of 0, this step is repeated, otherwise the model outputs the lowest level parent category. For example “Computer Science” is the first determined cate- gory. Then the input is handled by the SVM trained on sub-labels of “Computer Science”, which determines that the input does not match with any of the sub-labels. The model puts the lecture in the “Computer Science” category. This is visually explained in figure 3. Input ... “Machine Feature “Computer 0 . 1 - 0 . 2 Learning” SVM Figure 2: Top scientific domain labels in the data set. The extraction Science” most frequent label is Computer_Science. SVM “Semantic - 0 . 7 “Business” - 0 . 7 SVM . . . Web” The most frequent label is “Computer Science”. In addition, a “Social - 1 . 0 SVM . . . Sciences” large number of lectures are not labeled; this is because a lot of ... lectures are presentations that do not correspond to any of the scientific domains. The data set is unbalanced on both domain and Figure 3: Visual representation of hierarchical SVM classi- sub-domain levels. fier. 
The example shows a lecture classified as belonging to the “Computer Science” category 4 METHODOLOGIES In this section we describe the methods used to perform the feature Each SVM is an implementation of a multi-class classifier using extraction of the text, the implementation of multi class classifier the one-vs-rest approach. Predicted class should always be domi- model and the lectures’ weights. nant otherwise the recommendation is not relevant. The input to the classifier is a raw string created by concatenating the title and the description if the description is available. It is then 4.3 Lecture Weights converted to a vector. In this paper we experimented with two Each lecture is assigned a weight of 1 , 𝑥 = 4, where 𝑛 is the 𝑥 𝑛 approaches: TF-IDF and XLM-RoBERTa. number of total labels in the original lecture and 𝑥 is a parameter. If 𝑥 < 4 the accuracy is greatly reduced, if 𝑥 > 4 the accuracy is 4.1 Feature Extraction increased by a small margin. It converges when 𝑥 → ∞. When TF-IDF. Each lecture is represented with a vector of its TF-IDF increasing the parameter 𝑥 the weight comes closer to 0 which values [6]. TF measures how frequently a term occurs in a lecture’s means that the model accounts for data less during training. This text. The IDF is a measure of how much information the word means that the 4th power is a sufficient balance between excluding provides. If it is common across all lectures its value is close to 0. some data and reducing the accuracy. The terms with the highest TF-IDF scores are usually the ones that The other approach could be to ignore multi-label lectures during characterize the topic of the lecture best. testing phase ( 1∞ ). 𝑛 The size of the lecture’s vector representation is exactly the same Because some labels are so scarce, we limit ourselves to labels as the total number of unique words. Since most of the features are with at least 20 lectures. This reduces the total number of labels in zero the lecture vectors are sparse. the data set from 502 to 244. 30 Hierarchical classification of educational resources SiKDD’20, October 2020, Ljubljana, Slovenia 5 EVALUATION the model would opt for SVMs trained on features extracted using 5.1 Parameters and Specifications TF-IDF, because of the better performance. All other languages would be handled by SVMs trained by XLM-RoBERTa, because the SVM. The SVM implementation used in the evaluation is the Lin- classifier performs much better than random. earSVC [1] with the default parameters. The TD-IDF method could also be used to classify lectures that XLM-RoBERTa. The model used for representation generation are in the non-english languages by firstly translating the text to is the hugging face’s pretrained model [5] which was trained on English before using them during training. With this approach the default parameters found in the paper [2]. The training was exe-model could work in all languages and retain the simplicity of TF- cuted on the Google Colab (online hosted Jupyter notebook) free IDF. Note that that this approach would be strongly dependant on tier machine (12GB RAM, dual core CPU, NVIDIA K80). the quality of the translations. Weighting the errors during the training process. We did 5.2 Results not use the hierarchy structure for calculating the error between Table 1 shows the performance of the different models with linear the predicted and the actual labels hence all the errors types during kernel. We have also evaluated other kernels (polynomial, RBF, training were the same. 
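A rough sketch of the top-down prediction walk described in Section 4.2 is given below. The dictionary svm_per_node and the assumption that each local SVM is trained over at least three sub-labels (so that decision_function returns one score per sub-label) are ours, introduced only to keep the illustration short; the lecture vector may come from either the TF-IDF or the XLM-RoBERTa representation.

import numpy as np

def predict_top_down(vector, svm_per_node, root="Root"):
    # `svm_per_node` maps a category to the multi-class SVM trained on its
    # sub-labels; leaf categories are simply absent from the dictionary.
    node = root
    while node in svm_per_node:
        svm = svm_per_node[node]
        scores = svm.decision_function([vector])[0]   # one score per sub-label
        best = int(np.argmax(scores))
        if scores[best] <= 0:       # no sub-label tops the threshold of 0,
            break                   # so output the lowest-level parent category
        node = svm.classes_[best]
    return node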
This is not ideal because the error should sigmoid), but the performance was worse than using linear kernel. be more significant when the classifier incorrectly predicts the That is why we omitted them from the performance table. main branch versus when it incorrectly predicts a lower level label. TF-IDF with linear kernel SVM. Using the TF-IDF method for For example, if we take a lecture that is labeled as “Computer feature extraction we found that the SVMs performed the best with Science/Machine Learning” then the error should be bigger if our linear kernel. One explanation for such results is that the dimension classifier predicts the “Biology” label rather than the “Computer of the features is large (more than 60k), which means that other Science/Semantic Web” label. more advance kernels might lead to over-fitting. XLM-RoBERTa with linear kernel SVM. The model’s perfor- 7 FUTURE WORK mance was worse than using TF-IDF. The accuracy of the main We intend to improve the performance of the XLM-RoBERTa and classifier was 19% compared to 70% when using TF-IDF. The other to experiment with other language models and try to achieve better SVM kernels (polynomial, RBF, sigmoid) performed worse com- performance. pared to linear kernel. Table 1 shows the performance of the model. One additional direction for future work might be training a SVM. The problem with current SVM implementation is that it multiclass classifier to predict more than one label to a given lecture. can only put the lecture in one category. One way to solve the issue We tried implementing the multi label output classifier using the of only one label would be to firstly predict one label. Then, if the MultiOutputClassifier wrapper on SVM but the precision of the user (editor) wants another prediction, the model can output the model was noticeably lower. prediction with second highest certainty. The model is ready to be used in production in Videolectures.NET TF-IDF vs XLM-RoBERTa. The advantage of choosing XLM- as a recommender engine to help the editors. The service could RoBERTa over of TF-IDF is that it works with 100 languages. The either be wrapped in a Flask microservice or directly into Videolec- vector outputs are similar [2] for all languages. This was proven tures.NET’s backend. by translating the same text input into multiple languages (using Google Translate) and the predicted category did not change. When 8 CONCLUSION using TF-IDF you have to split the original data set into subsets In this paper we explore a top-down level-based approach for clas- containing a single language and train the model from scratch. That sifying OER lectures with scientific domain labels. We used over- would be possible with enough data. For some languages (German, sampling to handle label unbalance and experimented with two French) the the data set contains less than 30 lectures, which means text representation approaches, TF-IDF and XLM-RoBERTa. We that you can not train an SVM sufficiently. found that the model using the TF-IDF representations gives better results. 6 DISCUSSION ACKNOWLEDGMENTS Unbalanced Data Set. We found the SVM trained on an over- sampled data set to be working better than the SVM trained on the This work was supported by the Slovenian Research Agency and raw data set. Due to the unbalanced data if the data set is not re- X5GON European Unions Horizon 2020 project under grant agree- sampled the bias towards the strongest category (Computer Science) ment No 761758. is strongly presented. 
For example neutral words such as “ ”, “the” etc. are classified as belonging in REFERENCES Computer Science category. Comparing Word Embedding Techniques. The TF-IDF ap- [1] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, proach performs much better than XLM-RoBERTa which is surpris-Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël ing. Pre-trained models usually perform better than legacy feature Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining extractors. The reason could be that the hyper parameters of the and Machine Learning. 108–122. model were not set correctly, but we did not find the right balance [2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guil-for the model to perform any better. The production versions could laume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised Cross-lingual Representation Learning include both models. For languages with a lot of data in the data set, at Scale. arXiv preprint arXiv:1911.02116 (2019). 31 SiKDD’20, October 2020, Ljubljana, Slovenia Gregor Žunič and Erik Novak parent TF-IDF XLM-RoBERTa materials category acc. recc. F prec. acc. recc. F prec. Root 70% 69% 72% 75% 19% 11% 19% 68% 27009 Computer Science 59% 59% 60% 61% 9% 4% 8% 50% 12935 Machine Learning 60% 55% 59% 64% 11% 5% 9% 26% 3260 Semantic Web 75% 71% 75% 79% 23% 20% 31% 68% 454 Computer Vision 82% 79% 81% 83% 57% 55% 59% 63% 140 Social Sciences 73% 72% 73% 74% 35% 24% 34% 60% 2928 Society 74% 72% 72% 72% 36% 28% 38% 60% 890 Politics 76% 66% 75% 86% 59% 43% 54% 73% 83 Law 96% 96% 96% 96% 57% 41% 51% 67% 112 Journalism 100% 100% 100% 100% 91% 88% 90% 92% 53 Technology 84% 82% 82% 82% 50% 43% 50% 60% 970 Nanotechnology 69% 59% 69% 83% 46% 37% 46% 62% 78 Business 74% 72% 73% 74% 43% 36% 43% 54% 1009 Transportation 63% 53% 61% 71% 33% 22% 32% 56% 267 Humanities 85% 83% 84% 85% 55% 48% 55% 65% 873 Biology 71% 66% 67% 68% 23% 17% 22% 31% 430 Science 78% 77% 78% 79% 53% 51% 52% 53% 656 Medicine 89% 88% 89% 90% 39% 34% 48% 83% 326 Computers 83% 83% 83% 83% 55% 48% 53% 59% 731 Mathematics 89% 87% 89% 91% 41% 36% 38% 40% 421 Physics 86% 81% 85% 89% 36% 32% 38% 46% 227 Arts 88% 87% 85% 83% 45% 40% 49% 63% 338 Visual Arts 100% 100% 100% 100% 62% 56% 70% 92% 159 Design 52% 46% 55% 68% 23% 9% 14% 30% 104 Chemistry 100% 100% 100% 100% 85% 83% 91% 100% 161 Environment 94% 94% 93% 92% 71% 66% 73% 81% 161 Earth Sciences 73% 67% 74% 82% 50% 51% 50% 49% 27 Table 1: Comparison of model performance using the linear kernel. The performance of the TF-IDF approach is better than that of XLM-RoBERTa. [3] Susan Dumais and Hao Chen. 2000. Hierarchical Classification of Web Content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’00). Association for Computing Machinery, New York, NY, USA, 256–263. https://doi.org/10.1145/345508.345593 [4] A. D. Gordon. 1987. A Review of Hierarchical Classification. Journal of the Royal Statistical Society: Series A (General) 150, 2 (1987), 119–137. https://doi.org/10. 2307/2981629 arXiv:https://rss.onlinelibrary.wiley.com/doi/pdf/10.2307/2981629 [5] huggingface. 2020. huggingface.co - pretrained models. https://huggingface.co/ transformers/pretrained_models.html. [6] J.D. 
Rajaraman, A.; Ullman. 2011. Mining of Massive Datasets. pp. 1–17. http: //i.stanford.edu/~ullman/mmds/ch1.pdf. [7] Ahmad Shalbaf, Reza Shalbaf, Mohsen Saffar, and Jamie Sleigh. 2020. Monitoring the level of hypnosis using a hierarchical SVM system. Journal of Clinical Monitoring and Computing 34, 2 (2020), 331–338. https://doi.org/10.1007/ s10877-019-00311-1 [8] Carlos N. Silla and Alex A. Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 1 (2011), 31–72. https://doi.org/10.1007/s10618-010-0175-9 [9] Aixin Sun, Ee-Peng Lim, and Wee-Keong Ng. 2003. Performance measurement framework for hierarchical text classification. Journal of the American Society for Information Science and Technology 54, 11 (2003), 1014–1028. https://doi.org/10. 1002/asi.10298 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/asi.10298 [10] VideoLectures.Net. 2020. VideoLectures.NET - VideoLectures.NET. https:// videolectures.net/. Accessed: 2020-08-20. [11] S. V. M. Vishwanathan and M. Narasimha Murty. 2002. SSVM: a simple SVM algorithm. 3 (2002), 2393–2398 vol.3. 32 Are You Following the Right News-Outlet? A Machine Learning based approach to outlet prediction Swati Dunja Mladenić swati@ijs.si dunja.mladenic@ijs.si Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan International Postgraduate School Jožef Stefan International Postgraduate School Ljubljana, Slovenia Ljubljana, Slovenia ABSTRACT outlet is forced to select a set of reporting events. Several factors, such as the geographical origin of the event, the involvement of In this work, we propose a benchmark task of outlet prediction an elite person or country, etc. influences such selection. Also and present a dataset of English news events tailored to the the procedure requires rigorous monitoring of current affairs to proposed task. Addressing this problem would not only allow determine the news value, and may result in event selection bias readers to choose and respond to relevant and broader facets also known as gatekeeping bias. of events but also enable the outlets to examine and report on their work. We also propose a neural network based approach However, no well-established automated method reveals to to recommend a list of probable outlets covering an event of users the outlets that will cover the event of their interest. This interest. Evaluation results reveal that even in its simplest form, drives the motivation of this study. The aim is to predict a list of our model is capable of predicting the outlet significantly better outlets reporting on a given event. Addressing this problem would than the existing rule based approaches. The proposed model not only allow readers to choose and respond to relevant and will also serve as a baseline for evaluating approaches intended broader facets of events but also enable the outlets to examine and to address the task. Implementation scripts can be found at https: // github.com/ Swati17293/ outlet-prediction report on their work. For instance, some outlets tend to publish . events covered by well-established outlets. Instead of waiting for KEYWORDS the news to be published, the proposed system will help them to get an insight into the degree of predictability of event selection News bias, Event Selection bias, News coverage, News Event by the major outlets. 
Analysis, Recommendation System 1 INTRODUCTION 1.1 contributions We make the following contributions in this context: The advancement in the field of Natural Language Processing [9, 10, 5, 4] over the last decade, has made solutions to complex • We propose a benchmark task of outlet prediction and machine learning problems more convenient. The problems such present a dataset of English news events tailored to the as machine translation, text summarization, and segmentation proposed task. are being solved much more efficiently than ever before. Conse- • We provide a neural network model that can serve as a quently, it offered the researchers the opportunity to use these baseline for evaluating approaches intended to address advanced techniques to solve problems in a variety of contexts the task. such as news bias analysis. This analysis task is poised as the The GitHub repository containing our code is available at identification of the inherent bias present in the news production https:// github.com/ Swati17293/ outlet-prediction. and its coverage process. It occurs when a news outlet publishes a news story selectively or incorrectly. 1.2 Problem Statement The problem is addressed as an outlet prediction task in which the If the news is biased, then it can bias the thought process bias is examined by comparing the learning ability of a classifier and decision making of the person listening, watching, and/or trained to predict the probability of event coverage by an outlet. reading it [12]. It can have several direct or indirect implications whether political or social. For example, if the news shows only 2 LITERATURE REVIEW the positive or negative side of a political party; it has been ob- During the different stages of news production, various forms of served to influence the public vote [2]. Not only politics but also news bias arise as described by Baker et al. [1]. The first stage the news about the disaster or spread of viral disease affects the begins with the selection of events also called gatekeeping, where belief system of the general public. an outlet selects or rejects an event for reporting. The selection process is driven by a number of factors, such as the geographical There are numerous events that happen continuously, and origin of the event, the involvement of an elite person or country, any form of bias can arise in numerous possible ways. It is not etc., and requires rigorous monitoring of current affairs to de- possible for any single outlet to capture every event. Thus, an termine the news value. To our knowledge, only a few methods Permission to make digital or hard copies of part or all of this work for personal have been suggested that explicitly attempt to examine this bias. or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this Saez-Trumper et al. [11] attempted to identify bias in online work must be honored. For all other uses, contact the owner /author(s). news sources and social media groups surrounding them. They Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia studied the disparity in the selection of events based on the quan- © 2020 Copyright held by the owner/author(s). 
tity and exclusivity of stories published by 80 mainstream news 33 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Swati and Dunja Mladenić outlets across the globe over a span of two weeks. From the re- 3.2 Dataset view, it is found that there is a weak correlation between the For our experiments, we first selected the top three news outlets quantity and exclusivity of news articles published by the outlets. 3 based on Alexa Global Rankings . We then used the Event Reg- It is also discovered that both the news and social media follow istry API to collect all news events reported in English between the same pattern of selection of events in similar geographical January 2019 and May 2020. We excluded events that were not areas. However, media in the same region often choose the same covered by any of the selected outlets. We ended up with 51, 409 events and publish similar-length posts. events for which we extracted basic information such as event id, title, summary, and source. Since the event coverage by these out- Bourgeois et al. [3] used a matrix factorization method to ex-lets is not uniform, which can be visualized in Figure 1, we used tract latent factors that determine the selection of the event by a stratified split to mimic this imbalance across the generated an outlet. They combined the method with a BPR optimization train-valid-test sets. scheme developed by Rendle et al.[8]. They used the events derived from the GDELT dataset and arranged the outlets in rows and their reported events in columns to form a matrix. Each cell value of the resulting matrix describes the selection/rejection of the event by the outlet. nytimes washingtonpost For the bias analysis, they chose affiliation, ownership, and geographic proximity of the different outlets as the major factors. They suggest that each outlet follows its own latent preferences structure which facilitates the outlet to rank events. They also indiatimes suggested that events should be selected such that the selected list should be diverse and should include a wide range of actively reported events. They thus adopted the method of Maximum Marginal Relevance which facilitates ranking based on the rel- Figure 1: Distribution of event coverage by the outlets. evance and diversity of the events. It is discovered that event selection favors the most discussed topics rather than the unique ones. 4 MATERIALS AND METHODS F. Hamborg et al. [6] uses a matrix similar to the one created 4.1 Problem Modeling by Bourgeois et al.[3] Each cell in the matrix represent the most For an event 𝐸 and its associated pair (𝑇 , 𝑆 ), the task is to generate representative topic of the article reported by one country about a list of outlets 𝑂 expected to cover 𝐸 . Here 𝑇 is the event title the other. By spanning the matrix through outlets and topics in and 𝑆 is a short summary of the event as provided by the Event a region, the bias can be examined. They used a collection of 1.6 Registry. Mathematically, the task can be formulated as, million articles from more than 100 countries over a two-month 1 span from the Europe Media Monitor (EMM) as their dataset. 𝑂 = 𝑓 (𝑇 , 𝑆, 𝛼 ) (1) Authors in [6] aggregates the related articles and then out-where, 𝑓 is the outlet prediction function and 𝛼 denotes the source the task of bias identification to the users, forcing them model parameters. 𝑂 can have a well-thought-out variable length 𝑙 to determine the bias on their own. 
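The stratified train-valid-test split mentioned above can be reproduced along the following lines; the field names and the 60-20-20 ratio are assumptions made for this sketch, since the paper only states that the split mirrors the imbalanced coverage distribution.

from sklearn.model_selection import train_test_split

def stratified_event_split(events, seed=0):
    # `events` is assumed to be a list of dicts with the fields "title",
    # "summary" and "outlets" (the outlets covering the event); the outlet
    # combination serves as the stratification key so that the train,
    # validation and test sets mimic the imbalanced coverage distribution.
    keys = ["|".join(sorted(e["outlets"])) for e in events]
    train, rest, _, rest_keys = train_test_split(
        events, keys, test_size=0.4, stratify=keys, random_state=seed)
    valid, test = train_test_split(
        rest, test_size=0.5, stratify=rest_keys, random_state=seed)
    return train, valid, test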
While the rest of the existing response generated from the list unique outlets 𝑂 . For this work, 𝑙 work analyzes the selection bias, it certainly does not present an |𝑂 | = 3. automated approach suited to the outlet prediction task, unlike our work. 4.2 Methodology We extract feature vectors from 𝑇 and 𝑆 . We fuse them together to 3 DATA DESCRIPTION create a fused vector which is then passed through several layers to finally generate 𝑂 . Figure 2 illustrates the entire prediction 3.1 Raw Data Source process. We further outline these tasks with more details in the Event Registry2 [7] monitors, collects, and provides news arti-following subsections. cles from news outlets around the world. It also aggregates them 4.2.1 Feature Extraction and Fusion. We used Google’s Univer- into clusters that are referred to as events. Each event is then sal Sentence Encoder 4(USE) to extract 128-dimensional feature annotated with several metadata such as unique id to track the ′ ′ ′ ′ vectors 𝑇 and 𝑆 . For feature fusion, we concatenated 𝑇 and 𝑆 event coverage, categories to which it may belong, geographical and applied 𝑡 𝑎𝑛ℎ activation to generate 𝐹 . We then used batch- location, sentiment, etc. As a result, its large-scale temporal cov- normalization to increase the stability of the network and for erage can be used effectively to study the event selection process regularization. of news outlets. ′ ′ 𝐹 = 𝐵𝑁 (𝑡 𝑎𝑛ℎ (𝑇 ⊕ 𝑆 )) (2) In Eq 2, 𝐵 𝑁 and ⊕ represents batch-normalization and concatenation respectively. 1 3 https://ec.europa.eu/knowledge4policy/ https://www.alexa.com/topsites/category/Top/News/Newspapers 2 4 https://eventregistry.org https://tfhub.dev/google/universal- sentence- encoder/ 34 A Machine Learning based approach to outlet prediction Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia USE T Event Title T′ tanh F softmax Batch Norm FC Outlet (Ô) Event S S′ Summary USE Figure 2: Outlet prediction process. 4.2.2 Outlet Prediction. Table 1: Multiple correct predictions. We solve the problem using a multi-label classification model for which we create a separate outlet-index dictionary for outlets 𝐷 = {𝑜 : 1 : 2 : 1 , 𝑜 2 . . . 𝑜 𝑛 }, where 𝑛 𝑛 indiatimes nytimes washingtonpost 𝑙 is the total number of unique outlets in 𝑂 . To predict the list indiatimes washingtonpost nytimes of outlets we pass 𝐹 to the fully-connected layer (FC) having 𝑠𝑜 𝑓 𝑡𝑚𝑎𝑥 activation with 𝑛 output neurons. Since an event can be covered by more than one outlet, we formulate the recursive • Subset Accuracy (𝑎): It measures the percentage of in- prediction procedure as, stances in which all of the outlets are correctly classified. ˆ 𝑜 = P (𝑜 |𝐹 , ˆ 𝑜 + 𝑏 ) (3) 𝑁 𝑖 𝑖 −1, 𝑏 ) = 𝑠𝑜 𝑓 𝑡𝑚𝑎𝑥 (𝐹 𝑤𝑖 𝑖 1 Õ Subset Accuracy (𝑎) = ( ˆ 𝑜 − 𝑜 ) (6) 𝐹 𝑤 +𝑏 𝑖 𝑖 𝑒 𝑖 𝑖 𝑁 = (4) 𝑖 =1 Í𝑛 𝐹 𝑤 +𝑏 𝑒 𝑗 𝑗 𝑗 =1 • Hamming Loss (ℓ): It measures the fraction of the incor- 𝑡 ℎ rectly predicted outlet to the total number of outlets. Since where, ˆ 𝑜 is the probability of selecting the 𝑖 outlet (𝑜 ) given 𝐹 , 𝑖 it is a loss function, its ideal value is 0. bias (𝑏 ), and the set of probabilities of previously predicted outlets ( ˆ 𝑜 ), and 𝑤 is the weight. 
We use categorical cross entropy as 𝑁 𝑖 −1 1 Õ ∩ ˆ 𝑜 𝑜 𝑖 𝑖 the loss function as follows: Hamming Loss (ℓ ) = (7) 𝑁 ˆ 𝑜 ∪ 𝑜 𝑖 𝑖 𝑛 𝑥 𝑖 =1 Õ Õ L (𝑜, ˆ 𝑜 ) = − (𝑜 ∗ log( ˆ 𝑜 )) (5) 𝑖 𝑗 𝑖 𝑗 5.3 Results and Analysis 𝑗 =1 𝑖 =1 Table 2 shows the comparison of our model with the baseline 𝑡 ℎ In Eq (5), for 𝑖 outlet in the output sequence of length 𝑥 , 𝑜𝑖 𝑗 models in terms of subset accuracy and hamming loss. and ˆ 𝑜 denotes the actual and predicted probability of selecting 𝑖 𝑗 𝑡 ℎ the 𝑗 outlet from 𝐷 . Table 2: Comparison between the baseline models and our 4.2.3 Hyper-parameters. 5 We used Categorical accuracy as the proposed model. metrics to calculate the mean accuracy rate for multilabel classi- fication problems across all the predictions. We consider a batch Subset Accuracy Hamming Loss of size 128 and number of epocs as 100 for training. To optimize Uniform 0.140 0.526 the weights during training we use Adam optimizer. Stratified 0.286 0.422 5 EXPERIMENTAL EVALUATION Ours 0.546 0.275 5.1 Baselines Quantitative analysis of the experimental results shows that, We use the following well-known and simplified methods as our our model outperforms the Uniform and Stratified models by a baseline models. margin of 0.41 and 0.26 points for subset accuracy and by 0.25 • Uniform: Generate predictions randomly using a uniform and 0.15 points for hamming loss respectively. The performance distribution. difference is clearly visible in Figure 3. • Stratified: Generates predictions by respecting the class distribution of the training set. The intersection that we find among the different outlet pairs differs considerably as evident in Figure 1. This can be best seen 5.2 Evaluation Metric by assessing the conditional probability of an event covered by an We aim to predict the list of outlets in this work. However, it is outlet given that it is covered by another outlet as listed in Table 3. not necessary to predict the sequence in which outlets appear on For example, we can note that the 𝑃 (𝑤 𝑎𝑠ℎ𝑖𝑛𝑔𝑡𝑜𝑛 |𝑛𝑦𝑡𝑖𝑚𝑒𝑠 ) = this list. This is explained with an example given in Table 1. In 0.492 which is quite high and indicates that 𝑤 𝑎𝑠ℎ𝑖𝑛𝑔𝑡 𝑜𝑛𝑝𝑜𝑠𝑡 tends other cases, a combination of correct and incorrect outlets may to cover most of the events covered by 𝑛𝑦𝑡 𝑖𝑚𝑒𝑠 . It is also inter-be predicted by the model. esting to note that 𝑖𝑛𝑑𝑖𝑎𝑡 𝑖𝑚𝑒𝑠 do not follow 𝑤 𝑎𝑠ℎ𝑖𝑛𝑔𝑡 𝑜𝑛𝑝𝑜𝑠𝑡 or 𝑛𝑦𝑡 𝑖𝑚𝑒𝑠 , and vice versa. We used the following metrics to evaluate the effectiveness of our model where, ˆ 𝑜 is the predicted outlet, 𝑜 is the true outlet, 6 CONCLUSIONS AND FUTURE WORK and 𝑁 is the total number of instances. It is important for a journalist to know which event is worthy 5 https://github.com/keras-team/keras/blob/master/keras/metrics.py enough to be published. Even readers would be interested to know 35 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Swati and Dunja Mladenić Table 3: Conditional probability of an event to be covered by an outlet, provided it is covered by another outlet. P(x|y) nytimes indiatimes washingtonpost nytimes 1.000 0.067 0.364 indiatimes 0.034 1.000 0.023 washingtonpost 0.492 0.063 1.000 [3] Dylan Bourgeois, Jérémie Rappaz, and Karl Aberer. 2018. Selection bias in news coverage: learning it, fighting it. In Companion Proceedings of the The Web Conference 2018, 535–543. [4] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. 
In Advances in Neural Information Processing Systems, 13042–13054. [5] Zihao Fu. 2019. An introduction of deep learning based word representation applied to natural language process- ing. In 2019 International Conference on Machine Learning, Figure 3: Comparison between the baseline models and Big Data and Business Intelligence (MLBDBI). IEEE, 92–104. our proposed model. [6] Felix Hamborg, Norman Meuschke, and Bela Gipp. 2018. Bias-aware news analysis using matrix-based news aggre- gation, 1–19. the outlets that are going to cover the event of their interest. Yet [7] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Gro- it is certainly not an automated approach, therefore in this work, belnik. 2014. Event registry: learning about world events we propose an approach to address the outlet prediction task from news. In Proceedings of the 23rd International Confer- given the event title and description. We also find that even in its ence on World Wide Web, 107–110. simplest form, our model is capable of predicting the outlet. In [8] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, the future, we intend to enhance our proposed model to better and Lars Schmidt-Thieme. 2009. Bpr: bayesian personal- predict the outlets and to work in a cross-lingual setting. We ized ranking from implicit feedback. In Proceedings of the plan to include a few more metadata provided by Event Registry Twenty-Fifth Conference on Uncertainty in Artificial Intelli- (refer Section 3.1) along with Wikipedia concepts. We also plan gence (UAI ’09). AUAI Press, Montreal, Quebec, Canada, to analyze the speed of reporting, time-span, and importance 452–461. isbn: 9780974903958. given to the events by the outlets. In addition, we will also be [9] Sebastian Ruder. 2019. Neural transfer learning for natural looking into how the outlets change their coverage style over language processing. PhD thesis. NUI Galway. time. [10] Sebastian Ruder, Matthew E Peters, Swabha Swayamdipta, ACKNOWLEDGMENTS and Thomas Wolf. 2019. Transfer learning in natural lan- guage processing. In Proceedings of the 2019 Conference of This work was supported by the Slovenian Research Agency and the North American Chapter of the Association for Compu- the European Union’s Horizon 2020 research and innovation tational Linguistics: Tutorials, 15–18. program under the Marie Skłodowska-Curie grant agreement No [11] Diego Saez-Trumper, Carlos Castillo, and Mounia Lalmas. 812997. 2013. Social media news communities: gatekeeping, cov- REFERENCES erage, and statement bias. In Proceedings of the 22nd ACM international conference on Information & Knowledge Man- [1] Brent H Baker, Tim Graham, and Steve Kaminsky. 1994. agement, 1679–1684. How to identify, expose & correct liberal media bias. [12] Rune J Sørensen. 2019. The impact of state television on [2] Matthew Barnidge, Albert C Gunther, Jinha Kim, Yang- voter turnout. British Journal of Political Science, 257–278. sun Hong, Mallory Perryman, Swee Kiat Tay, and Sandra Knisely. 2020. Politically motivated selective exposure and perceived media bias, 82–103. 
36 MultiCOMET – Multilingual Commonsense Description Adrian Mladenic Grobelnik Dunja Mladenic Marko Grobelnik Artificial Intelligence Laboratory Artificial Intelligence Laboratory Artificial Intelligence Laboratory Jozef Stefan Institute Jozef Stefan Institute Jozef Stefan Institute Ljubljana Slovenia Ljubljana Slovenia Ljubljana Slovenia adrian.m.grobelnik@ijs.si dunja.mladenic@ijs.si marko.grobelnik@ijs.si ABSTRACT The main contributions of this paper are (1) a new multilingual approach to annotating natural language sentences with This paper presents an approach to generating multilingual commonsense descriptors, (2) implementation of the proposed commonsense descriptions of sentences provided in natural language. We have expanded on an existing approach to automatic approach that is made publicly available as an online service knowledge base construction in English to work on different MultiCOMET http://multicomet.ijs.si/ (illustrated in Figure 4), (3) languages. The proposed approach has been utilized to develop evaluation of the proposed approach on the Slovenian language. An MultiCOMET, a publicly available online service for generating additional contribution is the publicly available source code [3] multilingual commonsense descriptions. Our experimental results allowing users to train their own models for other natural show that the proposed approach is suitable for generating languages. commonsense description for natural languages with Latin script. Comparing performance on Slovenian sentences to the English The rest of this paper is organized as follows: Section 2 provides a original, we have achieved precision as high as 0.7 for certain types data description. Section 3 describes the problem and the algorithm of descriptors. used. Section 4 exhibits our experimental results. The paper concludes with discussion and directions for the future work in CCS CONCEPTS Section 5. •CCS Information systems Information retrieval Document 2 Data Description representation Content analysis and feature selection KEYWORDS One might say the only way for AI to learn to perform deep learning, commonsense reasoning, multilingual natural commonsense reasoning, is to learn from humans. Following the approach proposed by COMET [1], we used data from the language processing ATOMIC [2] dataset. The ATOMIC dataset consists of over 24,000 sentences containing common phrases manually labelled by 1 Introduction workers on Amazon Turk. For each sentence the workers were As artificial intelligence systems are becoming better at performing asked to assign open-text values to nine descriptors which capture highly specialized tasks, sometimes outperforming humans, they nine if-then relation types to distinguish causes vs. effects, agents are unable to understand a simple children’s fairy tale due to their vs. themes, voluntary vs. involuntary events and actions vs. mental inability to make commonsense inferences from simple events. states [2] as described in ATOMIC. With recent breakthroughs in the area of deep learning and overall The following are the nine descriptors and their explanations: increases in computing power, it has enabled us to model xIntent – Because PersonX wanted… commonsense inferences with deep learning models. In our research, we expand on the approach to automatic generation of xNeed – Before, PersonX needed… commonsense descriptors proposed in COMET [1] by applying their deep learning models to languages other than English. 
xAttr – PersonX is seen as…
xReact – As a result, PersonX feels…
xWant – As a result, PersonX wants…
xEffect – PersonX then…
oReact – As a result, others feel…
oWant – As a result, others want…
oEffect – Others then…

The dataset contains almost 300,000 unique descriptor values for the listed nine descriptors. An example of a labelled sentence is shown in Figure 3.

In order to test the proposed approach, we implemented it for the Slovene language. We have translated the sentences from the ATOMIC dataset to Slovene, keeping the descriptor values in English. The translation was done using Google Cloud's Translation API [4].

The approach presented in COMET tackles automatic commonsense completion with the development of generative models of commonsense knowledge, and commonsense transformers that learn to generate diverse commonsense descriptions in natural language [1]. Our research hypothesis is that the approach proposed by COMET [1] can be expanded to Latin-script languages other than English. To test this claim, we have trained our own deep learning model on the original training data, and another model on the data translated into another natural language.

3 Problem Description and Algorithm

The problem we are solving is predicting the most likely values for each tag in the ATOMIC [2] dataset, given an input sentence in a Latin-script language. Following the proposal in COMET, we address the following problem: given a training knowledge base of natural language tuples in the {s, r, d} format, where s is the sentence, r is the relation type and d represents the relation value, the task is to generate d given s and r as inputs.

Figure 1 depicts our approach to solving this problem. The system takes labelled sentences as input, translates them to the targeted Latin-script language, and trains a deep learning model capable of labelling previously unseen sentences with values for the nine descriptors capturing the nine predefined relation types described in Section 2.

Figure 1: Architecture of the proposed approach
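For illustration, the following minimal Python sketch (not the released MultiCOMET code) shows two auxiliary steps assumed by the pipeline above: translating ATOMIC sentences with Google Cloud's Translation API (Basic) while keeping the descriptor values in English, and the strict top-5 overlap that is used as the precision measure in Section 4. The function names and the example predictions are hypothetical.

# Minimal sketch of the dataset translation step and the strict precision@5 comparison.
from google.cloud import translate_v2 as translate

def translate_sentences(sentences, target="sl"):
    """Translate English ATOMIC sentences; descriptor values are left in English."""
    client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set
    return [client.translate(s, target_language=target)["translatedText"]
            for s in sentences]

def precision_at_5(english_preds, slovene_preds):
    """Strict overlap: a Slovene-model prediction counts only if it matches an
    English-model prediction exactly (near-synonyms do not count)."""
    top_en = set(english_preds[:5])
    top_sl = slovene_preds[:5]
    return sum(1 for p in top_sl if p in top_en) / 5.0

# Hypothetical predictions for one descriptor of one test sentence:
print(precision_at_5(["happy", "satisfied", "proud", "nervous", "calm"],
                     ["happy", "angry", "satisfied", "tired", "scared"]))  # -> 0.4

In the hypothetical example, two of the five Slovene-model predictions match the English top 5 exactly, giving a precision@5 of 0.4; paraphrases are deliberately not counted, which is the strictness referred to in Section 4.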
4 Experimental Results

Prior to training the model, we split the ATOMIC dataset into train, test and development sets identical to those used in COMET [1]. In our evaluation we used 100 sentences from the test set.

Our deep learning models are trained on the ATOMIC [2] dataset. We have trained one model on the original dataset in English, and another model on the dataset automatically translated into Slovene. Both models were trained under the same parameter settings: batch size = 6, iterations = 50,000, maximum number of input features = 50.

To evaluate the performance of the proposed approach, we compared the predictions of the model trained on Slovene sentences with the predictions of the English model. As the performance metric, we took the top 5 predicted values for each descriptor and checked their overlap. By taking the English predictions as the ground truth, we measure the precision of our model by the number of identical descriptor values. Note that we were strict in our comparisons; for instance, "to stay away from people" and "to get away from others" do not count as an overlap.

Experimental results show there is a considerable difference in performance between the nine descriptors. The best performing descriptor was xReact, where precision@5 was 0.716, followed by oReact and oWant with precisions@5 of 0.706 and 0.468, respectively. The worst performing descriptor was xWant, with a precision@5 of 0.210 (see Table 1).

Descriptor    Precision
xIntent       0.324
xNeed         0.352
xAttr         0.438
xReact        0.716
xWant         0.210
xEffect       0.456
oReact        0.706
oWant         0.468
oEffect       0.310
Average       0.442

Table 1: Experimental results on the nine descriptors, showing precision of the top 5 predictions.

The best performing descriptor was xReact (representing the relation: As a result, PersonX feels). This was likely due to the fact that most predicted values were only one word long for both models, making it considerably easier for their predictions to overlap. The worst performing descriptor was xWant (representing the relation: As a result, PersonX wants); this could be attributed to the fact that most predicted values were at least 3–4 words in length, greatly decreasing the likelihood of overlap. Another reason for such low precision could be our strict overlap comparisons.

Table 2 shows the predicted values of one of the worst performing sentences for the xReact descriptor. Note the sentence "PersonX looks PersonY ___ in the face" can refer to "Bob looks Mary slowly in the face" or "Adrian looks Anna kindly in the face" or something else. The columns in Table 2 and Table 3 labelled "Original" show the original English sentence and its predicted descriptor values. The columns labelled "Translated/Predicted" show the sentence translated into Slovene and its predicted descriptor values.

                 Original                                Translated/Predicted
Sentence         PersonX looks PersonY ___ in the face   PersonX izgleda PersonY ___ v obraz
xReact values    nervous                                 satisfied
                 happy                                   happy
                 satisfied                               attractive
                 powerful                                proud
                 confident                               angry

Table 2: One of the worst performing test sentences for xReact

Table 3 shows the predicted values of one of the worst performing sentences for the xWant descriptor. We can see that there are no common predictions between the two models. Note the sentence "PersonX avoids every ___" can refer to "Marko avoids every car on the road" or "Dunja avoids every boring event" or something else.

                 Original                    Translated/Predicted
Sentence         PersonX avoids every ___    PersonX se izogiba vsakemu ___
xWant values     to stay away from people    to get away from others
                 to avoid trouble            to make sure they are ok
                 to stay away                to get away from the situation
                 to not get caught           to be alone
                 to not be noticed           to make a decision

Table 3: One of the worst performing test sentences for xWant

While Tables 2 and 3 show the model's outputs for a single descriptor, Figure 3 shows the full output of the model, given the example sentence "Mojca je pojedla odličen sendvič" (Mary ate an excellent sandwich). Figure 2 shows a close-up of the output of Figure 3. The images in Figures 2 and 3 were taken directly from the interface of our online service MultiCOMET [5].

Figure 3: Full tree of predicted descriptor values generated for an example Slovene sentence

For the sentence "Mojca je pojedla odličen sendvič" (Mary ate an excellent sandwich) depicted in Figures 2 and 3, here is a potential English interpretation of the Slovenian output of the model: Mary was hungry (xAttr) and wanted to eat food (xIntent). To do that, she needed to go to the restaurant (xNeed). At the restaurant, other people were also eating food (oEffect). As a consequence of eating the sandwich, Mary's clothes got dirty (xEffect). Mary feels impressed (xReact) and wants to eat something else (xWant).
The restaurant is grateful (oReact) for Mary’s visit and wants to thank Mary (oWant). The MultiCOMET online service is a publicly available implementation of our proposed approach, shown in Figure 4. At the time of writing, MultiCOMET only supports English and Slovene. Figure 2: Close-up of predicted descriptor values generated for an example Slovene sentence 39 Figure 4: Illustrative example of MultiCOMET after submitting a query “Mary ate a wonderful sandwich.” 5 After testing the proposed multilingual approach on the Slovene Discussion language, we intend to expand our coverage to other Latin script In our research we expanded on an existing monolingual languages including Croatian, Italian and French. approach and proposed a new approach to generating multilingual commonsense descriptions from natural language. ACKNOWLEDGMENTS In order to implement our approach, we built on an existing The research described in this paper was supported by the library, implementing the approach proposed by COMET [1]. Slovenian research agency under the project J2-1736 Causalify Our experimental results show that we are getting meaningful and co-financed by the Republic of Slovenia and the European values for the descriptors. Experimental comparison of the Union under the European Regional Development Fund. The predicted descriptor values of the Slovene and English models operation is carried out under the Operational Programme for the show an average precision of 0.44, given our strict comparison Implementation of the EU Cohesion Policy 2014–2020. methodology. We noted the precision values ranged from 0.716 to 0.210 across different descriptors. REFERENCES Based on our literature review (September 2020), none of the [1] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli articles citing the original COMET [1] paper expanded their Celikyilmaz, Yejin Choi. (2019). COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. Allen Institute for Artificial approach to include other languages. The most similar work we Intelligence, Seattle, WA, USA. Paul G. Allen School of Computer Science found in the literature combining commonsense and & Engineering, Seattle, WA, USA. Microsoft Research, Redmond, WA, USA. [2] Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas multilinguality was [6] where the authors were extending the Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, Yejin Choi. (2019). SemEval Task 4 solution using machine translation. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. Paul G. Allen School of Computer Science & Engineering, University of The possible direction for future work includes improving the Washington, Seattle, USA. Allen Institute for Artificial Intelligence, Seattle, USA. quality of the translated sentences from ATOMIC by manual [3] MultiCOMET GitHub https://github.com/AMGrobelnik/MultiCOMET translation to improve the precision of the models. Another Accessed 31.08.2020 possible direction would be to evaluate the performance of our [4] Google Cloud’s Translation API Basic https://cloud.google.com/translate Accessed 31.08.2020 models on a larger number of sentences to increase the reliability [5] MultiCOMET http://multicomet.ijs.si/ Accessed 31.08.2020 of the results. [6] Josef Jon, Martin Fajcik, Martin Docekal, Pavel Smrz. (2020). BUT-FIT at SemEval-2020 Task 4: Multilingual commonsense. arXiv. 
https://arxiv.org/pdf/2008.07259.pdf

A Slovenian Retweet Network 2018-2020

Bojan Evkoski
Jožef Stefan International Postgraduate School and Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
Bojan.Evkoski@ijs.si

Igor Mozetič, Nikola Ljubešić & Petra Kralj Novak
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
As the popularity of social media has been growing steadily since the beginning of their era, the use of data from these platforms to analyze social phenomena is becoming more and more reliable. In this paper, we use tweets posted over a period of two years (2018-2020) to analyze the socio-political environment in Slovenia. We use network analysis by applying community detection and influence identification on the retweet network, as well as content analysis of tweets by using hashtags and URLs. Our study shows that Slovenian Twitter users are mainly grouped in three major socio-political communities: Left, Center and Right. Although the Left community is the most numerous, the most influential users belong to the Right and Center communities. Finally, we show that different communities prefer different online media to inform themselves, and that they also prioritize topics differently.

Keywords
Complex networks, Twitter, community detection, influencers

1. INTRODUCTION
Since the rise of social networks, their data has been extensively used in social analysis. As the popularity of these platforms continues to grow daily, using them as a proxy to analyze specific phenomena is becoming more and more reliable. Their popularity, accessibility and availability have made them the go-to way to share one's opinion, support another and even get in conflict with an opposing one. Recently, with the advances in targeted advertising, social media became the most important cultural and political battlefront.

In this paper, the country of interest is Slovenia and the proxy is Twitter data. By following the methodology developed in [3, 2, 4, 8], we address the following questions:

• Are there groups of densely connected Twitter users in the Slovenian retweet network 2018-2020?
• Who are the leading influencers in these groups?
• What is the content of the tweets in these groups and how much does it overlap?

This paper is organised as follows. In Section 2, the data acquisition process and the collected Twitter data are presented. Section 3 discusses the communities in the retweet network and their properties. Section 4 covers the notion of influencers and identifies the main influencers in the Slovenian retweet network. Section 5 investigates the content of the tweets in terms of hashtags and URLs. We draw conclusions in Section 6.

2. DATA
We acquired 5,147,970 tweets in the period from January 2018 to January 2020 with the TweetCat tool [6], built specifically for collecting Twitter data written in "smaller" languages. The tool identifies users tweeting in the focus language by searching for the most common words in that language through the Twitter Search API, and collects these users' tweets through the whole data collection period. On average, the dataset contains around 8,000 tweets per day, with the three highest volume peaks on March 13, 2018 (11,556 tweets, the resignation of Slovenia's PM, Miro Cerar), June 1, 2018 (13,506 tweets, the last day of the 2018 Slovenian parliamentary elections campaign), and May 9, 2019 (12,381 tweets, the Eurovision semi-final in which Slovenia had a successful run). The variation of the daily volume of tweets is affected by many phenomena, but the most evident are: a weekly seasonality with high volumes on working days and low volumes on weekends, extraordinary periods for the country (e.g. the 2018 Slovenian parliamentary elections campaign, boosting average daily tweets by around 2,000), and holidays (e.g. the 2018 and 2019 Easters as local minima with 5,174 and 4,887 tweets, respectively).
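As an illustration of the volume statistics above, the daily counts and the weekly seasonality can be obtained from the collected tweets with a few lines of pandas; the tiny stand-in data and column name below are assumptions, not the actual TweetCat output format.

import pandas as pd

# Hypothetical stand-in for the collected tweets: one row per tweet with its timestamp.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2018-03-13 10:00", "2018-03-13 11:30", "2018-03-14 09:15", "2018-06-01 20:00"]
    )
})

# Number of tweets per calendar day.
daily = tweets.set_index("created_at").resample("D").size()
print("average tweets per day:", daily.mean())
print("top volume days:")
print(daily.sort_values(ascending=False).head(3))

# Weekly seasonality: average daily volume per weekday (0 = Monday ... 6 = Sunday).
print(daily.groupby(daily.index.dayofweek).mean())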
3. COMMUNITY DETECTION
We used the collected tweets to construct a retweet network for the purpose of community detection. A retweet network is a directed weighted graph, where nodes represent Twitter users and edges represent the retweet relations. An edge from node (user) A to node B exists if B retweeted A at least once, indicating that the information spread from A to B, or that A influenced B. Note that retweeting a retweet is actually retweeting the original tweet (source), thus ignoring all intermediate retweets. The weight of an edge is the number of times user B retweeted user A. We removed all self-retweets, since they did not provide additional information for community and influence detection. Consequently, we formed a network with 10,876 users (94% of all users) and 1,576,792 retweets (92% of all retweets).

This network can be simplified if the direction of the edges is ignored, meaning that two users are linked if one retweets the other, while the source and destination are irrelevant. It turns out that such undirected retweet graphs between Twitter users are useful to detect communities of like-minded users who typically share common views on specific topics.

Figure 1: The Slovenian retweet network (2018-2020) colored according to the detected communities, with shares of the total number of users. The label size of a node corresponds to the number of unique users that retweeted it. Only nodes with at least 700 unique retweeters are included.

In complex networks, a community is defined as a subset of nodes that are more closely connected to each other than to other nodes. For the purpose of this paper, we apply a standard algorithm for community detection, the Louvain method [1]. The method partitions the nodes into communities by maximizing modularity, which measures the difference between the actual fraction of edges within the community and the fraction expected in a randomized graph with the same degree sequence [7]. Modularity values range from −0.5 to 1.0, where a value of 0.0 indicates that the edges are randomly distributed, and larger values indicate a higher community density.

We ran the Louvain method (resolution = 1.05) on our undirected retweet network, resulting in 183 communities with a modularity value of 0.382, which indicates a strong connectedness within communities.
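A minimal sketch of the construction and community detection steps described above, using networkx and the python-louvain package; the input format and variable names are assumptions, and the snippet is illustrative rather than the code used for the paper.

import networkx as nx
import community as community_louvain  # python-louvain package

def build_retweet_graph(retweets):
    """retweets: iterable of (retweeter, original_author) pairs."""
    g = nx.DiGraph()
    for retweeter, author in retweets:
        if retweeter == author:          # drop self-retweets
            continue
        # Edge A -> B means that B retweeted A; the weight counts the retweets.
        if g.has_edge(author, retweeter):
            g[author][retweeter]["weight"] += 1
        else:
            g.add_edge(author, retweeter, weight=1)
    return g

# Tiny hypothetical input; in the paper the pairs come from roughly 1.6M retweets.
pairs = [("userB", "userA"), ("userB", "userA"), ("userC", "userA"), ("userC", "userB")]
g = build_retweet_graph(pairs)

# Direction is ignored for community detection (this simple sketch does not merge
# the weights of reciprocal edges).
u = g.to_undirected()
partition = community_louvain.best_partition(u, weight="weight", resolution=1.05)
print("communities:", len(set(partition.values())))
print("modularity:", community_louvain.modularity(partition, u, weight="weight"))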
Only the three largest communities each have more than 5% of all users, while combined they contain 85% of all users. The three main detected communities are presented in Fig. 1. We observe the following:

• The three largest communities are labeled as Left, Center and Right, with 55%, 20% and 10% as their respective shares of all users. The labeling of the communities does not necessarily represent their political orientation.
• The Left community, even though the largest, contains the smallest number of users with more than 700 unique retweeters.
• The Left community is well separated from the Center and the Right communities, which are more tightly interlinked.

We performed an exploratory data analysis and calculated the community properties presented in Table 1, to compare the communities. Most of the properties are normalized by the user to ease the comparison between communities.

• Nodes – unique users count
• Central user – user with most retweets
• Central user retweets – times the central user is retweeted
• Central user retweeters – unique users retweeting the central user
• HHI (n = 50) – Herfindahl–Hirschman index [9] measures the distribution of influence of the top n influential users. A higher value reflects community influence concentrated in only a few influential users, while a lower value indicates a more dispersed and balanced influence distribution.
• Edges in/node – edges remaining in the community per user (source and destination in the same community)
• Edges out/node – edges going out of the community per user (destination in a different community)
• Weighted edges in/node – weighted edges remaining in the community per user
• Weighted edges out/node – weighted edges going out of the community per user
• Out/In ratio – "Edges out" divided by "Edges in"
• Weighted out/in ratio – "Weighted edges out" divided by "Weighted edges in"

Table 1: Community properties
                            Left        Center       Right
Nodes                       7,030       1,223        2,519
Central user                vecer       BojanPozar   JJansaSDS
Central user retweets       10,398      31,432       50,688
Central user retweeters     973         1,325        1,242
HHI (n = 50)                0.031       0.066        0.042
Edges in/node               19.32       14.53        69.30
Edges out/node              4.47        37.11        13.19
Weighted edges in/node      52.91       83.68        308.33
Weighted edges out/node     6.95        119.42       36.14
Out/In ratio                0.23        2.55         0.19
Weighted Out/In ratio       0.13        1.43         0.12

4. INFLUENCERS
We use two simple but powerful metrics to detect influencers in the retweet network: the weighted out-degree and the Hirsch index (h-index) [5]. Both metrics are calculated from the number of retweets and are thus known as retweet influence metrics, indicating the ability of a user to post content of interest to others. Weighted out-degree is simply the total number of retweets of a particular user, while the h-index is an author-level bibliometric indicator that measures the scientific output of a scholar by quantifying both the number of publications (i.e., productivity) and the number of citations per publication.

Figure 2: Weighted out-degree (total retweets) and h-index comparison. Both charts include the top 25 most influential Slovenian Twitter users according to their respective metric. Bar colors represent the community of a user. Triangles point to users exclusive to one of the charts.

For domain URLs, we filtered the 2,297,008 tweets which contain a URL. Then, we extracted the domain part of the URLs and removed the domains with no specific meaning for Slovenia's content analysis (e.g. social networks: twitter.com, facebook.com, instagram.com, etc., and URL shorteners: ift.tt, bit.ly, ow.ly, etc.). This results in 512,308 tweets (approximately 22% of all the tweets with links). The most frequently occurring domains are owned by Slovenian media, with nova24tv.si, rtvslo.si and delo.si as the top three URL domains with 23,879, 20,210 and 17,360 occurrences, respectively. If instead of the total number of occurrences we count only the unique number of users which posted a domain URL, the top three domains are rtvslo.si, siol.net and delo.si with 2,802, 2,193 and 2,186 unique users, respectively.

For the hashtag analysis, we filtered only the tweets which contain a hashtag, ending up with 701,266 tweets.
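As a side note, the two influence-related quantities used above (the HHI reported in Table 1 and the retweet h-index introduced in Section 4) can be computed as in the following sketch. Taking the shares within the top-n most retweeted users is one possible reading of HHI (n = 50); the exact normalization is not spelled out in the text, so it is an assumption here.

def hhi(user_retweet_totals, n=50):
    """Herfindahl-Hirschman index over the retweet shares of the top-n users."""
    top = sorted(user_retweet_totals, reverse=True)[:n]
    total = sum(top)
    return sum((x / total) ** 2 for x in top) if total else 0.0

def h_index(retweets_per_tweet):
    """A user has h-index h if h of their tweets were each retweeted at least h times."""
    ranked = sorted(retweets_per_tweet, reverse=True)
    return max((min(rt, i) for i, rt in enumerate(ranked, start=1)), default=0)

# Hypothetical user whose tweets were retweeted 10, 6, 5, 4 and 1 times:
# h-index is 4, weighted out-degree (total retweets) is 26.
print(h_index([10, 6, 5, 4, 1]), sum([10, 6, 5, 4, 1]))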
The top three (i.e., citation impact). Adapted to a Twitter network, it hashtags are the following: #volitve2018 (the 2018 Slove- would be described as: a user with an index of h has posted nian parliamentary elections), #plts (the Slovenian First h tweets and each of them was retweeted at least h times. Football League) and #sdszate (Slovenian Democratic Party hashtag, meaning: SDS for you) with 9,845, 9,318 and 7,308 Let RT be the function indicating the number of retweets occurrences respectively. If we count only the unique num- for each original tweet. The values of RT are ordered in ber of users using a particular hashtag, the results for the decreasing order, from the largest to the lowest, while i in- top three Slovenian hashtags are as follows: #volitve2018 dicates the ranking position in the ordered list. The h-index with 2,473, #slovenija with 1,611 and #fakenews with 1,343 is then defined as follows: users. h-index(RT) = max min(RT(i), i) To see these results in the context of communities, we look at i the tweets authored by members of the three largest commu- The top 25 most influential users by weighted out-degree and nities, resulting in 84% of the tweets with relevant domain h-index are shown in Fig. 2. The two metrics provide fairly URLs and 83% of the tweets with relevant hashtags. We similar results (they differ only in 9 users). Both results summed the domain URL counts, while grouping them by confirm the already visible phenomena from the previous the community in which their user belongs. We applied the observations: The Right community has the most influential same procedure to the hashtags. Finally, we filtered the top users, while the Left community, even though the biggest, eight domain URLs and hashtags for each community and does not have nearly as popular users as the ones from the put them on a single Sankey diagram in Fig. 3. Even though other two communities. overlaps exist, the most popular hashtags and media very much differ from community to community, meaning that 5. CONTENT ANALYSIS all three main communities prioritize topics differently and We refer to content analysis in terms of getting knowledge they inform themselves via different media. from the text of the tweets. In this paper, we perform two kinds of content analysis: domain URLs and hashtags. 43 Figure 3: A Sankey diagram depicts the use of the eight most common hashtags (left-hand side) and URLs (right-hand side) by the three largest detected communities. 6. CONCLUSIONS Parliament: Roll-call votes and Twitter activities. PLoS In this paper we explored the Slovenian twitter network from ONE, 11(11):e0166586, 2016. January 2018 until January 2020. We applied community [3] D. Cherepnalkoski and I. Mozetič. Retweet networks of detection, identifying three main communities: Left, Center the European Parliament: Evaluation of the community and Right. We identified the most influential and the central structure. Applied Network Science, 1(1):2, 2016. users of each community by calculating the weighted out- [4] M. Grčar, D. Cherepnalkoski, I. Mozetič, and P. Kralj degree and the h-index of the nodes. We used the Herfind- Novak. Stance and influence of Twitter users regarding ahl–Hirschman index to estimate the distribution of influ- the Brexit referendum. Computational Social Networks, ence within the top communities in the network. Finally, by 4(1):6, 2017. analysis of hashtags and URL domains in tweets, we discov- [5] J. E. Hirsch. 
An index to quantify an individual’s ered the most popular topics for Slovenians as well as the scientific research output. Proceedings of the National most referred Slovenian media on Twitter. We showed that Academy of Sciences, pages 16569–16572, 2005. users from different communities prioritize different topics [6] N. Ljubešić, D. Fišer, and T. Erjavec. TweetCaT: a and use different media to inform themselves. tool for building Twitter corpora of smaller languages. In Proceedings of the Ninth International Conference on 7. ACKNOWLEDGMENTS Language Resources and Evaluation (LREC’14), pages The authors acknowledge financial support from the Slove- 2279–2283, Reykjavik, Iceland, May 2014. European nian Research Agency (research core funding no. P2-103 Language Resources Association (ELRA). and P6-0411), and the European Union’s Rights, Equality [7] M. E. J. Newman. Modularity and community and Citizenship Programme (2014-2020) project IMSyPP structure in networks. Proceedings of the National (Innovative Monitoring Systems and Prevention Policies of Academy of Sciences, 103(23):8577–8582, 2006. Online Hate Speech, grant no. 875263). [8] P. K. Novak, L. D. Amicis, and I. Mozetič. Impact investing market on twitter: influential users and 8. REFERENCES communities. Applied Network Science, 3(1):40, 2018. [1] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and [9] G. J. Werden. Using the Herfindahl–Hirschman index. E. Lefebvre. Fast unfolding of communities in large In L. Phlips, editor, Applied Industrial Economics, networks. Journal of Statistical Mechanics: Theory and number 2, pages 368–374. Cambridge University Press, Experiment, 2008(10):P10008, 2008. 1998. [2] D. Cherepnalkoski, A. Karpf, I. Mozetič, and M. Grčar. Cohesion and coalition formation in the European 44 Toward improved semantic annotation of food and nutrition data Lidija Jovanovska Panče Panov Jožef Stefan International Postgraduate School & Jožef Stefan Institute & Jožef Stefan Institute Jožef Stefan International Postgraduate School Ljubljana, Slovenia Ljubljana, Slovenia lidija.jovanovska@ijs.si pance.panov@ijs.si ABSTRACT repository without which there is a great difficulty in achieving This paper aims to provide a critical overview of the state-of-the- cross-cultural and expert consensus. 1 art vocabularies used for semantic annotation of databases and In this paper, we will briefly go through the fundamental datasets in the domain of food and nutrition. These vocabularies components of the Semantic Web technologies, as well as the are commonly used as a backbone for creating metadata that is standards for the development of high-level KOS (Section 2). Next, usually used in search. Furthermore, the paper aims to provide a we provide a critical overview of the most significant semantic summary of ICT technologies used for storing food and nutrition resources in the domain of food and nutrition (Section 3). Finally, datasets and searching digital repositories of such datasets. Fi-we present a proposal for the design and implementation of a nally, the results of the paper will provide a roadmap for moving broad ontology that would allow us to harmonize and integrate towards FAIR (findable, accessible, interoperable, and reusable) reference vocabularies and ontologies from different sub-areas food and nutrition datasets, which can then be used in various of food and nutrition (Section 4). AI tasks. 
2 BACKGROUND KEYWORDS The goal of the Semantic Web is to make Internet data machine- ontologies, semantic technologies, data mining, food and nutri- readable by enhancing web pages with semantic annotations. tion Linked data is built upon standard web technologies, also in- cluding semantic web technologies in its technology stack [11]. Resource Description Framework (RDF) allows the represen- 1 INTRODUCTION tation of relationships between entities using a simple subject- Today more than ever before in history, we live in an age of predicate-object format known as a triple. The triples form an information-driven science. Vast amounts of information are be- RDF database — called a triplestore — which can be populated ing produced daily as a result of new types of high-throughput with RDF facts about some domain of interest. RDF Schema technology in all walks of life. Consequently, the quantity of (RDFS) was developed immediately after the appearance of RDF available scientific information is becoming overwhelming and as a set of mechanisms for describing groups of related resources without its proper organization, we would not be able to maxi- and the relationships between them. Simple Protocol and RDF mize the knowledge we harvest from it. Namely, research groups Query Language (SPARQL) is the query language for querying carry out their research in different ways, with specific and pos- RDF triples stored in RDF triplestores. sibly incompatible terminologies, formats, and computer tech- The Web Ontology Language (OWL) is based on Descrip- nologies. To tackle these issues, researchers have developed high- tion Logics, a family of logics that are expressively weaker than level knowledge organization systems (KOS), such as ontologies, First Order Logic, but enjoy certain computational properties ad- which constitute the core of the semantic web stack. Throughout vantageous for purposes such as ontology-based reasoning and the years, an abundance of ontologies has been developed and data validation. Most of the ontologies used today are represented released, slowly expanding from the biomedical sciences to the in the OWL format. fields of information science, machine learning, as well as the All the semantic technologies operate on top of various KOS. A domain of food and nutrition science. KOS is intended to encompass all types of schemes for organizing There is an old, yet simple saying which goes: “You are what information and promoting knowledge management [7]. One you eat”. As the world becomes more globalized and food pro-example of a KOS is a thesaurus as a structured, normalized, and duction grows massively, it is becoming increasingly difficult to dynamic vocabulary designed to cover the terminology of a field track the farm-to-fork food path. In the last few decades, digital of specific knowledge. It is most commonly used for indexing technology has been profoundly affecting many health and eco- and retrieving information in a natural language in a system nomic aspects of food production, distribution, and consumption. of controlled terms. When looking at the expressiveness of a Issues regarding food safety, security, authenticity as well as con- KOS, a thesaurus is on the lower side of the scale. On the other flicts arising from biocultural trademark protection are issues side, ontologies enjoy greater expressiveness than thesauri due to that were further enhanced by the lack of a centralized food data the inclusion of description logics. 
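As a small illustration of the triple/triplestore/SPARQL terminology introduced above, the following Python sketch uses the rdflib library with a made-up namespace; the IRIs are hypothetical and do not come from any of the vocabularies discussed in the next section.

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/fns#")   # hypothetical namespace, not a real vocabulary
g = Graph()                                  # an in-memory triplestore

# Triples in subject-predicate-object form: a dataset annotated with a food concept.
g.add((EX.MilkDataset, RDF.type, EX.Dataset))
g.add((EX.MilkDataset, EX.about, EX.Milk))
g.add((EX.MilkDataset, EX.title, Literal("Isotopic measurements of milk samples")))

# A SPARQL query over the triplestore: which datasets are about milk?
q = """
PREFIX ex: <http://example.org/fns#>
SELECT ?d WHERE { ?d ex:about ex:Milk . }
"""
for row in g.query(q):
    print(row.d)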
Arp, Smith, and Spear define the term ontology as “A representation artifact, comprising a Permission to make digital or hard copies of part or all of this work for personal taxonomy as proper part, whose representations are intended to or classroom use is granted without fee provided that copies are not made or designate some combination of universals, defined classes, and distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this certain relations between them” [1]. work must be honored. For all other uses, contact the owner/author(s). Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia © 2020 Copyright held by the owner/author(s). 1https://www.nature.com/scitable/knowledge/library/food-safety-and-food- security-68168348/, accessed 22/04/2020 45 Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia Jovanovska and Panov The Open Biomedical Ontologies (OBO) Foundry applies the of more sophisticated ontologies, such as FoodOn. Even though key principles that ontologies should be open, orthogonal, instan- the OBO Foundry principles apply only to ontologies, we can tiated in a well-specified syntax, and designed to share a common use the more general ones as evaluation criteria for the LanguaL space of identifiers. Open means that the ontologies should be thesaurus. For instance, as previously mentioned, the thesaurus is available for use without any constraint or license and also recep- open, made available in an accepted concrete syntax, versioning tive to modifications proposed by the community. Orthogonal is ensured, textual definitions are available for all the terms and means that they ensure the additivity of annotations and compli- a sufficient amount of documentation is provided. ance with modular development. The proper and well-specified syntax is expected to support algorithmic processing and the FoodOn [4] is an open-source, comprehensive ontology com-common system of identifiers enables backward compatibility posed of term hierarchy facets that cover basic raw food source with legacy annotations as the ontologies evolve [17]. ingredients, process terms for packaging, cooking, and preser- The FAIR guiding principles for scientific data management vation, and different product type schemes under which food and stewardship were conceived to serve as guidelines for those products can be categorized. FoodOn is applicable in several use- who wish to enhance the reusability and invaluableness of their cases, such as personalized foods and health, foodborne pathogen data holdings [19]. The power of these principles lies in the fact surveillance and investigations, food traceability and food webs, that they are simple and minimalistic in design and as such can be and sustainability. FoodOn echoes most of LanguaL’s plant and adapted to various application scenarios. Findability ensures that animal part descriptors —– both anatomical (arm, organ, meat, a globally unique and persistent identifier is assigned to the data seed) and fluid (blood, milk) —– but reuses existing Uberon [12] and the metadata which describes the data. Accessibility ensures and Plant Ontology [10] term identifiers for them. Multiple com-that the data and the metadata can be retrieved by their identifier ponent foods are more challenging because LanguaL provides using a standardized communications protocol. Interoperability no facility for giving identifiers to such products. 
ensures that data, as well as metadata, use a formal, accessible, Building on top of this, FoodOn allows food product terms like and shared language for knowledge representation. Reusability lasagna noodle to be defined directly in the ontology, and allows ensures that data and metadata are accurately described, released them to reference component products through various relations with a clear and accessible license, have detailed provenance, and which do not exist in LanguaL, such as: "has ingredient", "has meet domain-relevant community standards. part", "composed primarily of". As a suggestion, these relations can all be represented with a single relation "has ingredient" and 3 CRITICAL OVERVIEW OF FOOD AND the quantity can be expressed explicitly when annotating the NUTRITION SEMANTIC RESOURCES objects. All of the ontology terms have unique identifiers and In this section, we provide a critical overview of the most relevant the ontology is accessible and can be searched via The European KOS in the field of food and nutrition. We start by describing Bioinformatics Institute (EMBL-EBI) and its Ontology Lookup LanguaL [8], a thesaurus that serves as a foundation for most of Service (OLS).3 The ontology itself is open-source and is a mem-the ontologies in this domain. We are more focused on analyzing ber of the OBO Foundry. It also includes the upper-level Basic ontologies which belong to different sub-spheres of the food and Formal Ontology (BFO) [1]. The adherence to BFO proves useful nutrition domain. Namely, FoodOn [4], as a more general food in the case of aligning ontologies covering different domains description ontology, ONS [18], relevant in the field of nutritional because they share the same top-level. studies and ISO-Food [6], relevant in the field of annotating isotopic data acquired from food samples. ONS [18] is the first systematic effort to provide a solid and extensible ontology framework for nutritional studies. ONS was built to fill the gap between the description of nutrition-based LanguaL [8] is a thesaurus used for describing, capturing, and retrieving data about food. Since 1996, it has been used to index prevention of disease and the understanding of the complex im- numerous European Union (EU) and US agency databases, among pact nutrition has on health. Its structure consists of 3334 terms which, the US Department of Agriculture (USDA) Nutrient Data- imported from already existing ontologies and 100 newly de- base for Standard Reference and 30 European Food Information fined terms. The usability of ONS was tested in two scenarios: Resource (EuroFIR) databases. Food ingredients are represented an observational study, which aims at developing novel and af- with indexing terms, preferably in the form of a noun or a phrase. fordable nutritious foods to optimize the diet and reduce the risk The thesaurus also includes precombined terms which are food of diet-related diseases among groups at risk of poverty, and product names to which facet terms have been assigned. There an intervention study represented by the impact of increasing are 4 main facets in LanguaL: A (Product Type), B (Food Source), doses of flavonoid-rich and flavonoid-poor fruit and vegetables C (Part of Plant or Animal), and E (Physical State, Shape, or Form). on cardiovascular risk factors in an “at risk” group study. 
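The suggestion above of using a single "has ingredient" relation and stating the quantity explicitly on the annotation could look roughly as follows in RDF; the IRIs and property names are hypothetical placeholders, not actual FoodOn or LanguaL identifiers.

from rdflib import BNode, Graph, Literal, Namespace, XSD

EX = Namespace("http://example.org/food#")   # hypothetical IRIs, not FoodOn terms
g = Graph()

# "lasagna has ingredient lasagna noodle", with the quantity made explicit on the
# annotation node instead of using separate relations such as "composed primarily of".
portion = BNode()
g.add((EX.Lasagna, EX.hasIngredient, portion))
g.add((portion, EX.ingredient, EX.LasagnaNoodle))
g.add((portion, EX.amountPercent, Literal(40, datatype=XSD.integer)))

print(g.serialize(format="turtle"))

Modelling the ingredient link as a node of its own is what allows the percentage, or any other quantity, to be attached to that particular product-ingredient pair rather than to either food term.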
Other food product description facets include chemical additive, The development of ONS followed FAIR principles and as a preservation or cooking process, packaging, and standard na- result, it has been published in the FAIR-sharing database.4 Be-tional and international upper-level product type schemes. fore defining new terms, the developers of ONS have ensured The LanguaL thesaurus complies with the FAIR guidelines. that they are not yet defined, with the use of the ONTOBEE web The completeness of LanguaL’s indexing is to a large extent service. Terms that were already defined were imported using the assured by the Langual Food Product Indexing (FPI) software, ontology reuse service — ONTOFOX [20]. In compliance with which verifies that all facets have been indexed for each food the OBO Foundry principles, the ONS has been developed to be in the list [8]. It is available online2 and can be queried using a interoperable with other ontologies, as it has been formalized food descriptor or synonym. Its interoperability and reusability are eminent as it represents a cornerstone in the development 3https://www.ebi.ac.uk/ols/ontologies/FoodOn, accessed 22/04/2020 2https://www.langual.org, accessed 22/04/2020 4https://fairsharing.org/bsg-s001068/, accessed 22/04/2020 46 Toward improved semantic annotation of food and nutrition data Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia using the latest OWL 2 Web Ontology Language and RDF speci- fications and edited using Protégé [13] and the Hermit reasoner for consistency checking. It is also accessible, under the Creative Commons license (CC BY 4.0), published on GitHub and at NCBO BioPortal. Moreover, this ensured the adoption of a well-defined and widely adopted structure for the top and mid-level classes and principally the adherence to BFO as upper-level ontology. ISO-Food is an ontology that was conceived to aid with the or- ganization, harmonization, and knowledge extraction of datasets containing information about isotopes, that represent variants of a particular chemical element which differ in neutron number. To develop this ontology a mixed approach was used, a combination of both expert knowledge-driven (bottom-up) and data-driven (top-down) methods. Its main classes include Isotope, Sample, Location, Measurement, Article. The main class Isotope is con- nected to the rest of the classes with respective relations. The Food and Nutrient classes are linked to the RICHFIELDS ontology [5]. The ontology was further applied in a study for describing isotopic data, to annotate a data sample that consists of isotopic measurements of milk and potato samples. The ISO-Food ontology can be accessed online via the Bio- Portal repository of biomedical ontologies.5 It reuses terms from several ontologies, such as the concept Unit from the Units of Measurements Ontology (UO), the classes Food and Component from the RICHFIELDS ontology [5], the class Document from Figure 1: Diagram representing the alignment of the pro-the Bibliographic Ontology (BIBO) [3]. posed ontology with the identified relevant upper-level and domain ontologies. 4 PROPOSAL Ontologies for data mining. To provide a suitable formalized representation of the outcomes of the research in the food and domain of food and nutrition (see Figure 1). In this way, we can nutrition domain, as well as to suggest new ways to extract knowl- also use the benefits of cross-domain reasoning. 
Since FoodOn, edge from the ever-abundant data produced in this field, we turn ONS, and OntoDM all use BFO as a main top-level ontology, they to ontologies that are used to formally represent the data analysis speak the same general language and are consequently, easier to process. More specifically, we focus on the align. OntoDM ontology, which provides a unified framework for representing data mining entities. It consists of three modular ontologies: Towards the FNS Harmony ontology. In the context of the OntoDM-core [15] which represents core data mining entities, such as datasets, H2020 project FNS Cloud6 (food, nutrition, security) the goal is to data mining tasks, algorithms, models and patterns, develop an infrastructure and services to exploit food, nutrition OntoDT [16] — a generic ontology of datatypes, and and security data (data, knowledge, tools – resources) for a range OntoDM-KDD [14] which describes the process of knowledge discovery. of purposes. To support the different functionalities required by The ontology defines top-level concepts in data mining and the cloud platform, we started with the development of the FNS- machine learning, such as data mining task, algorithm, and their Harmony (FNS-H). The application ontology would allow us to generalizations, which denote the outputs of applying an imple- harmonize and integrate the different reference vocabularies and mentation of an algorithm on a particular dataset. Starting with ontologies from different sub-areas of food and nutrition, as well these general concepts, OntoDM also defines the components of as ontologies representing the domain of data analysis. the algorithms, such as distance and kernel functions, and other features they may contain. From the input and output data per- Initial ontology development. The development of FNS-H, spective, in this ontology, there is a hierarchical representation which is intended to bridge the gap between the field of data of data, from general concepts such as dataset to more specific analysis and food and nutrition will be guided by common best concepts regarding its structure, such as the number of features, practice principles for ontology development. The aim is to max- their role in a given task, concluding with the datatype of each imize the reuse of available ontology resources and simultane- attribute. These properties of OntoDM provide a complete formal ously follow the Minimum Information to Reference an External representation of the data mining process from beginning to end. Ontology Term (MIREOT) principles [2]. In the first phase, we will integrate the FoodOn ontology and the ONS ontology with the OntoDM suite of ontologies. With this integration, we will Combining orthogonal domain ontologies. 
Our goal is to align the selected ontologies in the domain of food and nutrition be able to (1) define domain-specific data types for the domain with the OntoDM ontology of data mining to improve the se- of food and nutrition by extending OntoDT generic data types; mantic annotation of the food and nutrition domain datasets, as (2) define food and nutrition analysis pipelines for the domain well as to formally represent data analysis tasks performed in the of food and nutrition by extending OntoDM-core, and (3) define 5http://bioportal.bioontology.org/ontologies/ISO-FOOD, accessed 22/04/2020 6https://www.fns-cloud.eu/ 47 Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia Jovanovska and Panov food and nutrition knowledge discovery scenarios by extending [3] Bojana Dimić Surla, Milan Segedinac, and Dragan Ivanović. OntoDM-KDD ontology. 2012. A bibo ontology extension for evaluation of scien- The development of the ontology already started in a top- tific research results. In Proceedings of the Fifth Balkan down fashion, it is expressed in OWL2 and being developed using Conference in Informatics, 275–278. the Protégé ontology development tool. Aspiring to maximize [4] Damion M Dooley, Emma J Griffiths, and Gurinder S Gosal accessibility, the ontology will be available for access on a GitHub et al. 2018. Foodon: a harmonized food ontology to in- repository, 7 as well as via BioPortal. In the current stage of crease global food traceability, quality control and data development, an initial set of higher-level domain terms, data integration. npj Science of Food, 2, 1, 1–10. types, data formats, data provenance metadata, lists of external [5] Tome Eftimov, Gordana Ispirova, and Peter Korosec et al. ontologies and vocabularies were extracted from the literature 2018. The richfields framework for semantic interoperabil- and FNS-Cloud project documents. ity of food information across heterogenous information In the next steps, we will first align the extracted terms with systems. In KDIR, 313–320. the BFO ontology and then integrate them with domain terms [6] Tome Eftimov, Gordana Ispirova, and Doris Potočnik. 2019. from the domain ontologies based on BFO, such asFoodOn, and Iso-food ontology: a formal representation of the knowl- ONS, at the first instance, as well as with the OntoDM set of edge within the domain of isotopes for food science. Food ontologies. Other potentially relevant ontologies include the On- chemistry, 277, 382–390. tology for Biomedical Investigations (OBI), Ontology of Biologi- [7] Heather Hedden. 2016. The accidental taxonomist. Infor- cal and Clinical Statistics (OBSC), Ontology of Chemical Entities mation Today, Inc. of Biological Interest (ChEBI), Ontology of Statistical Methods [8] Jayne D Ireland and A Møller. 2010. Langual food descrip- (STATO), and others. To achieve integration of different ontolog- tion: a learning process. European journal of clinical nutri- ical resources, we will use the ROBOT tool [9] that supports the tion, 64, 3, S44–S48. automation of a large number of ontology development tasks and [9] Rebecca C Jackson, James P Balhoff, and Eric Douglass. helps developers to efficiently produce high-quality ontologies. 2019. Robot: a tool for automating ontology workflows. BMC bioinformatics, 20, 1, 407. 5 CONCLUSION [10] Pankaj Jaiswal, Shulamit Avraham, and Katica Ilic et al. In this paper, we provided an overview of the most relevant 2005. 
Plant ontology (po): a controlled vocabulary of plant knowledge organization systems in the domain of food and nu- structures and growth stages. Comparative and functional trition. We started with the LanguaL food thesaurus that served genomics, 6, 7-8, 388–397. as a foundation for the development of the more sophisticated [11] Brian Matthews. 2005. Semantic web technologies. E-learning, ontologies — FoodOn, used for a multi-faceted description of 6, 6, 8. various foods; ONS, used for observational and interventional [12] Christopher J Mungall, Carlo Torniai, and Georgios V Gk- nutrition studies; ISO-Food for the studies of isotopic data in outos et al. 2012. Uberon, an integrative multi-species foods. Next, we assessed the selected vocabularies with respect anatomy ontology. Genome biology, 13, 1, R5. to the FAIR principles and OBO Foundry guidelines for scien- [13] Mark A Musen. 2015. The protégé project: a look back and tific data management. All of the selected vocabularies showed a look forward. AI matters, 1, 4, 4–12. compliance with these accomplishment criteria, with only minor [14] Panče Panov, Larisa Soldatova, and Sašo Džeroski. 2013. suggestions for improvement provided from our side. Finally, in Ontodm-kdd: ontology for representing the knowledge our proposal, we lay down the foundations of a new ontology discovery process. In International Conference on Discovery which would connect data mining concepts in the domain of Science. Springer, 126–140. food and nutrition using domain ontologies (FoodOn, ONS) with [15] Panče Panov, Larisa Soldatova, and Sašo Džeroski. 2014. ontologies for datatypes, data mining, and knowledge discovery Ontology of core data mining entities. Data Mining and in databases (OntoDT, OntoDM-core, OntoDM-KDD). By doing Knowledge Discovery, 28, 5-6, 1222–1265. so, we can provide richer semantic annotation and discover new [16] Panče Panov, Larisa N Soldatova, and Sašo Džeroski. 2016. scenarios of harvesting knowledge from the food and nutrition Generic ontology of datatypes. Information Sciences, 329, data. 900–920. [17] Barry Smith, Michael Ashburner, and Cornelius Rosse ACKNOWLEDGMENTS et al. 2007. The obo foundry: coordinated evolution of This work was supported by the Slovenian Research Agency through the ontologies to support biomedical data integration. Nature grant J2-9230, as well as the European Union’s Horizon 2020 research and biotechnology, 25, 11, 1251–1255. innovation programme through grant 863059 (FNS-Cloud, Food Nutrition [18] Francesco Vitali, Rosario Lombardo, and Damariz Rivero et Security). al. 2018. Ons: an ontology for a standardized description of interventions and observational studies in nutrition. REFERENCES Genes & nutrition, 13, 1, 12. [19] Mark D Wilkinson, Michel Dumontier, and IJsbrand Jan [1] Robert Arp, Barry Smith, and Andrew D Spear. 2015. Build- Aalbersberg et al. 2016. The fair guiding principles for ing ontologies with basic formal ontology. Mit Press. scientific data management and stewardship. [2] Mélanie Courtot, Frank Gibson, and Allyson L Lister et al. Scientific 2011. Mireot: the minimum information to reference an data, 3. [20] Zuoshuang Xiang, Mélanie Courtot, and Ryan R Brinkman external ontology term. Applied Ontology, 6, 1, 23–33. et al. 2010. Ontofox: web-based support for ontology reuse. 7https://github.com/panovp/FNS-Harmony BMC research notes, 3, 1, 175. 48 Absenteeism prediction from timesheet data: A case study Peter Zupančič Biljana Mileva Boshkoska Panče Panov 1A Internet d.o.o. 
Faculty of Information Studies in Jožef Stefan Institute and Naselje nuklearne elektrarne 2 Novo mesto, Ljubljanska cesta 31a, Jožef Stefan International Krško, Slovenia Novo mesto, Slovenia Postgraduate School peter.zupancic91@gmail.com Jožef Stefan Institute, Jamova cesta Jamova cesta 39 39, Ljubljana, Slovenia Ljubljana, Slovenia biljana.mileva@fis.unm.si pance.panov@ijs.si ABSTRACT In this paper, we address the task of absenteeism prediction Absenteeism, or employee absence from work, is a perpetual from time sheets data. More specifically, based on data that we get problem for all businesses, given the necessity to replace an from MojeUre time attendance register system, we want to build a absent worker to avoid a loss of revenue. In this paper, we focus predictive model to predict if or for how many days an employee on the task of predicting worker’s absence based on historical would be absent. In this case, we are considering one-week-ahead timesheet data. The data are obtained from MojeUre, a system for prediction from workers profiles and one year historical time tracking and recording working hours, which includes timesheet sheets data. To predict if an employee will be absent in a given profiles of employees from different companies in Slovenia. More week, we employee the task of binary classification, which can specifically, based on historical data for one year, we want to be addressed by using a large number of binary classification predict, under (which) certain conditions, if an employee will be methods. On the other hand, to predict the number of days an absent from work and for how long (e.g., a week, a month). In employee would be absent in a given week, we employee re- this respect, we compare the performance of different predictive gression, which can be addressed by using regression methods. modeling methods by defining the prediction task as a binary Furthermore, we observe and discuss how adding of aggregate classification task and as a regression task. Furthermore, in the attributes influences the prediction power if used together with case of one week ahead prediction, we test if we can improve the the timesheet profiles. predictions by using additional aggregate descriptive attributes, together with the timesheet profiles. 2 DATA KEYWORDS In this section, we present the MojeUre system and then de- Absenteeism at work, absence prediction, predictive modeling, scribe the structure of the raw data, as well as the process of timesheet data, human resource management data cleaning. Then we present the structure of the dataset, used for learning the predictive and the aggregate attributes, we con- structed in order to test if they would improve the predictive 1 INTRODUCTION power of the predictive models. Companies strive to have better predictive accuracy in their day to day operations, with the main goal of improving the productiv- ity of the human resources (HR) department and hence obtaining 2.1 MojeUre system higher profits and lower HR expenditures. They obtain informa- The MojeUre system (https://mojeure.si) was developed to sup- tion and insight from the large collections of human resource port the process of planning workers schedules, as well as for management (HRM) data that each employer owns, to support recording work attendance and absenteeism. In addition to the day to day operations and decision making, as well as, to comply easy recording of the working hours of employees by a company, to the national and international legislation. 
the system also provides access to each employee’s own working The new era of HR executives is moving from settling on hours, vacation control, sick leave, travel orders, etc. The system receptive choices exclusively taking into account reports and can be accessed using the web or by using a mobile application. dashboards towards connecting business information and hu- The entry of working hours is done either through a web man asset information to foresee future results which will bring application or a mobile application. In the case the company also changes. Having such data enables them to detect patterns and wants to invests into a working time registrar, this can be done trends, anticipate events and spot anomalies, forecast using what- through the registrar where the employee has a personalized card if simulations and learn of changes in employee behaviour so that for clock-in or clock-out (for example usage of break, such as a employee can take actions that lead to desired business outcomes. lunch break, a private break, etc.). The system allows different The purpose of HRM is measuring employee performance and en- types of registered hours to be entered in the system in a single gagement, studying workforce collaboration patterns, analyzing day. employee churn and turnover and modelling employee lifetime All data used in the paper was obtained from the electronic value [1]. system for recording working hours. There are currently more than 150 different companies that use the system for registering Permission to make digital or hard copies of part or all of this work for personal workers attendance. The basic function of the system is to record or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the arrivals and departures of an employee at work and to record the full citation on the first page. Copyrights for third-party components of this the various types of employee absence, such as sick leave and work must be honored. For all other uses, contact the owner/author(s). vacation leave. In addition, the system covers other absences Information society ’20, October 5–9, 2020, Ljubljana, Slovenia © 2020 Copyright held by the owner/author(s). such as paternity leave, maternity leave, part-time leave, study leave, student leave, etc. 49 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Zupančič et al. In this paper, we use data from the MojeUre system for the Table 3: Attributes representing the workers profiles year 2019 and we have timesheet attendance data for all 52 weeks. The data instances are composed of three types of attributes: (1) Attribute name Type Description attributes describing workers profiles (See Table 1), (2) attributes describing timesheets absence profiles of each worker (See Table VacationLeave numeric Total days of vacation leave for 2), and (3) attributes that are aggregates from timesheets profiles TotalDays all weeks, which are defined in constructed using domain knowledge (more details about the the timesheets data used for the attributes is provided in Section 2.2). The timesheets attributes descriptive attribute space. composing the absence profile of each worker are calculated SickLeave numeric Total days of sick leave for all based on the logged presence and absence logging data aggre- TotalDays weeks, which are defined in the gated on the week level.. 
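To make the weekly profiles concrete, the following pandas sketch aggregates per-day absence records of the kind described above into per-week counts and a binary absent-this-week flag; the column names and absence codes are assumptions for illustration, not the actual MojeUre export schema.

import pandas as pd

# Assumed minimal export: one row per employee per day, with the absence type if any.
log = pd.DataFrame({
    "EmployeeID": [1, 1, 1, 2],
    "date": pd.to_datetime(["2019-02-11", "2019-02-12", "2019-07-01", "2019-02-11"]),
    "absence_type": ["SickLeave", "SickLeave", "VacationLeave", None],
})

log["week"] = log["date"].dt.isocalendar().week
absent = log.dropna(subset=["absence_type"])

# Days of each absence type per employee and week (a weekly absence profile).
profile = (absent.groupby(["EmployeeID", "week", "absence_type"]).size()
                 .unstack(fill_value=0))

# Binary flag: the employee was absent at least one day in that week.
profile["AbsentThisWeek"] = (profile.sum(axis=1) > 0).astype(int)
print(profile)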
The entire dataset for the whole year timesheets data used for the de- consists of 232 different attributes and 2363 employees which are scriptive attribute space. defined as each row. ShortTerm numeric A count of how many times an VacationLeave3 employee was at vacation leave for at least 3 days per week. Table 1: Workers profile attributes LongTerm numeric A count of how many times an VacationLeave5 employee was on vacation leave Attribute name Type Description for at last 5 days per week. EmployeeID numeric Unique employee identifier. ShortTerm numeric A count of how many times an WorkHour numeric Data indicating how many SickLeave3 employee was on sick leave for hours per day an employee is at least 3 days. employed by contract. LongTerm numeric A count of how many times an CompanyType nominal Company type by specific cate- SickLeave5 employee was on sick leave for gories. at least 5 days. EmploymentYears numeric Describes how many years the WinterVacation numeric The number of vacation leave person has been employed by LeaveAbsence days that were used in winter. the current company. SpringVacation numeric The number of vacation leave JobType nominal Describes type of job (e.g. per- LeaveAbsence days that were used in spring. manent, part-time). SummerVacation numeric The number of vacation leave Region nominal The region in which the em- LeaveAbsence days that were used in summer. ployee’s company is located. AutumnVacation numeric The number of vacation leave LeaveAbsence days that were used in autumn. WinterSickLeave numeric The number of sick leave days Table 2: Timesheet absence profile attributes Absence that were used in winter. SpringSick numeric The number of sick leave days LeaveAbsence that were used in spring. Attribute name Type Description SummerSick numeric The number of sick leave days WeekWNYTotal numeric The number of all absences in LeaveAbsence that were used in summer. a given week, including the AutumnSick numeric The number of sick leave days sum of sick leave and (vacation) LeaveAbsence that were used in autumn. leave. WinterVacation numeric The number of vacation leave WeekWNY numeric The number of absences with LeaveHoliday days that were used in winter VacationLeave type vacation leave in a given during school holidays. week. SpringVacation numeric The number of vacation leave WeekWNY nominal The number of absences with LeaveHoliday days that were used in spring SickLeave type sick leave in a given week. during school spring holidays. WeekWNY nominal Value tells if employee was ab- SummerVacation numeric The number of vacation leave Absence sent at least 1 day in whole LeaveHoliday days that were used in summer week. during school summer holidays. AutumnVacation numeric The number of vacation leave LeaveHoliday days that were used in autumn 2.2 Data prepossessing and feature during school holidays. engineering Feature Engineering is an art (Shekhar A, 2018) and involves the process of using domain knowledge to create features with The period we are considering in our analysis is one year, the goal to increase the predictive power of machine learning that is composed of 52 weeks. For construction of the aggregate algorithms. In this section, we describe the newly constructed attributes, we have defined our seasons by weeks, defined as attributes using domain knowledge. Furthermore, we present the follows: (1) the winter season is defined from week 51 in the process of data cleaning. 
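To make the engineered aggregates of Table 3 concrete, a minimal sketch building on the weekly profile above could look as follows; the attribute names are illustrative and the season split follows the week-based definition given in Section 2.2.

import pandas as pd

def aggregate_attributes(weekly: pd.DataFrame) -> pd.DataFrame:
    """weekly: one row per (EmployeeID, Week) with SickLeave/VacationLeave day counts."""
    w = weekly.reset_index()
    g = w.groupby("EmployeeID")
    out = pd.DataFrame({
        "VacationLeaveTotalDays": g["VacationLeave"].sum(),
        "SickLeaveTotalDays": g["SickLeave"].sum(),
        # number of weeks with at least 3 (resp. 5) sick leave days
        "ShortTermSickLeave3": g["SickLeave"].apply(lambda s: int((s >= 3).sum())),
        "LongTermSickLeave5": g["SickLeave"].apply(lambda s: int((s >= 5).sum())),
    })
    # seasonal total, using the week-based winter definition from Section 2.2
    winter = w[(w["Week"] >= 51) | (w["Week"] <= 12)]
    out["WinterSickLeaveAbsence"] = winter.groupby("EmployeeID")["SickLeave"].sum()
    return out.fillna(0)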
Before cleaning, the original dataset previous year to week 12 in the New year; (2) the spring season contains 2087 instances of individual employees. The engineered is defined from week 13 to week 25; (3) the summer season is aggregate attributes using domain knowledge from timesheets defined from week 26 week to week 39; and (4) the autumn season profiles are presented in Table 3. is defined from week 40 week to week 49. 50 Absenteeism prediction from timesheet data: A case study Information society ’20, October 5–9, 2020, Ljubljana, Slovenia In addition, we also defined the school holidays by weeks, which are defined as follows: (1) the winter holidays are defined Target Descriptive attributes from week 7 to 8; (2) the spring holidays are defined from week attribute 18 to 19; (3) the summer holidays are defined from week 26 to Timesheet Worker absence Week K week 35; and (4) the autumn holidays are defined from week 44 profile binary profile Absence to week 45. 1-(K-1) week After we cleaned up the initial dataset, we obtained a smaller number of dataset instances. This resulted in a dataset with 961 (a) Without aggregate attributes distinct rows or more precisely different employees. The main Target control statement for the data cleaning was a test if an employee Descriptive attributes attribute has less than one VacationLeaveTotalDays in the defined period. Timesheet Timesheet This would mean that: (1) an employee that fulfills this condition Worker absence absence Week K doesn’t work any more in company; or (2) the company doesn’t profile binary profile aggregates Absence use recording system anymore; or (3) the employee is student 1-(K-1) week 1-(K-1) week and for students the vacation leave days are not recorded as they (b) With aggregate attributes are usually paid per working hour only. The most of employees in the dataset are working in company Figure 1: The structure of the data instances used for learn- type called “Izobraževanje, prevajanje, kultura, šport” (Education, ing predictive models translation services, culture, sports). In addition, most of the em- ployees are coming from the region “Osrednjeslovenska” (Central Slovenia region). The largest number of absence vacation leave or holiday leave was in week 52, which is the last week in year 2019 which is expected. the aggregate attributes were calculated. The absence of the 13th week was used a target attribute. For each quarter, we constructed 3 DATA ANALYSIS SCENARIOS AND two different variants of datasets, one containing the aggregate EXPERIMENTS attributes and the other without the aggregate attributes. This Research question. In general, in this paper we want to perform procedure was done for both tasks: binary classification and re- one-week ahead prediction of employee absence, using worker gression. profile data, historical timesheet data aggregated on a week level, as well as aggregated attributes described in the previous sec- Experimental setup. For our paper, we used Weka as main soft- tion. We explore the task of predicting employee absence both ware [2] to execute predictive modelling experiments. WEKA is as a binary classification task and as a regression task. 
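As an illustration of this dataset construction (predicting week K = 13 from weeks 1 to 12, as in the Q1 setting), a sketch of the data layout is given below. The actual experiments were run in Weka, so this Python snippet with placeholder names is only meant to show how the descriptive attributes and the two target formulations fit together.

import pandas as pd

def build_dataset(profiles, weekly, aggregates, k=13, use_aggregates=True, task="classification"):
    """profiles: one row per employee (indexed by EmployeeID);
    weekly: one row per (EmployeeID, Week) with absence day counts."""
    hist = weekly.reset_index().query("Week < @k")
    # weeks 1..K-1 flattened into one column per (attribute, week)
    wide = hist.pivot(index="EmployeeID", columns="Week",
                      values=["SickLeave", "VacationLeave", "Total"])
    wide.columns = [f"Week{w}{name}" for name, w in wide.columns]
    X = profiles.join(wide, how="inner")
    if use_aggregates:
        X = X.join(aggregates)
    target = weekly.reset_index().query("Week == @k").set_index("EmployeeID")
    if task == "classification":
        y = (target["Total"] > 0).astype(int)   # will the employee be absent in week K?
    else:
        y = target["Total"]                     # how many days absent in week K?
    return X, y.reindex(X.index).fillna(0)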
In the an open source software provides tools for data preprocessing, experiments, we want to test if and how the aggregates attributes implementation of several Machine Learning algorithms, and influence the predictive power of the built models both for the visualization tools so that one can develop machine learning case of binary classification and regression. techniques and apply them to real-world data mining problems. In the experiments, for all methods we used the default method Tasks. In the binary classification task, we want only to predict settings from Weka mining software. The evaluation method if an employee will be absent in a given week. For this case, we used was 10 fold cross-validation. use the boolean attribute WeekWNYAbsence as a target attribute (WNY is the identifier of the target week). In the regression Methods. Here, we used different predictive methods imple- task, we want to predict the number of absence days. For this mented in the WEKA software with different settings. For the case, we use one of the following numeric attributes as targets regression task, we compare the performance of the following WeekWNYTotal (for predicting the total number of absence days), methods Linear regression (LR), M5P (both regression and model WeekWNYVacationLeave (for predicting the number of vacation trees)[3], RandomForest (RF) [4] with M5P trees as base learners, leave days), or WeekWNYSickLeave (for predicting the number Bagg (Bag) [5] having M5P trees as base learners, IBK (nearest of sick leave days). neighbour classifier with different number of neighbours) [6] and SMOreg (support vector regression) [7]. Construction of the experimental datasets For the purpose For binary prediction, we compare the performance of the of analysis, we construct two types of datasets: (1) the first type following methods: jRIP (decision rules) J48 (decision trees) Ran- contain worker profile and timesheet absence profiles as descrip- domForest (RF), Bagging (Bagg) having J48 trees as base learners, tive attributes (see Figure 1a); and (2) the second type includes RandomSubSpace (RS) [8] having J48 trees as base learners, SMO also timesheets absence aggregates (see Figure 1b). (support vector machines) [9], and IBK (nearest neighbour classi-In order to perform analysis, we need to properly construct the fier with different number of neighbours). datasets used for learning predicting models. For example, if we want to predict workers absence for week 15, we use historical Evaluation measures. To answer our research question for the timesheets data from week 1-14 together with the aggregates case of regression, we use several measures for regression anal- calculated on this period as descriptive attributes. ysis, such as: Mean Absolute Error (MAE), Root mean squared We decided to split the year consisting of 52 weeks in four error (RMSE), and Correlation coefficient (CC). quarters (Q1: W1-W13, Q2: W14-W26, Q3:W27-W39, Q4:W40- For the case of classification, we use several measures for clas- W52), each containing 13 weeks. The absence data for the first sification analysis, such as: the percentage of correctly classified 12 weeks were used as historical timesheet profiles, out of which instances (classification accuracy), precision, and recall. 51 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Zupančič et al. Table 4: Predictive performance results. The bold value denotes the highest value when we compare datasets with (A) or without (NA) added aggregate attributes. 
The gray cells denote the best performing method for each dataset. (a) Performance results for the regression task - RMSE measure (less is better) Dataset LR MP5 M5P-R RF Bagg IBK(K=1) IBK(K=3) IBK(K=7) SMOreg Q1-A 0.789 0.692 0.775 0.688 0.64 0.804 0.687 0.734 0.681 Q1-NA 0.723 0.674 0.767 0.729 0.647 0.798 0.693 0.724 0.659 Q2-A 1.692 1.369 1.422 1.412 1.438 1.894 1.476 1.382 1.617 Q2-NA 1.44 1.382 1.396 1.457 1.379 1.752 1.506 1.425 1.497 Q3-A 0.942 0.919 0.976 0.999 0.935 1.409 1.074 1.015 0.963 Q3-NA 0.911 0.929 0.956 0.968 0.927 1.223 1.046 1.017 0.969 Q4-A 0.977 0.947 0.961 0.923 0.922 1.222 1.029 1.005 0.984 Q4-NA 0.992 0.985 0.976 1.024 0.975 1.186 1.066 0.999 1.007 (b) Performance results for the classification task - Accuracy in% (more is better) Dataset JRip j48 RF Bagg RS SMO IBK(K=1) IBK(K=3) IBK(K=7) Q1-A 87.429 90.810 90.357 90.833 89.881 92.762 87.452 91.810 90.810 Q1-NA 87.429 90.810 90.381 89.857 90.357 90.833 89.429 91.810 90.833 Q2-A 63.645 68.879 65.751 65.419 66.736 69.200 58.153 64.347 68.842 Q2-NA 66.466 68.177 67.118 66.441 66.429 66.773 65.049 62.291 67.463 Q3-A 84.429 84.404 83.288 83.061 84.409 86.677 77.182 82.616 85.333 Q3-NA 83.737 83.520 82.379 83.737 84.864 86.449 81.263 85.101 84.879 Q4-A 71.130 67.277 72.150 70.460 70.305 70.452 69.627 70.644 70.302 Q4-NA 70.455 68.266 66.774 67.441 69.791 69.466 66.093 67.610 68.960 4 RESULTS AND DISCUSSION be absent in a given week). To see the difference in performance, Regression task1. In Table 4a, we present the results for RMSE we performed experiments on datasets constructed on different measure. It indicates how close the observed data points are to quarters of the year. The best prediction method in the case of re- the model’s predicted values, and lower values indicate better fit. gression is Bagging and in general we could say that predictions From the results, we can observe that in general Bagging of M5P are slightly better if we don’t use aggregate attributes. The best trees obtains the best performance. Predicting absence in week method in the case of classification is SMO. Again almost same 13 from Q1 is generally better without using aggregate attributes. results with using or not using external aggregate attributes. We have similar behaviour for predicting absence in week 26 (Q2) In future work, we plan to perform selective analysis of absen- and week 39 (Q3). Predicting absence for the last week in the teeism using the same data based on different criteria, such as year from Q4 is generally better done using additional aggregate seasonality, closeness to holidays (before, after), critical weeks for attributes. If we consider MAE, the best performing method is certain professions etc. In addition, we plan to perform regional SMOreg, and for Q1, Q2 better results are obtained without the analysis and workers domain analysis which is based on com- use of aggregate attributes, opposite to the Q3 and Q4. Finally, if pany type. Moreover, more insight into absence patterns will be we consider CC the best performing method is Bagging, and for available after collecting several years of attendance data for each Q1 and Q4 better results are obtained without using aggregate employee. Finally, we plan to compare the different granularity attributes, opposite to Q2 and Q3. of prediction (day - based vs. week - based vs. half a month based vs. month based analysis). Classification task2. In Table 4b, we present the results for accuracy. 
From the results, we can observe that in general SMO ACKNOWLEDGMENTS obtains the best performance. For Q1, we obtain better results We thank the company 1A Internet d.o.o., which provided us access to if we do not include aggregate attributes. For Q2, Q3 and Q4 the data which were used in our research. Panče Panov is supported by the best results are obtained by using the additional aggregate the Slovenian Research Agency grant J2-9230. attributes. If we consider precision the best performing methods are SMO and JRip, while for recall the best performing method REFERENCES is IBK using 7 nearest neighbours. [1] Malisetty, S., Archana, R. V., & Kumari, K. V. (2017). Predictive analytics in HR management., Indian Journal of Public Health Research & Development, 8(3), 115-120. 5 CONCLUSION AND FUTURE WORK [2] Witten, I. H., & Frank, E. (2002). Data mining: practical machine learning tools The main goal of the paper was to test if adding additional and techniques with Java implementations., Acm Sigmod Record, 31(1), 76-77. [3] Ross J. Quinlan. Learning with Continuous Classes. In: 5th Australian Joint timesheet aggregate attributes can influence the predictive power Conference on Artificial Intelligence., Singapore, 343-348, 1992. in the case of one-week ahead absenteeism prediction from [4] Leo Breiman (2001). Random Forests., Machine Learning. 45(1):5-32. timesheet data. The research was performed on data from year [5] Leo Breiman (1996). Bagging predictors., Machine Learning. 24(2):123-140. [6] D. Aha, D. Kibler (1991). Instance-based learning algorithms., Machine Learning. 2019, collected by the MojeUre work attendance register system. 6:37-66. We used various predictive modelling methods formulating the [7] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, K.R.K. Murthy. Improvements to the prediction task as regression (predicting the number of absent SMO Algorithm for SVM Regression., In: IEEE Transactions on Neural Networks, 1999. days in a week) and classification (predicting if an employee will [8] Tin Kam Ho (1998) The Random Subspace Method for Constructing Decision Forests., IEEE Transactions on Pattern Analysis and Machine Intelligence. 1Complete results for regression are presented at the following URL 20(8):832-844. URL http://citeseer.ist.psu.edu/ho98random.html. https://tinyurl.com/yyp85vfr [9] J. Platt. Fast Training of Support Vector Machines using Sequential Minimal 2Complete results for classification are presented at the following URL Optimization., In B. Schoelkopf and C. Burges and A. Smola, editors, Advances https://tinyurl.com/y6o6h6d8 in Kernel Methods - Support Vector Learning, 1998. 52 Monitoring COVID-19 through text mining and visualization M.Besher Massri Joao Pita Costa Andrej Bauer Jožef Stefan Institute, Slovenia Quintelligence, Slovenia University of Ljubljana, Slovenia besher.massri@ijs.si joao.pitacosta@quintelligence.com andrej.bauer@andrej.com Marko Grobelnik Janez Brank Luka Stopar Jožef Stefan Institute, Slovenia Jožef Stefan Institute, Slovenia Jožef Stefan Institute, Slovenia marko.grobelnik@ijs.si janez.brank@ijs.si luka.stopar@ijs.si ABSTRACT The global health situation due to the SARS-COV-2 pandemic motivated an unprecedented contribution of science and tech- nology from companies and communities all over the world to fight COVID-19. In this paper, we present the impactful role of text mining and data analytics, exposed publicly through IRCAI’s Coronavirus Watch portal. 
We will discuss the available technol- ogy and methodology, as well as the ongoing research based on the collected data. KEYWORDS Text mining, Data analytics, Data visualisation, Public health, Figure 1: Coronavirus Watch portal Coronavirus, COVID-19, Epidemic intelligence 1 INTRODUCTION the lack of resolution of the data in aspects like the geographic When the World Health Organization (WHO) announced the location of reported cases, the commodities (i.e., other diseases global COVID-19 pandemic on March 11th 2020 [25], following that also influence the death of the patient), the frequency of the the rising incidence of the SARS-COV-2 in Europe, the world data, etc. On the other hand, it was not common to monitor the started reading and talking about the new Coronavirus. The ar- epidemic through the worldwide news (with some exceptions as rival of the epidemic to Europe scaled out the news published the Ravenpack Coronavirus News Monitor [21]). about the topic, while public health institutions and governmen- The Coronavirus Watch portal suggests the association of tal agencies had to look for existing reliable solutions that could reported incidence with worldwide published news per country, help them plan their actions and the consequences of these. which allows for real-time analysis of the epidemic situation Technological companies and scientific communities invested and its impact on public health (in which specific topics like efforts in making available tools (e.g. the GIS [1] later adopted mental health and diabetes are important related matters) but by the World Health Organisation (WHO)), challenges (e.g. the also in other domains (such as economy, social inequalities, etc.). Kaggle COVID-19 competition [13]), and scientific reports and This news monitoring is based on state-of-the-art text mining data (e.g. the repositories medRxiv [15] and Zenodo [27]). technology aligned with the validation of domain experts that In this paper we discuss the Coronavirus Watch portal [12], ensures the relevance of the customized stream of collected news. made available by the UNESCO AI Research Institute (IRCAI), Moreover, the Coronavirus Watch portal offers the user other comprehending several data exploration dashboards related to perspectives of the epidemic monitoring, such as the insights the SARS-COV-2 worldwide pandemic (see the main portal in from the published biomedical research that will help the user Figure 1). This platform aims to expose the different perspectives to better understand the disease and its impact on other health on the data generated and trigger actions that can contribute to conditions. While related work was promoted in [13] in relation a better understanding of the behavior of the disease. with the COVID-19, and is offered in general by MEDLINE mining tools (e.g., MeSH Now [16]), there seems to be no dedicated tool 2 RELATED WORK to the monitoring and mining of COVID-19 - related research as that presented here. The many platforms that have been made publicly available over the internet to monitor aspects of the COVID-19 pandemics are mostly focusing on data visualization based on the incidence of 3 DESCRIPTION OF DATA the disease and the death rate worldwide (e.g., the CoronaTracker 3.1 Historical COVID-19 Data [3]). 
The limitations of the available tools are potentially due to To perform an analysis of the growth of the coronavirus, we need Permission to make digital or hard copies of all or part of this work for personal to use the historical data of cases and deaths. This data is retrieved or classroom use is granted without fee provided that copies are not made or from a GitHub repository by John Hopkins University[4]. The distributed for profit or commercial advantage and that copies bear this notice and data source is based mainly on the official data from the World the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy Health Organization (WHO)[24] along with some other sources, otherwise, or republish, to post on servers or to redistribute to lists, requires prior like the Center for Disease and Control[2], and Worldometer[26], specific permission and/or a fee. Request permissions from permissions@acm.org. among others. This data provides the basis for all functionality Information society ’20, October 5–9, 2020, Ljubljana, Slovenia © 2020 Association for Computing Machinery. that depended on the statistical information about COVID-19 numbers. 53 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia 3.2 Live Data from Worldometer Apart from historical data, live data about the COVID-19 number of cases, deaths, recovered, and tests are retrieved from the worl- dometer website. Although the cases might not be as official as the one provided by John Hopkins University (which is based on WHO data), this source is updated many times per day providing the latest up-to-date data about COVID-19 statistics at all times. 3.3 Live News about Coronavirus The live news is retrieved from Event Registry [10], which is a media-intelligence platform that collects news media from around the world in many languages. The service analyzes news from more than 30,000 news, blogs, and PR sources in 35 lan- Figure 2: A snapshot of the 5D Visualization on March guages. 23rd. Countries that were at the peak in terms of growth are shown high up like Turkey. Whereas countries that 3.4 Google COVID-19 Community mostly contained the virus are shown down like China. Mobility Data Google’s Community Mobility [11] data compares mobility pat-by clicking on the country name on the left table. As seen in terns from before the COVID-19 crisis and the situation on a figure 1. weekly basis. Mobility patterns are measured as changes in the frequency of visits to six location types: Retail and recreation, 4.3 Statistical Visualizations Grocery and pharmacy, Parks, Transit stations, Workplaces, and Residential. The data is provided on a country level as well as on The following set of visualization all aims at displaying the statis- a province level. tics about COVID-19 cases and deaths in a visual format. While they all provide countries comparison, each one focus on differ- ent perspective; Some are more complex and focus on the big 3.5 MEDLINE: Medical Research Open picture (5D evolution), and some are simple and focus on one Dataset aspect (Progression and Trajectory). 
Besides, all of them have The MEDLINE dataset [14] contains more than 30 million cita-configuration options to tweak the visualization, like the ability tions and abstracts of the biomedical literature, hand-annotated to change the scale of the axes to focus on the top countries or by health experts using 16 major categories and a maximum of the long tale. Or a slider to manually move through the days for 13 levels of deepness. The labeled articles are hand-annotated by further inspection. Furthermore, the default view compares all humans based on their main and complementary topics, and on the countries or the top N countries, depending on the visualiza- the chemical substances that they relate to. It is widely used by tion. However, it’s possible to track a single country or a set of the biomedical research community through the well-accepted countries and compare them together for a more focused view. search engine PubMed [19]. This is done by selecting the main country by clicking on it on the left table and proceeding to select more countries by pressing 4 CORONAVIRUS WATCH DASHBOARD the ctrl key while clicking on the country. The main layout of the dashboard displayed in figure 1 consists 4.3.1 5D Evolution. 5D Evolution is a visualization that displays of two sides. It is split into the left table of countries, where a the evolution of the virus situation through time. It is called like simple table of statistics is provided about countries along with that since it encompasses five dimensions: x-axis, y-axis, bubble the total numbers of cases, deaths, and recovered. On the right size, bubble color, and time, as seen in figure 2. By default, it il-side, there is a navigation panel with tabs, each representing a lustrates the evolution of the virus in countries based on N. cases functionality. Each functionality answers some questions and (x-axis), The growth factor of N. Cases (y-axis), N. Deaths (bubble provides insights about a certain type of data. size), and country region (bubble color) through time. In addition, a red ring around the country bubble is drawn whenever the first 4.1 Coronavirus Data Table death appears. The growth rate represents how likely that the The data table functionality is a simple table that shows the basic numbers are increasing with respect to the day before. A growth statistics about the new coronavirus. It’s taken from Worldometer rate of 2 means that the numbers are likely to double in the next as it’s the most frequently updated source for coronavirus. The day. The growth rate is calculated using the exponential regres- data table comes in two forms, one that is a simplified version sion model. At each day the growth rate is based on the N. cases which is the table on the left, and one contains the full information from the previous seven days. The goal of this visualization to in a separate tab. show how countries relate to each other and which are exploding in numbers and which ones managed to "flatten the curve", since 4.2 Coronavirus Live News flattening the curve means less growth rate. It’s intended to be one visualization that gives the user a big picture of the situation. The second functionality is a live news feed about coronavirus from around the world. The feed comes from Event Registry, 4.3.2 Progression. The progression visualization displays the which is generated by querying for articles that are annotated simple Date vs N. cases/deaths line graph. It helps to provide with concepts and keywords related to coronavirus. 
The user can a simplistic view of the situation and compare countries based check for a country’s specific news (news source in that country) on the raw numbers only. The user can display the cumulative 54 Monitoring COVID-19 through text mining and visualization Information society ’20, October 5–9, 2020, Ljubljana, Slovenia numbers where each day represents the numbers up to now, or daily where at each date the numbers represent the cases/deaths on that day only. 4.3.3 Trajectory. While the progress visualization displays the normal date vs N. cases/deaths, this visualization seeks to com- pare how the trajectory of the countries differ starting from the point where they detect cases. This visualization helps to com- pare countries’ situations if they all start having cases on the same date. The starting point has been set to the day the country reaches 100 cases, so we would compare countries when they started gaining momentum. 4.4 Time Gap The time gap functionality tries to estimate how the countries are aligned and how many days each country is behind the other, whether that is in the number of cases or deaths. This assumes that the trajectory of the country will continue as it with taking much more strict/loose measurements, which is a rough assump- tion. It helps to estimate how bad or good the situation in terms of the number of days. To see the comparison, a country has to Figure 3: A snapshot of the Social Distancing Simulator. be selected from the table on the left. However, not all countries The canvas show a representation of the population. with are comparable as they have very different trajectories or growth red dots representing sick people, yellow dots represent- rates. ing immunized people, and grey dots represent deceased The growth of each country is represented as an exponential people. function, the base is calculated using linear regression on the log of the historical values (that is, exponential regression). Based on that, the duplication N. days, or the N. days the number of The simulator is controlled by three parameters. First, Social cases/deaths will double is determined. two countries are compa- distancing that controls to what extent the population enforces rable if they have a reasonable difference is the base or doubling social distancing. At 0% there is no social distancing and per- factor. If they are comparable, we see where the country with the sons move with maximum speed so that there is a great deal smaller value fits in the historical values of the country with the of contact between them. At 100% everyone remains still and larger numbers, with linear interpolation if the number is not there is no contact at all. Second, mortality is the probability exact, hence the decimal values. that a sick person dies. If you set mortality to 0% nobody dies, while the mortality of 100% means that anybody who catches the infection will die. Finally, infection duration determines how 4.5 Mobility long a person is infected. A longer time gives an infected person The mobility visualization is based on google community mo- more opportunities to spread the infection. Since the simulation bility data that describe how communities in each country are runs at high speed, time is measured in seconds. moving based on 6 parameters: Retail and recreation, Grocery and pharmacy, Parks, Transit stations, Workplaces, and Residen- 4.7 Biomedical Research Explorer tial. 
The data is then reduced to 2-dimensional data while keeping To better understand the disease, the published biomedical sci- the Euclidean proximity nearly the same. The visualization can ence is the source that provides accurate and validated infor- indicate that the closer the countries are on the visualization, the mation. Taking into consideration a large amount of published similar the mobility patterns they have. The visualization uses science and the obstacles to access scientific information, we the T-SNE algorithm for dimensionality reduction [23], which made available a MEDLINE explorer where the user can query reduces high dimensional data to low dimensional one while the system and interact with a pointer to specify the search re- keeping the distance proximity between them proportionally sults (e.g., obtaining results on biomarkers when searching for the same as possible. The algorithm works in the form of iter- articles hand-annotated with the MeSH class "Coronavirus"). ations, at each iteration, the bubbles representing the country To allow for the exploration of any health-related texts (such as are drawn. We used those iterations to provide animation to the scientific reports or news) we developed an automated classifier visualization. [5] that assigns to the input text the MeSH classes it relates to. The annotated text is then stored in Elasticsearch [18], from where 4.6 Social Distancing Simulator it can be accessed through Lucene language queries, visualized The Social Distancing simulator is displayed in figure 3. Each over easy-to-build dashboards, and connected through an API circle represents a person who can be either healthy (white), to the earlier described explorer (see [8], [20] and [17] for more immune (yellow), infected (red), or deceased (gray). A healthy detail). person is infected when they collide with an infected person. The integration of the MeSH classifier with the worldwide After a period of infection, a person either dies or becomes per- news explorer Event Registry allows us to use MeSH classes in manently immune. Thus the simulation follows the Susceptible- the queries over worldwide news promoting an integrated health Infectious-Recovered-Deceased (SIRD) compartmental epidemio- news monitoring [9] and trying to avoid bias in this context logical model. [7]. An obvious limitation is a fact that the annotation is only 55 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia available for news written in the English language, being the [7] J. Pita Costa et al. 2019. Health news bias and its impact unique language in MEDLINE. in public health. In Proceedings of the Slovenian KDD con- ference. 5 CONCLUSION AND FUTURE WORK [8] J. Pita Costa et al. 2020. Meaningful big data integration for In this paper, we presented the coronavirus watch dashboard as a global covid-19 strategy. Computer Intelligence Magazine. a use-case of observing pandemic. However, this methodology [9] J. Pita Costa et al. 2017. Text mining open datasets to sup- can be applied to other kinds of diseases given the availability of port public health. In WITS 2017 Conference Proceedings. similar data. For further development, we plan to implement a [10] EventRegistry. 2020. Event Registry. https://eventregistry. local dashboard for other countries as well which would provide org. (2020). local data in the local language. In addition, given the existence of [11] Google. 2020. 
Google COVID-19 Community Mobility Re- more than seven months of historical data, we would like to build port. https://www.google.com/covid19/mobility/. (2020). some predictive models to predict the number of cases/deaths in [12] IRCAI. 2020. IRCAI coronavirus watch portal. http : / / the next few days. coronaviruswatch.ircai.org/. (2020). Moreover, we are using the StreamStory technology [22] in [13] Kaggle. 2020. Kaggle covid-19 open research dataset chal- order to: (i) compare the evolution of the disease between coun- lenge. https : / / www. kaggle. com / allen - institute - for - tries by comparing their time-series of incidence; (ii) investi- ai/CORD-19-research-challenge. (2020). gate the correlation between the incidence of the disease with [14] MEDLINE. 2020. MEDLINE description of the database. weather conditions and other impact factors; and (iii) analyze https://www.nlm.nih.gov/bsd/medline.html. (2020). the dynamics of the evolution of the disease based on incidence, [15] medRxiv. 2020. medRxiv covid-19 sars-cov-2 preprints morbidity, and recovery. This technology allows for the anal- from medrxiv and biorxiv. https://connect.medrxiv.org/ ysis of dynamical Markov processes, analyzing simultaneous relate/content/181. (2020). time-series through transitions between states, offering several [16] MeSHNow. 2020. MeSHNow. https://www.ncbi.nlm.nih. customization options and data visualization modules. gov/CBBresearch/Lu/Demo/MeSHNow/. (2020). Furthermore, following the work done in the context of the [17] MIDAS. 2020. MIDAS COVID-19 portal. http : / / www. Influenza epidemic in [6], we are using Topological Data Analysis midasproject.eu/covid-19/. (2020). methods to understand the behavior of COVID-19 throughout [18] Elastic NV. 2020. Elasticsearch portal. https://www.elastic. Europe. In it, we examine the structure of data through its topo- co/. (2020). logical structure, which allows for comparison of the evolution [19] PubMed. 2020. PubMed biomedical search engine. https: of the epidemics within countries through the encoded topology //pubmed.ncbi.nlm.nih.gov/. (2020). of their incidence time series. [20] Quintelligence. 2020. Quintelligence COVID-19 portal. http://midas.quintelligence.com/. (2020). ACKNOWLEDGMENTS [21] Ravenpack. 2020. Ravenpack coronavirus news monitor. The first author has been supported by the Knowledge 4 All https://coronavirus.ravenpack.com/. (2020). foundation and the H2020 Humane AI project under the European [22] Luka Stopar. 2020. StreamStory. http://streamstory.ijs.si/. research and innovation programme under GA No. 761758), while (2020). the second author was funded by the European Union research [23] Laurens van der Maaten and Geoffrey Hinton. 2008. Vi- fund ’Big Data Supporting Public Health Policies’, under GA ualizing data using t-sne. Journal of Machine Learning No. 727721. The third author acknowledges that this material is Research, 9, (November 2008), 2579–2605. based upon work supported by the Air Force Office of Scientific [24] WHO. 2020. WHO Coronavirus portal. https://www.who. Research under award number FA9550-17-1-0326. int/emergencies/diseases/novel-coronavirus-2019. (2020). [25] WHO. 2020. World Health Organization who director- REFERENCES general’s opening remarks at the media briefing on covid- 19 - 11 march 2020. https://www.who.int/dg/speeches/ [1] ArcGIS. 2020. ArcGIS who covid-19 dashboard. https:// detail/who-director-general-s-opening-remarks-at-the- covid19.who.int/. (2020). media-briefing-on-covid-19---11-march-2020. (2020). 
[2] CDC. 2020. Center for Disease Control and Prevention. [26] WorldoMeters. 2020. WorldoMeters. https://www.worldometers. https://www.cdc.gov/coronavirus/2019-ncov/index.html. info/coronavirus/. (2020). (2020). [27] Zenodo. 2020. Zenodo coronavirus disease research com- [3] CoronaTracker. 2020. CoronaTracker. https://www.coronatracker. munity. https : / / zenodo . org / communities / covid - 19/. com/analytics/. (2020). (2020). [4] CSSE. 2020. Covid-19 data repository by the center for systems science and engineering (csse) at johns hopkins university. https://github.com/CSSEGISandData/COVID- 19. (2020). [5] J. Pita Costa et al. 2020. A new classifier designed to an- notate health-related news with mesh headings. Artificial Intelligence in Medicine. [6] J. Pita Costa et al. 2019. A topological data analysis ap- proach to the epidemiology of influenza. In Proceedings of the Slovenian KDD conference. 56 Usage of Incremental Learning in Land-Cover Classification Jože Peternelj Beno Šircelj Klemen Kenda Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Jamova 39, 1000 Ljubljana, Jožef Stefan International Slovenia Slovenia Postgraduate School joze.peternelj@ijs.si beno.sircelj@ijs.si Jamova 39, 1000 Ljubljana, Slovenia klemen.kenda@ijs.si ABSTRACT 2. DATA In this paper we present a comparison of a variety of incre- 2.1 EO data mental learning algorithms along with traditional (batch) The Earth observation data were provided by the Sentinel 2 learning algorithms in an earth observation scenario. The mission of the EU Copernicus programme, whose main ob- approach was evaluated with the earth observation data jectives are land monitoring, detection of land use and land set for land-cover classification from Europe Space Agency’s changes, support for land cover creation, disaster relief sup- Sentinel-2 mission, the digital elevation model and the ground port and monitoring of climate change [2]. The data com-truth data of land use and land cover from Slovenia. We prise 13 multi-spectral channels in the visible/near- infrared show that incremental algorithms can produce competitive (VNIR) and short wave infrared (SWIR) spectral range with results while using less time than batch methods. a temporal resolution of 5 days and spatial resolutions of 10m, 20m and 60m [8]. The Sentinel’s Level-2A products Keywords (surface reflections in cartographic geometry) were accessed remote sensing, earth observation, incremental learning, ma- via the services of SentinelHub1 and processed using eo-chine learning, classification learn2 library. Additionally, a digital elevation model for Slovenia (EU-DEM) with 30m resolution3 was used. 1. INTRODUCTION 2.2 LULC data Land cover classification is one of the common and well re- searched tasks of machine learning (ML) in the Earth Ob- LULC (Land Use Land Cover) data for Slovenia is collected servation (EO) community [1]. The challenge is to classify by the Ministry of Agriculture, Forestry and Food and is land into different types based on remote sensing data such publicly available [10]. The data is provided in shapefile for-as satellite images, radar data, information on weather [12] mat, with each polygon representing a patch of land marked and altitude. The most commonly used data are satellite with one of the LULC classes. Originally there were 25 images, which may vary in acquisition period, resolution or classes, but we introduced a more general dataset by group- wavelength. 
A plethora of algorithms have explored the potential of using a single-date image [3] and even time series of images for the task [11, 13]. Extensive work with state-of-the-art accuracy was performed using methods of deep learning [14]. The latter report a high computational effort in the learning and forecasting phase, which reduces their potential for continuous tasks requiring a timely response. There have also been efforts to reduce learning and prediction times using intelligent feature selection [6, 7]. To the best of our knowledge, no cases have been reported where stream models have been used in an EO scenario. The primary purpose of incremental learning would be to reduce the computational cost of classification, regression, or clustering techniques, which, when dealing with the large data provided by Sentinel 2 and other sources, can be a significant cost to organizations trying to extract knowledge from that data. One of the advantages of incremental learning is that it is not necessary to load all the data into memory at once when creating a model. We only need to store the model and the part of the data we are processing. This could be especially useful in various EO scenarios, as the data from Copernicus services is estimated to exceed 150 PB.
ing similar classes together. The frequencies of the 8 newly grouped classes are shown in Figure 1.
1https://www.sentinel-hub.com/
2https://github.com/sentinel-hub/eo-learn
3https://www.eea.europa.eu/data-and-maps/data/eu-dem#tab-original-data
57
Figure 1: Frequencies of grouped classes for LULC data from 2017 show that the new simplified classification preserves the most common classes separated and merges the less common classes. Classes with the lowest frequencies were selected for oversampling.
2.3 Feature Engineering
The EO data were collected for the whole year. 4 raw band measurements (red, green, blue - RGB and near-infrared - NIR) and 6 relevant vegetation-related derived indices (normalized differential vegetation index - NDVI, normalized differential water index - NDWI, enhanced vegetation index - EVI, soil-adjusted vegetation index - SAVI, structure insensitive pigment index - SIPI and atmospherically resistant vegetation index - ARVI) were considered. The derived indices are based on extensive domain knowledge and are used for assessing vegetation properties. One example is the NDVI index, which is an indicator of vegetation health and biomass. Its value changes during the growth period of the plants and differs significantly from other unplanted areas. The NDVI is calculated as: NDVI = (NIR − red) / (NIR + red).
Figure 2: Example of some of the timeless features. ARVI_max_mean_len shows the length of maximum mean value in a sliding temporal neighbourhood of ARVI index. BLUE_max_mean_surf shows the surface of the flat interval area containing the peak using the blue raw band. EVI_mean_val shows mean value of EVI index and SAVI_neg_sur shows the maximum surface of the first negative derivative interval of SAVI index.
Timeless features were extracted based on Valero et al. [11]. These features can describe the three most important crop stages: the beginning of greenness, the ripening period and the beginning of senescence [11, 13]. Annual time series have different shapes due to the phenological cycle of a crop and characterize the development of a crop. With timeless features, they can be represented in a condensed form. For each pixel, 18 features per each of 10 time series were generated.
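As a small illustration of the band-derived indices listed in Section 2.3: the paper computes them with eo-learn, and only the NDVI formula is stated in the text, so the NDWI form below (the common green/NIR definition) is an assumption.

import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - red) / (NIR + red), computed per pixel."""
    return (nir - red) / (nir + red + 1e-8)

def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Common NDWI definition (assumption): (green - NIR) / (green + NIR)."""
    return (green - nir) / (green + nir + 1e-8)

# toy 2x2 reflectance patches
red = np.array([[0.10, 0.12], [0.30, 0.05]])
nir = np.array([[0.45, 0.40], [0.32, 0.50]])
print(ndvi(nir, red))  # high values indicate healthy vegetation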
From elevation data, the raw value and maximum tilt for a given pixel were calculated as 2 additional features. In total 182 features were constructed. From these features only a Pareto-optimal subset of 9 features was selected [6].
3. METHODOLOGY
Classification accuracy (CA) and F1 score were calculated for 11 different ML methods, 6 batch learning methods and 5 incremental learning methods. All incremental learning methods are available in the ml-rapids (MLR)4 library, which has been developed in order to support the use of incremental learning techniques within the eo-learn [4] library.
Hoeffding Tree (incremental)
Hoeffding tree (HT) is an incremental decision tree that can learn from massive streams. It assumes that the distribution of generating examples does not change over time. The Hoeffding tree begins as an initially empty leaf. Each time a new example arrives, the algorithm sorts it down the tree (it updates the internal nodes' statistics) until it reaches a leaf. When it reaches the leaf, it updates the leaf statistics of all unused attributes. It then takes the best (A) and second-best (B) attributes based on standard deviation and calculates the ratio of their reductions. To find the best attribute to split a node, the Hoeffding bound is used. The algorithm first checks if the ratio is less than 1 − ε, where ε = √(log(1/δ) / (2n)) and 1 − δ is the desired confidence. If the ratio is small enough, meaning that attribute A is really better than attribute B, then the algorithm divides the node by that attribute.
Bagging of HT (incremental)
Given a standard training set D of size n, bagging generates m new training sets Di, each of size n′, by uniform sampling from D. Because the sampling is done with replacement, some observations can be repeated in each Di. If n′ = n, then for large n the set Di is expected to have the fraction (1 − 1/e) (≈ 63.2%) of the unique examples of D, the rest being duplicates. Then, m HT models are fitted using the above m samples and combined by voting. To include a new sample, a random subset of models is selected according to the Poisson distribution [9], and these models are updated with the sample in the same way as the HT model described above.
Naïve Bayes (incremental)
Naïve Bayes (NB) is a classification technique based on Bayes' Theorem. It lets us calculate the probability of data belonging to a given class, given prior knowledge. Bayes' Theorem is: P(class|data) = P(data|class) · P(class) / P(data), where P(class|data) is the probability of the class given the provided data. To add a new training instance, NB only needs to update the relevant entries in its probability table.
Logistic Regression (incremental)
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable. Consider a model with two predictors x1 and x2 and a binary variable Y, and let p = P(Y = 1) denote the probability of belonging to the positive class. The relationship between these terms can be modeled with the following equation: p = 1 / (1 + e^(−(β0 + β1x1 + β2x2))). The parameters β0, β1, β2 can be determined by stochastic gradient descent using the logistic loss function.
4https://github.com/JozefStefanInstitute/ml-rapids
58
Figure 4: F1 score vs. inference time of different models for predicting LULC classes. *Denotes incremental algorithms.
Perceptron (incremental)
Perceptron is very similar to Logistic regression. It models a binary variable with the same activation function.
The only difference is in the cost function that is used for gradient descend. Batch learning methods We can observe that ml-rapid’s Na¨ıve Bayes, Hoeffding Tree, Batch learning methods learn from the whole training set Bagging of HT, Decision Trees, LGBM and Random Forest and do not have to rely on heuristics (e.g. Hoeffding bound) belong to the Pareto optimal set of algorithms according to or incremental approaches (like SGD) for building the model. the training time and F1 score. Regarding inference times The following batch methods have been tested: decision Logistic Regression, Decision Trees and Random Forest are trees, gradient boosting (LGBM), random forest, percep- the only Pareto optimal algorithms. The choice of algo- tron, multi-layer perceptron, and logistic regression [5]. rithm depends on the available processing power and time. For a system that has a lot of time and resources available, 4. RESULTS it would be best to use Random Forest as it has the high- est F1 score. In practice, this is not always feasible. For Results of the experiments are summarised in Figures 3, example, if the algorithm were used for an on-board system 4 and Table 1. Figures depict dependency of algorithm- on the satellite, we could not afford to save all the data and specific F1 score vs. its training and inference times. An would prefer to load only the model. With an incremental ideal algorithm would be located in the top left corner, algorithm, the data could be collected, processed and dis- achieving full F1 score with a training and inference time of carded while the acquired knowledge would be stored in the 0. Any algorithm that has no other algorithm in its top-left model. Another preference for HT would be in a wrapper quadrant (no algorithm is both more accurate and faster) feature selection algorithm [6]. This type of algorithms do belongs to a Pareto front, which means that this algorithm a lot of evaluations of the selected method. The main re- is optimal for a certain set of use-cases. sult is a subset of features that can later be used with other algorithms. The acquired set of features might be biased towards the method used, but the results would be obtained much faster. From the confusion matrix of the HT algorithm shown in Figure 5, we can see that shrubland is often wrongly classified as forest, bareland or grassland and vice versa. This is mainly due to the unclear distinction between these classes (e.g. shrubland can be anything between bareland and for- est) and poor ground truth data due to infrequent updates, low accuracy, and lack of detail (e.g. patch of land labeled as shrubland can also grassland and trees). The unclear dis- tinction between certain classes may also explain confusion between wetlands and shrubland or wetlands and grassland, as wetlands may be covered with grass or shrubs. The lack of detail also contributes to misclassification between grass- land and artificial surface, as not every small grassy area, such as park or lawn, is included in ground truth data. Fi- Figure 3: F1 score vs. training time of different nally, grass cultures, unused land overgrown by grass and models for predicting LULC classes. *Denotes in- rotation of crops are likely some of the reasons for confusion cremental algorithms. between cultivated land and grassland. 59 7. REFERENCES [1] D4.7 stream-learning validation report, May 2020. Perceptive Sentinel. 
[2] Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., et al. Sentinel-2: Esa’s optical high-resolution mission for gmes operational services. Remote sensing of Environment 120 (2012), 25–36. [3] Gómez, C., White, J. C., and Wulder, M. A. Optical remotely sensed time series data for land cover classification: A review. ISPRS Journal of Photogrammetry and Remote Sensing 116 (2016). [4] H2020 PereptiveSentinel Project. Eo-learn library. https://github.com/sentinel-hub/eo-learn. Accessed: 2019-09-06. [5] Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009. [6] Koprivec, F., Kenda, K., and Šircelj, B. Fastener feature selection for inference from earth observation Figure 5: Confusion matrix of HT based model for data. Entropy (Sep 2020). predicting LULC classes. [7] Koprivec, F., Peternelj, J., and Kenda, K. Feature Selection in Land-Cover Classification using Training Inference EO-learn. In Proc. 22th International Multiconference CA F1 time time (Ljubljana, Slovenia, 2019), vol. C, Institut ”Jožef LGBM 4.87 0.38 0.86 0.86 Stefan”, Ljubljana, pp. 37–40. Decision Tree 4.18 0.02 0.82 0.82 [8] Koprivec, F., Čerin, M., and Kenda, K. Crop Random Forest 7.53 0.14 0.87 0.87 Classification using Perceptive Sentinel. In Proc. 21th MLP 264.67 0.07 0.81 0.81 International Multiconference (Ljubljana, Slovenia, Logistic Regression 63.50 0.01 0.67 0.65 2018), vol. C, Institut ”Jožef Stefan”, Ljubljana, Perceptron 24.05 0.01 0.45 0.38 pp. 37–40. Hoeffding Tree* 0.44 0.06 0.79 0.79 [9] Oza, N. C. Online bagging and boosting. In 2005 Bagging of HT* 3.07 0.46 0.83 0.83 IEEE international conference on systems, man and Na¨ıve Bayes* 0.18 0.15 0.64 0.62 cybernetics (2005), vol. 3, Ieee, pp. 2340–2345. Logistic Regression* 0.31 0.08 0.15 0.07 [10] Slovenian ministry of agriculture. Mkgp - Perceptron* 0.33 0.07 0.14 0.04 portal. http://rkg.gov.si/. Accessed: 2020-08-11. [11] Valero, S., Morin, D., Inglada, J., Sepulcre, G., Table 1: Comparison of models for predicting LULC Arias, M., Hagolle, O., Dedieu, G., Bontemps, classes. *Denotes incremental algorithms. S., Defourny, P., and Koetz, B. Production of a dynamic cropland mask by processing remote sensing image series at high temporal and spatial resolutions. 5. CONCLUSIONS Remote Sensing 8(1) (2016), 55. In our approach we have concentrated on effective process- [12] Čerin, M., Koprivec, F., and Kenda, K. Early ing. Our goal was to provide methods and workflows which land cover classification with Sentinel 2 satellite can reduce the need for extensive hardware and processing images and temperature data. In Proc. 22th power. Our goal was focused on use cases where a near state- International Multiconference (Ljubljana, Slovenia, of-the-art accuracy can be achieved with only a fraction of 2019), vol. C, Institut ”Jožef Stefan”, Ljubljana, the processing power required by the state-of-the-art. We pp. 45–48. have researched stream mining algorithms. We have shown [13] that these algorithms, even if they are not the most accurate Waldner, F., Canto, G. S., and Defourny, P. Automated annual cropland mapping using or the fastest, take their place at the Pareto front in a multi- knowledge-based temporal features. ISPRS Journal of target environment, which means that some users might find Photogrammetry and Remote Sensing 110 (2015). 
them suitable for their needs and that they provide the best [14] results for particular computational demand. Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., and Fraundorfer, F. Deep learning in 6. ACKNOWLEDGMENTS remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing This work was supported by the Slovenian Research Agency Magazine 5, 4 (2017), 8–36. and the ICT program of the EC under project PerceptiveSen- tinel (H2020-EO-776115) and project EnviroLENS (H2020- DT-SPACE-821918). 60 Predicting bitcoin trend change using tweets Jakob Jelencic Artificial Intelligence Laboratory Jozef Stefan Institute and Jozef International Postgraduate School Ljubljana, Slovenia jakob.jelencic@ijs.si ABSTRACT by people’s trust in it. Which means that possible up or Predicting future is hard and challenging task. Predict- down trends could be predicted by understanding sentiment ing financial derivative that one can benefit from is even of people tweets related to Bitcoin and other cryptocurrencies. more challenging. The idea of this work is to use informa- Tweets data-set is combined with classical Open-High-Low- tion contained in tweets data-set combined with standard Close [OHLC] data-set for 5 minute time periods. OHLC Open-High-Low-Close [OHLC] data-set for trend prediction data-set contain information about opening and closing price of crypto-currency Bitcoin [XBT] in time period from 2019- of given time period, its maximum and minimum price during 10-01 to 2020-05-01. A lot of emphasis is put on text prepro- observed time period and sum of volume and number of cessing, which is then followed by deep learning models and transactions made [4]. This present additional information concluded with analysis of underlying embedding. Results how the market is behaving at any given point. were not as promising as one might hope for, but they present a good starting point for future work. In financial mathematics derivatives are usually modeled with some kind of stochastic process. Most commonly some 1. INTRODUCTION form of Brownian motion is used. In theory increment in Twitter is an American microblogging and social network- Brownian motion is distributed as N (µ, Σ) independent from ing service on which users post and interact with messages previous increment. This implies that prediction of a real known as ”tweets”. Registered users can post, like, and time price change of a derivative is not possible, so the target retweet tweets, but unregistered users can only read them. goal should be changed accordingly. Instead of predicting the Users access Twitter through its website interface, through impossible, the goal of this work is to predict a change in a Short Message Service (SMS) or its mobile-device application trend. Trend is calculated with exponential moving average, software. Tweets were originally restricted to 140 characters, application of it can be observed in Figure 1. but was doubled to 280 for non-CJK languages in Novem- ber 2017. People might post a message for a wide range of Definition: Exponential moving average: reasons, such as to state someone’s mood in a moment, to n−1 advertise one’s business, to comment on current events, or X EMA(TS , n) = α · ( (1 − α)iTSn−i ), to report an accident or disaster [5]. i=0 Bitcoin is a cryptocurrency. It is a decentralized digital 2 currency without a central bank or single administrator that α = . 
Figure 1: Example of exponential moving average.

Figure 2: Example of working dataset.

2. DATA DESCRIPTION
The collected tweets range from 2019-10-01 to 2020-05-01. We filtered the tweets by crypto-related hashtags. Originally the tweets contained multilingual data, but only the English ones were extracted. The data-set still resulted in more than 5,000,000 tweets over a little more than half a year. Dealing with such a big data-set has proven to be too difficult a task, but since a lot of tweets are just pure noise, the data-set can be reduced. The idea is to extract the tweets with the largest target audience. Since the data-set contains the number of each tweet author's friends and followers, we extracted the tweets with the maximum sum of both in each 5-minute period. Unfortunately, the crypto world is relatively anonymous, so there is no Warren Buffett-like personality to whom we could give extra weight.

Then we concatenated the reduced tweets with the 5-minute OHLC data-set. A snapshot can be observed in Figure 2. The column names should be pretty self-explanatory, except for "tw1", "tw2", "tw3", which stand for metadata about the tweets, and "ama", which stands for the current movement of the trend. Continuous features are then normalized, and "ama" is shifted one step into the future so that it forms the target variable. The regression task had the most success with predictions.

3. TWEETS PROCESSING
The aim of this chapter is to focus on processing tweets. Tweets differ from regular text data, since many of them contain hyperlinks, hashtags, abbreviations, grammar mistakes and so on. This excludes any pre-built preprocessing tools, like the one available in the deep learning library Tensorflow [1], which is used for building the deep learning models. In Figure 2 we can see an example of some tweets. The cleaning process was executed in the order stated below. For each tweet the following steps were executed:

• Escape characters were removed.
• The tweet was split by " ".
• All non-alphanumeric characters were removed, including "#".
• All characters were converted to lower case.
• Usual stop-words were removed.

At this point the data-set contained over 200,000 different tokens, which is far too sparse for such a limited data-set. The empirical cumulative distribution function was therefore calculated, and all tokens with fewer than 50 appearances were removed. The dictionary size is now 2150.

Another thing to consider is how to process numbers that appear in the text. Obviously a separate token for each number is not acceptable, since it would negate all the work done so far. The following mapping was applied to process numbers: 5 more tokens were created and numbers from a given interval were assigned the corresponding token.

• Small number: X < 1000.
• Medium number: X ∈ [1000, 10000).
• Semi big number: X ∈ [10000, 100000).
• Big number: X ∈ [100000, 1000000).
• Huge number: X ≥ 1000000.

An additional masking token was assigned for missing data. This wraps up the dictionary; its final length is 2156.
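A minimal sketch of the cleaning and number-bucketing steps described above; the stop-word list and the bucket token names are placeholders, not the exact ones used by the author.

```python
# Minimal sketch of the tweet cleaning and number-bucketing steps.
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "in"}   # placeholder list

def bucket_number(value):
    if value < 1_000: return "<small_number>"
    if value < 10_000: return "<medium_number>"
    if value < 100_000: return "<semi_big_number>"
    if value < 1_000_000: return "<big_number>"
    return "<huge_number>"

def clean_tweet(text):
    text = text.replace("\n", " ").replace("\t", " ")        # drop escape characters
    tokens = []
    for raw in text.split(" "):                               # split by " "
        tok = re.sub(r"[^0-9a-zA-Z]", "", raw).lower()        # keep alphanumerics, lower-case
        if not tok or tok in STOP_WORDS:                      # drop empties and stop-words
            continue
        tokens.append(bucket_number(float(tok)) if tok.isdigit() else tok)
    return tokens

print(clean_tweet("Bitcoin just broke $10000! #BTC to the moon"))
# ['bitcoin', 'just', 'broke', '<semi_big_number>', 'btc', 'moon']
```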
The last thing in processing tweets is to handle their length, since not all tweets are equally long. One idea is to take the maximum length of all tweets and mask the others so they all have the same length. Unfortunately this would take a lot of unnecessary space, which is a problem; a long tweet also does not necessarily mean an informative tweet. Figure 3 plots the empirical cumulative distribution function of the tweets' length.

Figure 3: Histogram of tweets' length.

No additional manipulation of tokens was done. It is known that the tokens "bitcoin" and "btc" mean the same and could be joined into one token, but they are left intact and the deep learning model will decide whether they are the same or not.

4. DEEP LEARNING MODELS
The obvious choice for text models are recurrent neural networks, more specifically Long Short-Term Memory [LSTM] recurrent networks [2]. They are usually combined with embedding layers, which transform a single token into a vector of arbitrary size [6]. Since the task at hand is predicting the future, there is no good benchmark metric or model which could serve as a threshold for our model's performance. So, in order to see whether the tweets contribute anything, we decided to build a shallow neural network on just the OHLC data, which serves as a benchmark model. 80% of the data-set was taken as the training set and the remainder was left out for validation; the split was the same for both models. Both times we used the Adam optimizer [3] and mean-squared error [MSE] as the loss function. Training was stopped as soon as the validation loss did not improve for 10 epochs. The batch size was 256.

Structure of the benchmark model:

• Input dense layer with 32 neurons.
• Stacked dense layer with 32 neurons.
• Stacked dense layer with 32 neurons.
• Output dense layer with 1 neuron.

Structure of the tweets model (a code sketch of this architecture is given at the end of this section):

• Input embedding layer of size 64 (tweets).
• Stacked LSTM layer with 128 neurons.
• Stacked LSTM layer with 128 neurons.
• Second input layer with 64 neurons (OHLC).
• Concatenation.
• Stacked dense layer with 64 neurons.
• Output dense layer with 1 neuron.

The loss curve of the benchmark model can be observed in Figure 4, while that of the tweets model can be observed in Figure 5. Orange represents the training set and blue the validation set. It is clear that the tweets model behaved a lot worse on the training set than the benchmark model, but on the test set it has a slightly lower MSE (benchmark: 13.78, tweets: 13.74). This implies that there is a lot of reserve in the fitting of the tweets model, since the difference between the training and validation loss is so big. That is good, since otherwise it would seem that the tweets do not contribute much to the prediction. It is also worth noting that the tweets model took much longer to learn, around 380 epochs compared to the benchmark model's 40.

Figure 4: Loss process of benchmark model.

Figure 5: Loss process of tweets model.
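A minimal Keras sketch of the two-input tweets model listed above, under the stated vocabulary size of 2156; the sequence length, the number of OHLC features and the activations are assumptions, since the paper does not specify them.

```python
# Minimal Keras sketch of the two-input tweets model described above.
from tensorflow.keras import layers, Model

vocab_size, seq_len, n_ohlc = 2156, 60, 7    # seq_len and n_ohlc are placeholders

tweet_in = layers.Input(shape=(seq_len,), name="tweets")
x = layers.Embedding(vocab_size, 64, mask_zero=True)(tweet_in)   # embedding of size 64
x = layers.LSTM(128, return_sequences=True)(x)                   # stacked LSTM, 128 units
x = layers.LSTM(128)(x)                                          # second LSTM, 128 units

ohlc_in = layers.Input(shape=(n_ohlc,), name="ohlc")
o = layers.Dense(64, activation="relu")(ohlc_in)                 # second input branch

h = layers.concatenate([x, o])                                   # concatenation
h = layers.Dense(64, activation="relu")(h)
out = layers.Dense(1)(h)                                         # regression output

model = Model([tweet_in, ohlc_in], out)
model.compile(optimizer="adam", loss="mse")                      # Adam + MSE, as in the text
model.summary()
```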
5. ANALYSIS OF UNDERLYING EMBEDDING MATRIX
We extracted the underlying embedding matrix from the tweets model. Since the model tried to minimize the mean-squared error [MSE] between the predicted and the actual trend, the embedding matrix was adjusted according to the derivative of the MSE. For the analysis we use cosine similarity as a metric. If two words are close in the embedding matrix, this does not mean that they are semantically similar in the sense of everyday language; it means that they are similar in the context of Bitcoin trend prediction. For example, if the model converged perfectly and the tokens "bitcoin" and "eth" had a cosine similarity near 1, that would mean that they both have a similar impact on the Bitcoin trend, which is not so hard to believe, since it is known that all crypto-currencies are heavily correlated with one another. Table 1 shows the cosine similarity of some of the most common tokens in the dictionary.

Token pair                     Similarity
bitcoin, crypto                0.472
blockchain, entrepreneur       0.561
crypto, cryptocurrency         0.519
cryptocurrency, blockchain     0.560
volume, social media           0.508
ethereum, blockchain           0.557

Table 1: Cosine similarity pairs of most common tokens.

We cannot be completely satisfied with the results, but for such a limited data-set they are not that bad. As with any embedding evaluation, there is a certain amount of subjectivity in judging what is good and what is not.

In order to gain a better perspective of the obtained embedding, we computed a T-distributed stochastic neighbor embedding projection to 2 dimensions and plotted the 100 nearest pairs. The projection can be observed in Figure 6.

Figure 6: TSNE projection of embedding matrix.
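A minimal scikit-learn sketch of this analysis, using a random placeholder matrix and a toy vocabulary in place of the learned embedding and dictionary.

```python
# Minimal sketch of the embedding analysis: cosine similarities between token
# vectors and a 2-D t-SNE projection. The embedding and vocabulary are placeholders.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
vocab = ["bitcoin", "crypto", "blockchain", "eth", "volume"]      # placeholder tokens
embedding = rng.normal(size=(len(vocab), 64))                     # placeholder 64-d vectors

sim = cosine_similarity(embedding)                                # pairwise cosine similarities
i, j = vocab.index("bitcoin"), vocab.index("crypto")
print(f"sim(bitcoin, crypto) = {sim[i, j]:.3f}")

proj = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embedding)
print(proj.shape)                                                 # (len(vocab), 2)
```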
6. CONCLUSION
While the obtained model cannot serve as a production model for automatic trading, it presents a nice opportunity for future work. We will continue to collect tweets and hopefully, with time, build a more accurate data-set and, with some hyper-parameter tuning of the tweets models, achieve improved prediction.

7. ACKNOWLEDGMENTS
This work was financially supported by the Slovenian Research Agency.

8. REFERENCES
[1] TensorFlow. https://www.tensorflow.org/.
[2] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[3] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. 2014. https://arxiv.org/abs/1412.6980.
[4] J. J. Murphy. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications. New York Institute of Finance Series. New York Institute of Finance, 1999.
[5] R. Nugroho, C. Paris, S. Nepal, J. Yang, and W. Zhao. A survey of recent methods on deriving topics from Twitter: algorithm to evaluation. Knowledge and Information Systems, pages 1–35, 2020.
[6] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, NJ, third edition, 2010.


Large-Scale Cargo Distribution

Luka Stopar, PhD, Researcher, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija, luka.stopar@ijs.si
Luka Bradesko, PhD, Researcher, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija, luka.bradesko@ijs.si
Tobias Jacobs, PhD, Senior Researcher, NEC Laboratories Europe GmbH, Kurfürsten-Anlage 36, 69115 Heidelberg, tobias.jacobs@neclab.eu
Azur Kurbašić, Researcher, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija, azurkurbasic@gmail.com
Miha Cimperman, PhD, Researcher, Jozef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenija, miha.cimperman@ijs.si

ABSTRACT
This study focuses on the design and development of methods for generating cargo distribution plans for large-scale logistics networks. It uses data from three large logistics operators while focusing on cross-border logistics operations using one large graph. The approach uses a three-step methodology to first represent the logistics infrastructure as a graph, then partition the graph into smaller regions, and finally generate cargo distribution plans for each individual region. Regional graphs are extracted from the initial graph representation by spectral clustering and are then further used for computing the distribution plan.

The approach introduces methods for each of the modelling steps. The proposed regionalization of a large logistics infrastructure for generating partial plans enables scaling to thousands of drop-off locations. Results also show that the proposed approach scales better than the state of the art, while preserving the quality of the solution.

Our methodology is suited to address the main challenge in transforming rigid, large logistics infrastructures into dynamic, just-in-time, and point-to-point delivery-oriented logistics operations.

Keywords
Logistics, graph construction, vehicle routing problem, spectral clustering, optimization heuristics, discrete optimization.

1. INTRODUCTION
The complexity of operations in the logistics sector is growing, and so is the level of digitalization of the industry. With data-driven logistics, dynamic optimization of basic logistics processes is at the forefront of the next generation of logistics services.

Finding optimal routes for vehicles is a problem which has been studied for many decades from a theoretical and practical point of view; see [2] for a survey. The most prominent case is the Traveling Salesperson Problem (TSP), where the shortest route for visiting n locations using a single vehicle has to be determined. The Vehicle Routing Problem (VRP) is typically associated with a generalization of TSP where multiple vehicles are available. This class of routing problems is notoriously hard; it not only falls into the class of NP-complete problems, but in practice it cannot be solved optimally even for moderate instance sizes.

Nevertheless, due to its practical importance, many heuristics and approximation algorithms for the vehicle routing problem have been proposed. Bertsimas et al. [3] propose an integer-programming-based formulation of the taxi routing problem and present a heuristic based on a max-flow formulation, applied in a framework which allows serving 25,000 customers per hour. A heuristic based on neighborhood search has been presented by Kytöjoki et al. in [4] and evaluated on instances with up to 20,000 customers. A large number of nature-inspired optimization methods have been applied to VRP, including genetic algorithms [7], particle swarm optimization [8], and honey bees mating optimization [9].

The particular approach of partitioning the input graph for VRP has been proposed by Ruhan et al. [5]. Here k-means clustering is combined with a re-balancing algorithm to obtain areas with a balanced number of customers. Bent et al. study the benefits and limitations of vehicle- and customer-based decomposition schemes [6], demonstrating better performance with the latter.

In this paper, we present a methodology for large-scale parcel distribution that combines optimization methods with large-graph clustering. The paper is structured as follows. In Section 2, we present the technical details of the proposed methodology; we explain the algorithms and data structures used in each of the steps and discuss the interfaces required to link the steps into a working system. In Section 3, we demonstrate the performance of our methodology on two real-world use cases and compare it to the state of the art on synthetic datasets. Finally, in Section 4 we summarize the key findings, including the strengths and limitations of the proposed approach.
2. METHODOLOGY

2.1 Overview
In this section, we present the details of the proposed methodology for large-scale cargo distribution planning. The methodology, illustrated in Figure 1, uses a three-step, divide-and-conquer approach to cargo distribution, where we reduce the size of the optimization problem by (i) abstracting the physical infrastructure into a sparse graph representation, (ii) partitioning the graph into smaller chunks (i.e. regions) and (iii) planning the distribution in each region independently. This allows us to run the optimization on large graphs while producing better local results.

Figure 1: Three-step methodology for logistics optimization.

Initially, we create a representation of the physical infrastructure as an abstract graph, representing each pickup and drop-off location as a node, with edges being the shortest road connections between them. Next, we partition the abstract graph with a spectral partitioning approach. The method is an adaptation of [10] to graphs, where we use the first k eigenvalues and eigenvectors of the graph's Laplacian to construct the partitions. In each partition, we construct a distribution plan using an iterative search algorithm. From an initial solution, the algorithm constructs a linear search path by changing the position of a node in the distribution plan. To avoid local minima, it uses design-time blacklist rules which prevent the algorithm from oscillating in a local neighborhood. Each step is described in more detail in the following sections.

2.2 Graph Construction
For graph construction, the Dijkstra SPF algorithm [11] was applied to identify neighbor relationships between the nodes in the OpenStreetMaps (OSM) dataset and construct the graph representation. By mapping post offices to the closest node on OSM, we tag the post office nodes for the SPF search.

The search frontier is a baseline for the SPF procedure and represents the list of nodes whose graph neighbors are to be searched. The final graph is built by iterating with the SPF procedure through the list of all post offices in the physical infrastructure (graph nodes) and consolidating the results into the final sparse matrix; each iteration computes one row of the matrix.

2.3 Graph Partitioning
The partitioning step first represents the graph as a transition rate matrix (Q)_ij = q_ij, where q_ij is the rate of going from node i to node j and is computed as the inverse of the minimal travel time (obtained from step 1) between the two nodes. With this approach, the rate of going from i to j is expressed as the number of possible trips that a driver can make between the two locations in one hour.

The algorithm works by approximating the minimal k-cut of the graph, removing its edges and thus reducing the graph to k disconnected components. We adapt a spectral partitioning algorithm introduced in [10] to graphs. The algorithm first symmetrizes the transition rate matrix as Q_s = (Q + Q^T) / 2, to ensure real-valued eigenvalues, and computes its Laplacian

    L = I − diag(Q_s · 1)^(−1) · Q_s,

where 1 denotes the all-ones vector. Next, it computes the k eigenvectors of L corresponding to the smallest k eigenvalues. It then discards the eigenvector corresponding to λ1 = 0 and assembles the eigenvectors v2, v3, …, vk, corresponding to eigenvalues λ2 ≤ λ3 ≤ … ≤ λk, as columns of a matrix V. The rows of V are then normalized and used as input to the k-means clustering algorithm, which constructs the final partitions.
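A minimal numpy/scikit-learn sketch of the partitioning step as described in this subsection; the transition-rate matrix Q below is a random placeholder rather than one derived from real travel times.

```python
# Minimal sketch of the spectral partitioning step from Section 2.3.
import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(Q, k):
    Qs = 0.5 * (Q + Q.T)                                       # Qs = (Q + Q^T) / 2
    L = np.eye(len(Qs)) - np.diag(1.0 / Qs.sum(axis=1)) @ Qs   # L = I - diag(Qs*1)^-1 Qs
    eigvals, eigvecs = np.linalg.eig(L)
    order = np.argsort(eigvals.real)                           # ascending eigenvalues
    V = eigvecs[:, order[1:k]].real                            # drop v1 (lambda_1 = 0), keep v2..vk
    V = V / np.linalg.norm(V, axis=1, keepdims=True)           # row-normalise
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)      # final partitions

rng = np.random.default_rng(0)
Q = rng.random((12, 12))                                       # toy rate matrix
np.fill_diagonal(Q, 0.0)
print(spectral_partition(Q, k=3))
```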
2.4 Vehicle Routing
The vehicle routing step uses Tabu search [12] to construct the distribution plan. Starting with an initial solution, Tabu search constructs a linear search path by iteratively improving the solution in a greedy fashion until a stopping criterion is met. To avoid converging to local minima, Tabu search blacklists recent moves and/or solutions for one or more iterations using design-time rules.

In each iteration, the search process generates new candidate solutions by removing a node from its current route and placing it after one of the other nodes in the graph, possibly on a different route. To mitigate the scaling problems associated with generating O(n^2) possible moves in each step, the algorithm only considers a handful of moves. Specifically, the probability of considering placing node i after node j is proportional to the inverse of the Euclidean distance d(i, j) between the nodes.

Like other local search algorithms, Tabu search starts from an initial feasible solution, which is constructed using a construction-based heuristic algorithm. The heuristic procedure iteratively selects a node and places it after one of the other nodes in a way that minimizes the travel distance. The procedure iterates until all values are initialized.
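A minimal sketch of the relocation neighbourhood with inverse-distance sampling and a short tabu list, as described above; the distance matrix, initial routes and stopping rule are placeholders, and the acceptance rule is simplified with respect to a full Tabu search.

```python
# Minimal sketch of the relocation move and tabu list from Section 2.4.
import numpy as np

rng = np.random.default_rng(0)
n = 30
pts = rng.random((n, 2)) * 100
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)    # Euclidean distances

def plan_cost(routes):
    return sum(dist[a, b] for r in routes for a, b in zip(r, r[1:]))

def propose_move(routes, n_candidates=20):
    """Relocate node i after node j, with P(j) proportional to 1/d(i, j)."""
    best, nodes = None, [v for r in routes for v in r]
    for _ in range(n_candidates):
        i = nodes[rng.integers(len(nodes))]
        others = [v for v in nodes if v != i]
        w = 1.0 / dist[i, others]
        j = others[rng.choice(len(others), p=w / w.sum())]
        new = [[v for v in r if v != i] for r in routes]       # remove i ...
        for r in new:
            if j in r:
                r.insert(r.index(j) + 1, i)                    # ... and re-insert after j
                break
        cand = (plan_cost(new), (i, j), new)
        if best is None or cand[0] < best[0]:
            best = cand
    return best

routes = [list(range(0, 15)), list(range(15, 30))]             # placeholder initial plan
tabu, tabu_len = [], max(1, int(0.05 * n))                     # tabu list: 5% of locations
for _ in range(200):                                           # placeholder stopping rule
    cost, move, new_routes = propose_move(routes)
    if move in tabu:
        continue                                               # skip blacklisted moves
    routes, tabu = new_routes, (tabu + [move])[-tabu_len:]
print(round(plan_cost(routes), 1))
```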
3. DEMONSTRATION AND RESULTS
In this section, we demonstrate the effectiveness of the proposed methodology on two real-world use cases and compare the methodology to the state of the art in vehicle routing. The first pilot included two national logistics operators, namely Hrvatska Pošta (Croatia) and Pošta Slovenije (Slovenia). As the main focus of future logistics in Europe is to operate as one large homogeneous logistics infrastructure, the two infrastructures were considered as one logistics graph. The second pilot included the Hellenic Post (Greece) graph representation and data.

In initial testing, simulated data were used for modelling parcel flow with graph abstraction, graph processing, and optimization responses. The final instances were constructed from real infrastructure data to test the functionalities. The results are presented in the following subsections.

3.1 Evaluation on Large Synthetic Graphs
We now demonstrate the scalability of the proposed methodology by comparing its performance to the performance of the baseline Tabu search algorithm on synthetic graphs of various sizes, comparing both algorithms' running time and the total travel time in the generated cargo distribution plan. Our results show that the proposed methodology enables fast generation of distribution plans on graphs of up to 10,000 nodes, while also improving the quality of the generated result.

We simulate the logistics infrastructure by generating random planar graphs representing the road network and drop-off locations. First, we generate a cluster of n drop-off locations by sampling a Gaussian distribution around k randomly chosen locations. Next, we connect the locations with a Delaunay triangulation [13], resulting in a planar graph. We compute the distance between two locations using the Euclidean metric and assign a 50 km/h speed limit to intra-city edges and a 90 km/h speed limit to inter-city edges. Part of a synthetic graph with 10,000 nodes is shown in Figure 2.

Figure 2: Representation of simulated graph with 10,000 nodes.

Table 1 summarizes the computation times of the proposed method along with the quality of the generated distribution plan and compares the results to Tabu search without prior clustering. We measure the quality of the generated distribution plan as the distance travelled by all vehicles according to the plan. In each row, we show the average of 10 trials on 10 different graphs.

Graph size   Proposed methodology                Tabu search
             Running time   Travel dist. [km]    Running time   Travel dist. [km]
1000         6.07 min       64.7k                0.76 min       85.5k
2000         10.07 min      122.9k               2.98 min       160.8k
5000         30.14 min      259.2k               60.04 min      428.2k
7000         39.29 min      377.9k               166.79 min     577.1k
10000        55.64 min      552.2k               10.78 h        845.1k

Table 1: Comparison of efficiency of Tabu search and proposed methodology.

For the experiments we used a Tabu list with a length of 5% of the entities (locations) that the algorithm must check, and terminated the algorithm when there was no improvement in the solution for more than 10 seconds.

On large graphs, we see that the proposed methodology significantly reduces the computation time while preserving the quality of the result. The proposed methodology reduces the computation time on graphs larger than 5k nodes, providing a substantial saving of 91% on graphs with 10k nodes. We also observe that the quality of the output slightly improved when applying our divide-and-conquer methodology over Tabu search. The improvement ranges between 23% and 40% and is largely attributed to the significantly reduced search space in the partitions as compared to the entire graph.

3.2 Testing the instances on pilot use cases
The methods presented and tested on synthetic graphs were also tested on data from two pilot scenarios, namely Slovenian-Croatian post (Pošta Slovenije & Hrvatska Pošta) and Hellenic Post (Greece). In the pilot use cases, the analytical pipeline is used to process ad-hoc events in the logistics infrastructure. The ad-hoc events were structured into three categories: a new parcel request (ad-hoc order), an event on distribution objects (vehicle breakdown) and events related to changes at border crossings – border closed (cross-border event).

The instances built on simulated data were loaded with OpenStreetMaps data for abstraction of the real infrastructure description into the graph representation, as illustrated in Figure 4.

Figure 4: A region of Pošta Slovenije graph representation, using OpenStreetMap.

A similar approach was used for the case of Hellenic Post, where the OSM data for the region of Greece were loaded into the graph abstraction instance. For traffic modelling of the vehicles, the SUMO simulator [14] was used with the regional map. For graph manipulations, the SIoT infrastructure was used to generate the social graph when an ad-hoc event was triggered. The social graph represented all entities (vehicles, etc.) in the infrastructure that are in scope to be included in event processing. In this way, distribution objects were mapped to the physical infrastructure for loading into the graph representation for further optimization and distribution plan estimation.
Figure 4: Processing an ad-hoc order in a pilot scenario, using the SUMO simulator.

An example of the social graph generation and ad-hoc event processing is presented in Figure 4, where a new ad-hoc request is processed by SIoT and the analytical pipeline.

The results show that abstracting the logistics infrastructure and clustering the graph into regional structures enabled real-time processing of complex events in the logistics infrastructure. The response time for processing an ad-hoc event in regions of between 50 and 100 nodes was between 20 and 30 seconds, which is relatively fast compared to alternatively processing 1000 nodes or more.

4. CONCLUSION
In this paper, we presented an approach for generating cargo distribution plans on large logistics infrastructures. Our results show that the proposed approach can scale to graphs of up to 10,000 nodes in practical time while preserving and even slightly improving the quality of the result.

Since the main use case of logistics is point-to-point regional delivery and just-in-time delivery, these new services are oriented exactly to regional logistics optimization. More importantly, the approach makes it possible to process ad-hoc events, such as new parcel delivery requests, events related to distribution vehicles, or events related to the infrastructure. The ad-hoc event processing includes manipulating the graph representation and running the optimization methods in real time. Since our method clusters and regionalizes large graphs, such an approach can enable real-time processing of events on large graphs by limiting the changes to the affected regional parts of the infrastructure.

However, while our approach can be combined with several state-of-the-art methods, its main drawback remains the inability to generate inter-region routes, making it suitable only for local and last-mile distribution plans. Future work will focus on investigating the generation of inter-region plans and connecting multiple regions into one distribution plan. Some of the options include introducing border checkpoints where cargo can be handed over to vehicles of neighboring regions, using dedicated inter-region "highway" channels, and using dedicated vehicles for cross-region deliveries.
5. ACKNOWLEDGEMENTS
This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 769141, project COG-LO (COGnitive Logistics Operations through secure, dynamic and ad-hoc collaborative networks).

6. REFERENCES
[1] European Commission. (2015). Fact-finding studies in support of the development of an EU strategy for freight transport logistics. Lot 1: Analysis of the EU logistics sector.
[2] Kumar, Suresh Nanda, and Ramasamy Panneerselvam. "A survey on the vehicle routing problem and its variants." (2012).
[3] Bertsimas, Dimitris, Patrick Jaillet, and Sébastien Martin. "Online vehicle routing: The edge of optimization in large-scale applications." Operations Research 67.1 (2019): 143-162.
[4] Kytöjoki, Jari, et al. "An efficient variable neighborhood search heuristic for very large scale vehicle routing problems." Computers & Operations Research 34.9 (2007): 2743-2757.
[5] He, Ruhan, et al. "Balanced k-means algorithm for partitioning areas in large-scale vehicle routing problem." 2009 Third International Symposium on Intelligent Information Technology Application. Vol. 3. IEEE, 2009.
[6] Bent, Russell, and Pascal Van Hentenryck. "Spatial, temporal, and hybrid decompositions for large-scale vehicle routing with time windows." International Conference on Principles and Practice of Constraint Programming. Springer, Berlin, Heidelberg, 2010.
[7] Razali, Noraini Mohd. "An efficient genetic algorithm for large scale vehicle routing problem subject to precedence constraints." Procedia - Social and Behavioral Sciences 195 (2015): 1922-1931.
[8] Marinakis, Yannis, Magdalene Marinaki, and Georgios Dounias. "A hybrid particle swarm optimization algorithm for the vehicle routing problem." Engineering Applications of Artificial Intelligence 23.4 (2010): 463-472.
[9] Marinakis, Yannis, Magdalene Marinaki, and Georgios Dounias. "Honey bees mating optimization algorithm for the vehicle routing problem." Nature Inspired Cooperative Strategies for Optimization (NICSO 2007). Springer, Berlin, Heidelberg, 2008. 139-148.
[10] Ng, A., Jordan, M., and Weiss, Y. "On Spectral Clustering: Analysis and an algorithm". Advances in Neural Information Processing Systems. MIT Press, 2001. 849-856.
[11] Dijkstra, E. W. A note on two problems in connexion with graphs. Numerische Mathematik 1(1), 269–271, 1959.
[12] Fred Glover and Manuel Laguna. Handbook of Combinatorial Optimization, Vol. 3, 1998.
[13] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications, Third Edition, 2008.
[14] http://sumo.sourceforge.net


Amazon forest fire detection with an active learning approach

Matej Čerin, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia, matej.cerin@ijs.si
Klemen Kenda, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia, klemen.kenda@ijs.si

ABSTRACT
Wildfires are a growing problem in the world. With climate change, fires have a larger range and are harder to put down. It is therefore important to find a way to detect and monitor fires in real time. In this paper, we explain how satellite images can be combined with active learning to obtain an accurate classifier for forest fires. To build the classifier we used an active-learning-like approach: we trained the classifier on one labeled image, then used the classifier to classify a set of images, manually inspected the images, relabeled wrongly classified examples and built a new classifier. In the paper we show that in a few iteration steps we can get a classifier that identifies wildfires with good accuracy.

Keywords
remote sensing, earth observation, active learning, rain forest, wildfires, machine learning, feature selection, classification

1. INTRODUCTION
In recent years wildfires have become a growing problem for the world. Each year the number of forest fires around the world grows; recently we have seen a growing number of fires in the Amazon, Australia, Africa and Siberia. Because of global warming and high temperatures, wildfires have a bigger range and are also harder to put out. Forest fires are partially responsible for air pollution [12] and for the loss of habitat for animals. The Amazon rain forest is also called the lungs of the world because of the oxygen production by its trees; the loss of forest is also connected to a higher chance of floods and landslides [6]. Therefore the classification and monitoring of wildfires is an important task. It is important to know the time series of the spread of a fire: with that knowledge we can create models for future fire events and plan measures in case of wildfire.

Satellite images are a good source for the observation of land type [5], and they could therefore be used for monitoring forest fires. Fires can be detected on satellite images, but the area of the Amazon is big and it would take a lot of time to manually label areas burned by forest fires. Therefore we should develop an algorithm that can detect fires.

There are already existing algorithms for fire detection using satellite images [6, 11]; they inspect changes on satellite images to detect fires. Our solution to the problem is to use machine learning. Because we do not have a prepared labeled data-set, an active-learning-like approach is our next candidate.

Active learning is the approach used when labeled data are unavailable and labeling data is too expensive or time-consuming. The algorithm starts with a small labeled data set and then uses its predictions to train itself again; that way the algorithm can teach itself. Algorithms usually need additional input for some data points. In these cases, a human should label those data, and the algorithm can then correct its predictions. The active learning approach is used in many use cases (speech recognition, information extraction, classification, ...) and over the years it has proved to work relatively well [8].

In this paper we use an active-learning-like approach to classify wildfires. Following the principle of active learning, we label a small subset of the data and then train a classifier. We then manually check the classification results and correct the wrongly classified examples, use the new, bigger data-set to train a new classifier, and continue with iterations until we are satisfied with the results. That way we can iteratively get a good classifier without labeling huge amounts of data.
2. DATA

2.1 Data Acquisition
In this article we use data from the ESA Sentinel-2 mission [3]. The Sentinel-2 mission produces satellite images in 13 different spectral bands, with wavelengths of observed light from approximately 440 nm to 2200 nm. The spatial resolution is between 10 and 60 m. The mission consists of two satellites that circle the Earth with a 180° phase; one point on the Earth's surface is visited at least once every five days. In the future we could also use some other satellite data sources, such as those available at www.planet.com [1]. Those data have a revisit time of 1 day and might be an even better candidate for accurate monitoring of wildfires.

To download the data we use the eo-learn library [9], which integrates the sentinelhub [10] library used to access satellite data. Data were downloaded for the year 2019, with a spatial resolution of 30 m. The 30 m resolution was chosen because burned areas usually extend over a much bigger area than 30 m, so a higher resolution would not help us identify forest fires, while the processing of each image would take significantly more time.

2.2 Data Preprocessing
ESA already performs most of the preprocessing steps, such as atmospheric reflectance correction and projection [4], so the data are already clean and ready for use. For our experimentation purposes we filtered out clouds; for that purpose we used models available in the eo-learn library.

In our experiments we used all spectral bands, but the earth observation community has developed many different indices that can be calculated from the raw spectral bands and used as features in machine learning experiments. The indices that we used are NDVI, SAVI, EVI, NDWI, and NBR, defined in papers [7, 2]. As our feature vector we used all 13 raw bands and the mentioned indices.
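As an illustration of this feature construction, the following minimal numpy sketch computes two of the listed indices (NDVI and NBR) from Sentinel-2 reflectance bands; the band arrays below are random placeholders.

```python
# Minimal sketch of two spectral indices used as features (NDVI and NBR).
import numpy as np

def ndvi(nir, red):
    return (nir - red) / (nir + red + 1e-8)           # NDVI from B8 (NIR) and B4 (red)

def nbr(nir, swir):
    return (nir - swir) / (nir + swir + 1e-8)         # NBR from B8 (NIR) and B12 (SWIR)

rng = np.random.default_rng(0)
bands = {name: rng.random((100, 100)) for name in ("B04", "B08", "B12")}  # placeholders

features = np.stack([ndvi(bands["B08"], bands["B04"]),
                     nbr(bands["B08"], bands["B12"])], axis=-1)
print(features.shape)                                  # (100, 100, 2)
```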
The other figure that we checked was image color images of the same area before, during and with RGB colors plotted Sentinel-2 bands 12, 11, and 3 (false after the fire. These kinds of images can be used to color). Here most of the image is usually in shades of green. manually determine burned areas. The burned area is dark gray color and the area currently burning is yellow or orange (Figure 2). With those two images, we have no problem checking if the area is burned or only images, where the classifier classified fire. That is be- not. cause we noticed that the classifier already, in the beginning, finds fire, but it picked up some other areas and objects as We experimented with two different approaches. In the first fire as well. Therefore we need to find those images and label approach, we evaluated the results of classification for each them as not fire. pixel and in the second experiment, we evaluated the aver- age result for a bigger area determined with the clustering 4. We used a false-positive set to add to data-set the pix- algorithm. els that the classifier classified wrongly and true positive examples to keep the data-set balanced. We chose in each The classifier used in our experiment was logistic regression. iteration the two values for the probability of prediction in We used it because it is quite an accurate classifier for earth logistic regression. The first value was used to determine in observation and it can assess how strong the prediction is. false-positive images to find pixels that were classified with a probability above that value to add those pixels in the data set. And the second value was used to find pixels that 3.1 Experiment 1 contained forest fire. We changed those values because the First, we manually searched the area of the Amazon forest to algorithm is unreliable in the first iterations and low value in find the first satellite image with a forest fire. Then we used the images with fire would pick up a lot of noise in the data that satellite image and labeled 270 pixels as fire area and set. But with each iteration the algorithm became more 270 pixels as not fire area. We trained the logistic regression reliable, therefore we could pick lower probability without classifier and used it as our initial classifier in our iteration. much noise. The values are shown in the Table 1. The iteration steps in our experiment were: 1. Use a classifier and classify pixels of a random images of 3.2 Experiment 2 the Amazon rain forest. The formation of the initial classifier and the first three steps in that experiment were the same as in the first experiment. 2. We took images that the classifier would classify with a forest fire. The images were classified as containing a burned Additional steps in the experiment are: area if at least 3 % of pixels on the image were classified as 4. For the evaluation of the classifier, we first made cluster- fire. ing with the K-Means algorithm to group similar pixels on each image. The idea of that step is to use a homogeneous 3. We checked those images and manually assigned them group of pixels that probably represent the same ground into two sets (true-positive and false-positive). We checked cower. 
3.2 Experiment 2
The formation of the initial classifier and the first three steps in this experiment were the same as in the first experiment. The additional steps in the experiment, sketched in code after Table 2 below, are:

4. For the evaluation of the classifier, we first cluster the pixels of each image with the K-Means algorithm to group similar pixels. The idea of this step is to use homogeneous groups of pixels that probably represent the same ground cover. These steps are useful because we noticed that K-Means usually grouped fire areas into one or two clusters. We clustered the pixels into 6 clusters; that number was chosen because on most images it split the area in such a way that clusters with fire were separated from the unburned area, while at the same time it did not split the same ground types into too many clusters.

Figure 2: The figure shows how clustering groups different pixels. The burned area is all in one cluster.

5. Calculate the average probability of a pixel representing forest fire for each cluster.

6. To choose which pixels to add to the data-set, we once again determined two values. They define the minimum average pixel probability a cluster must have for its pixels to be added to the data set. The values used in each iteration are presented in Table 2.

Iteration   FP     TP
1           0.0    0.80
2           0.4    0.70
3           0.4    0.70
4           0.5    0.60
5           0.5    0.60
6           0.5    0.50

Table 1: The table shows the values of the minimum average probability of a pixel being burned area for false-positive images (FP) and true-positive images (TP).

Iteration   FP     TP
1           -      0.75
2           0.5    0.75
3           0.5    0.60
4           0.5    0.60
5           0.5    0.60
6           0.5    0.50

Table 2: The table shows the values of the minimum average probability in the cluster for false-positive images (FP) and true-positive images (TP).
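A minimal scikit-learn sketch of the clustering-based selection in steps 4 to 6, using placeholder pixel features and probabilities in place of a real image and classifier output.

```python
# Minimal sketch of the cluster-based selection in Experiment 2 (steps 4-6).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pixels = rng.normal(size=(64 * 64, 18))                 # one image, flattened (placeholder)
proba = rng.random(64 * 64)                             # fire probability per pixel (placeholder)

labels = KMeans(n_clusters=6, n_init=10).fit_predict(pixels)              # step 4
cluster_proba = np.array([proba[labels == c].mean() for c in range(6)])   # step 5

tp_thr = 0.75                                           # step 6: threshold from Table 2
selected = np.isin(labels, np.where(cluster_proba > tp_thr)[0])
print("pixels added:", int(selected.sum()))
```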
https://www.esa.int/Our_Activities/Observing_ that the field might be on the place that was previously the _ Earth / Copernicus / Sentinel - 2 / Satellite _ burned and the algorithm still pick that up even though it constellation. Accessed 13 August 2018. was not visible from the imagery to us. [4] ESA. https : / / sentinel . esa . int / web / sentinel / 5. CONCLUSIONS user-guides/sentinel-2-msi/processing-levels/ level-2. Accessed 13 August 2018. The approach with active learning seems promising and we can get relatively good classifiers in a short time. That way [5] Filip Koprivec, Matej Čerin, and Klemen Kenda. “Crop we could train a classifier for any classification task of satel- classification using PerceptiveSentinel”. In: (Oct. 2018). lite images. With that approach we do not need to check all [6] Rosa Lasaponara, Biagio Tucci, and Luciana Gher- images as we would if we would like to label all the data by mandi. “On the Use of Satellite Sentinel 2 Data for hand. In the end, we get a relatively good classifier. Automatic Mapping of Burnt Areas and Burn Sever- ity”. In: Sustainability 10 (Oct. 2018), p. 3889. doi: In this paper, we showed that it is possible in a relatively 10.3390/su10113889. small number of iterations to get a good and reliable clas- [7] David Roy, Luigi Boschetti, and S.N. Trigg. “Remote sifier of forest fires. Because satellite images are more ac- Sensing of Fire Severity: Assessing the Performance cessible in last years than previously it could give us almost of the Normalized Burn Ratio”. In: Geoscience and real-time insight in the Amazon rain forest. Remote Sensing Letters, IEEE 3 (Feb. 2006), pp. 112– 116. doi: 10.1109/LGRS.2005.858485. In the feature one could use other satellite sources with bet- ter time-resolution to monitor wildfires. That way we could [8] Burr Settles. “Active Learning Literature Survey”. In: get more accurate view on the spread of fires. (July 2010). [9] Sinergise. https://github.com/sentinel- hub/eo- 6. ACKNOWLEDGMENTS learn. Accessed 23 August 2019. This work was supported by the Slovenian Research Agency [10] Sinergise. https://github.com/sentinel-hub/sentinelhub-and the ICT program of the EC under projects enviroLENS py. Accessed 14 August 2018. (H2020-DT-SPACE-821918) and PerceptiveSentinel (H2020- [11] Mihai Tanase et al. “Burned Area Detection and Map- EO-776115). The authors would like to thank Sinergise for ping: Intercomparison of Sentinel-1 and Sentinel-2 Based their contribution to EO-learn library along with all help Algorithms over Tropical Africa”. In: Remote Sensing with data analysis. 12 (Jan. 2020), p. 334. doi: 10.3390/rs12020334. References [12] G. R. van der Werf et al. “Global fire emissions es- timates during 1997–2016”. In: Earth System Science [1] https : / / www . planet . com/. Accessed 1 September Data 9.2 (2017), pp. 697–720. 2020 . doi: 10 . 5194 / essd - 9- 697- 2017. url: https://essd.copernicus.org/ articles/9/697/2017/. 72 Indeks avtorjev / Author index Andrej Bauer ................................................................................................................................................................................ 53 Bradeško Luka ............................................................................................................................................................................. 
Brank Janez ..... 53
Čerin Matej ..... 69
Cimperman Miha ..... 65
Eftimov Tome ..... 21
Erjavec Tomaž ..... 5, 17
Evkoski Bojan ..... 41
Grobelnik Marko ..... 37, 53
Jacobs Tobias ..... 65
Jelenčič Jakob ..... 61
Jovanovska Lidija ..... 45
Kenda Klemen ..... 57, 69
Koroušič Seljak Barbara ..... 21
Kralj Novak Petra ..... 41
Kurbašić Azur ..... 65
Lavrač Nada ..... 13
Ljubešić Nikola ..... 41
Luka Stopar ..... 53
Massri M.Besher ..... 25, 53
Mileva Boshkoska Mileva ..... 49
Mladenić Dunja ..... 5, 9, 17, 21, 25, 33, 37
Mladenić Grobelnik Adrian ..... 37
Mozetič Igor ..... 41
Novak Erik ..... 29
Panov Panče ..... 45, 49
Peternelj Jože ..... 57
Petrželková Nela ..... 13
Pita Costa Joao ..... 53
Popovski Gorjan ..... 21
Šircelj Beno ..... 57
Sittar Abdul ..... 5
Škrlj Blaž ..... 13
Stopar Luka ..... 65
Swati ..... 17, 33
Zajec Patrik ..... 9
Žunič Gregor ..... 29
Zupančič Peter ..... 49