Zbornik 21. mednarodne multikonference INFORMACIJSKA DRUŽBA - IS 2018 Zvezek C Proceedings of the 21st International Multiconference INFORMATION SOCIETY - IS 2018 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 8.–12. oktober 2018 / 8–12 October 2018 Ljubljana, Slovenia Zbornik 21. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2018 Zvezek C Proceedings of the 21st International Multiconference INFORMATION SOCIETY – IS 2018 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 8.–12. oktober 2018 / 8–12 October 2018 Ljubljana, Slovenia Urednika: Dunja Mladenić Laboratorij za umetno inteligenco Institut »Jožef Stefan«, Ljubljana Marko Grobelnik Laboratorij za umetno inteligenco Institut »Jožef Stefan«, Ljubljana Založnik: Institut »Jožef Stefan«, Ljubljana Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak Oblikovanje naslovnice: Vesna Lasič Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety Ljubljana, oktober 2018 Informacijska družba ISSN 2630-371X Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani COBISS.SI-ID=31884839 ISBN 978-961-264-137-5 (pdf) PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2018 Multikonferenca Informacijska družba (http://is.ijs.si) je z enaindvajseto zaporedno prireditvijo osrednji srednjeevropski dogodek na področju informacijske družbe, računalništva in informatike. Letošnja prireditev se ponovno odvija na več lokacijah, osrednji dogodki pa so na Institutu »Jožef Stefan«. Informacijska družba, znanje in umetna inteligenca so še naprej nosilni koncepti človeške civilizacije. Se bo neverjetna rast nadaljevala in nas ponesla v novo civilizacijsko obdobje ali pa se bo rast upočasnila in začela stagnirati? Bosta IKT in zlasti umetna inteligenca omogočila nadaljnji razcvet civilizacije ali pa bodo demografske, družbene, medčloveške in okoljske težave povzročile zadušitev rasti? Čedalje več pokazateljev kaže v oba ekstrema – da prehajamo v naslednje civilizacijsko obdobje, hkrati pa so notranji in zunanji konflikti sodobne družbe čedalje težje obvladljivi. Letos smo v multikonferenco povezali 11 odličnih neodvisnih konferenc. Predstavljenih bo 215 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica, ki se ponaša z 42-letno tradicijo odlične znanstvene revije. Multikonferenco Informacijska družba 2018 sestavljajo naslednje samostojne konference:  Slovenska konferenca o umetni inteligenci  Kognitivna znanost  Odkrivanje znanja in podatkovna skladišča – SiKDD  Mednarodna konferenca o visokozmogljivi optimizaciji v industriji, HPOI  Delavnica AS-IT-IC  Soočanje z demografskimi izzivi  Sodelovanje, programska oprema in storitve v informacijski družbi  Delavnica za elektronsko in mobilno zdravje ter pametna mesta  Vzgoja in izobraževanje v informacijski družbi  5. 
študentska računalniška konferenca  Mednarodna konferenca o prenosu tehnologij (ITTC) Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, Slovensko društvo za umetno inteligenco (SLAIS), Slovensko društvo za kognitivne znanosti (DKZ) in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju. V letu 2018 bomo šestič podelili nagrado za življenjske dosežke v čast Donalda Michieja in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe bo prejel prof. dr. Saša Divjak. Priznanje za dosežek leta bo pripadlo doc. dr. Marinki Žitnik. Že sedmič podeljujemo nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono letos prejme padanje državnih sredstev za raziskovalno dejavnost, jagodo pa Yaskawina tovarna robotov v Kočevju. Čestitke nagrajencem! Mojca Ciglarič, predsednik programskega odbora Matjaž Gams, predsednik organizacijskega odbora i FOREWORD - INFORMATION SOCIETY 2018 In its 21st year, the Information Society Multiconference (http://is.ijs.si) remains one of the leading conferences in Central Europe devoted to information society, computer science and informatics. In 2018, it is organized at various locations, with the main events taking place at the Jožef Stefan Institute. Information society, knowledge and artificial intelligence continue to represent the central pillars of human civilization. Will the pace of progress of information society, knowledge and artificial intelligence continue, thus enabling unseen progress of human civilization, or will the progress stall and even stagnate? Will ICT and AI continue to foster human progress, or will the growth of human, demographic, social and environmental problems stall global progress? Both extremes seem to be playing out to a certain degree – we seem to be transitioning into the next civilization period, while the internal and external conflicts of the contemporary society seem to be on the rise. The Multiconference runs in parallel sessions with 215 presentations of scientific papers at eleven conferences, many round tables, workshops and award ceremonies. Selected papers will be published in the Informatica journal, which boasts of its 42-year tradition of excellent research publishing. The Information Society 2018 Multiconference consists of the following conferences:  Slovenian Conference on Artificial Intelligence  Cognitive Science  Data Mining and Data Warehouses - SiKDD  International Conference on High-Performance Optimization in Industry, HPOI  AS-IT-IC Workshop  Facing demographic challenges  Collaboration, Software and Services in Information Society  Workshop Electronic and Mobile Health and Smart Cities  Education in Information Society  5th Student Computer Science Research Conference  International Technology Transfer Conference (ITTC) The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e. 
the Slovenian chapter of the ACM, Slovenian Artificial Intelligence Society (SLAIS), Slovenian Society for Cognitive Sciences (DKZ) and the second national engineering academy, the Slovenian Engineering Academy (IAS). On behalf of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contribution and their interest in this event, and the reviewers for their thorough reviews. For the sixth year, the award for life-long outstanding contributions will be presented in memory of Donald Michie and Alan Turing. The Michie-Turing award will be given to Prof. Saša Divjak for his life-long outstanding contribution to the development and promotion of information society in our country. In addition, an award for current achievements will be given to Assist. Prof. Marinka Žitnik. The information lemon goes to decreased national funding of research. The information strawberry is awarded to the Yaskawa robot factory in Kočevje. Congratulations! Mojca Ciglarič, Programme Committee Chair Matjaž Gams, Organizing Committee Chair ii KONFERENČNI ODBORI CONFERENCE COMMITTEES International Programme Committee Organizing Committee Vladimir Bajic, South Africa Matjaž Gams, chair Heiner Benking, Germany Mitja Luštrek Se Woo Cheon, South Korea Lana Zemljak Howie Firth, UK Vesna Koricki Olga Fomichova, Russia Mitja Lasič Vladimir Fomichov, Russia Blaž Mahnič Vesna Hljuz Dobric, Croatia Jani Bizjak Alfred Inselberg, Israel Tine Kolenik Jay Liebowitz, USA Huan Liu, Singapore Henz Martin, Germany Marcin Paprzycki, USA Karl Pribram, USA Claude Sammut, Australia Jiri Wiedermann, Czech Republic Xindong Wu, USA Yiming Ye, USA Ning Zhong, USA Wray Buntine, Australia Bezalel Gavish, USA Gal A. Kaminka, Israel Mike Bain, Australia Michela Milano, Italy Derong Liu, USA Toby Walsh, Australia Programme Committee Franc Solina, co-chair Matjaž Gams Vladislav Rajkovič Viljan Mahnič, co-chair Marko Grobelnik Grega Repovš Cene Bavec, co-chair Nikola Guid Ivan Rozman Tomaž Kalin, co-chair Marjan Heričko Niko Schlamberger Jozsef Györkös, co-chair Borka Jerman Blažič Džonova Stanko Strmčnik Tadej Bajd Gorazd Kandus Jurij Šilc Jaroslav Berce Urban Kordeš Jurij Tasič Mojca Bernik Marjan Krisper Denis Trček Marko Bohanec Andrej Kuščer Andrej Ule Ivan Bratko Jadran Lenarčič Tanja Urbančič Andrej Brodnik Borut Likar Boštjan Vilfan Dušan Caf Mitja Luštrek Baldomir Zajc Saša Divjak Janez Malačič Blaž Zupan Tomaž Erjavec Olga Markič Boris Žemva Bogdan Filipič Dunja Mladenič Leon Žlajpah Andrej Gams Franc Novak iii iv KAZALO / TABLE OF CONTENTS Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD ....... 1 PREDGOVOR / FOREWORD ....................................................................................................................... 3 PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ........................................................................... 4 Preparing Multi-Modal Data for Natural Language Processing / Novak Erik, Urbančič Jasna, Jenko Miha 5 Towards Smart Statistics in Labour Market Domain / Novalija Inna, Grobelnik Marko ................................ 9 Relation Tracker - Tracking the Main Entities and Their Relations Through Time / Massri M. 
Besher, Novalija Inna, Grobelnik Marko ..............................................................................................................13 Cross-Lingual Categorization of News Articles / Novak Blaž .....................................................................17 Transporation Mode Detection Using Random Forest / Urbančič Jasna, Pejović Veljko, Mladenić Dunja ......................................................................................................................................................21 FSADA, an Anomaly Detection Approach / Jovanoski Viktor, Rupnik Jan ................................................25 Predicting Customers at Risk With Machine Learning / Gojo David, Dujič Darko .....................................29 Text Mining Medline to Support Public Health / Pita Costa Joao, Stopar Luka, Fuart Flavio, Grobelnik Marko, Santanam Raghu, Sun Chenlu, Carlin Paul, Black Michaela, Wallace Jonathan ......................33 Crop Classification Using PerceptiveSentinel / Koprivec Filip, Čerin Matej, Kenda Klemen .....................37 Towards a Semantic Repository of Data Mining and Machine Learning Datasets / Kostovska Ana, Džeroski Sašo, Panov Panče .................................................................................................................41 Towards a Semantic Store of Data Mining Models and Experiments / Tolovski Ilin, Džeroski Sašo, Panov Panče......................................................................................................................................................45 A Graph-Based Prediction Model With Applications / London András, Németh József, Krész Miklós ......49 Indeks avtorjev / Author index ......................................................................................................................55 v vi Zbornik 21. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2018 Zvezek C Proceedings of the 21st International Multiconference INFORMATION SOCIETY – IS 2018 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 11. oktober 2018 / 11 October 2018 Ljubljana, Slovenia 1 2 PREDGOVOR Tehnologije, ki se ukvarjajo s podatki so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskem procesiranju ampak tudi analitskim vpogledom v podatke – pojavilo se je t.i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line-Analytical-Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore in na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz. z drugimi besedami to, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj zajetih v podatkih. 
Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve. INTRODUCTION Data driven technologies have significantly progressed after mid 90’s. The first phases were mainly focused on storing and efficiently accessing the data, resulted in the development of industry tools for managing large databases, related standards, supporting querying languages, etc. After the initial period, when the data storage was not a primary problem anymore, the development progressed towards analytical functionalities on how to extract added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line-Analytical-Processing entered as a usual part of a company’s information system portfolio, requiring from the user to set well defined questions about the aggregated views to the data. Data Mining is a technology developed after year 2000, offering automatic data analysis trying to obtain new discoveries from the existing data and enabling a user new insights in the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis, Data, Text and Multimedia Mining, Semantic Technologies, Link Detection and Link Analysis, Social Network Analysis, Data Warehouses. 3 PROGRAMSKI ODBOR / PROGRAMME COMMITTEE Dunja Mladenić, Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana Marko Grobelnik, Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana 4 Preparing multi-modal data for natural language processing Erik Novak Jasna Urbančič Miha Jenko Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan International Ljubljana, Slovenia Ljubljana, Slovenia Postgraduate School jasna.urbancic@ijs.si miha.jenko@ijs.si Ljubljana, Slovenia erik.novak@ijs.si ABSTRACT to find similar items based on the model input. Throughout the In education we can find millions of video, audio and text educa- paper we focus on educational material but the approach can be tional materials in different formats and languages. This variety and generalized to other multi-modal data sets. multimodality can impose difficulty on both students and teachers The reminder of the paper is structured as follows. In section 2 since it is hard to find the right materials that match their learning we go over related work. Next, we present the data preprocessing preferences. This paper presents an approach for retrieving and pipeline which is able to process different types of data – text, video recommending items of different modalities. The main focus is on and audio – and describe each component of the pipeline in section the retrieving and preprocessing pipeline, while the recommenda- 3. A content based recommendation model that uses Wikipedia tion engine is based on the k-nearest neighbor method. We focus concepts to compare materials is presented in section 4. Finally, we on educational materials, which can be text, audio or video, but the present future work and conclude the paper in section 5. proposed procedure can be generalized on any type of multi-modal data. 2 RELATED WORK KEYWORDS In this section we present the related work which the rest of the paper is based on. We split this section into subsections – multi- Multi-modal data preprocessing, machine learning, feature extrac- modal data preprocessing and recommendation models. 
tion, recommender system, open educational resources Multi-modal Data Preprocessing. Multi-modal data can be seen ACM Reference Format: as classes of different data types from which we can extract similar Erik Novak, Jasna Urbančič, and Miha Jenko. 2018. Preparing multi-modal features. In the case of educational material the classes are video, data for natural language processing. In Proceedings of Slovenian KDD Con- audio and text. One of the approaches is to extract text from all ference (SiKDD’18). ACM, New York, NY, USA, Article 4, 4 pages. https: class types. In [6] the authors describe a Machine Learning and //doi.org/10.475/123_4 Language Processing automatic speech recognition system that can convert audio to text in the form of transcripts. The system can 1 INTRODUCTION also process video files as they are also able to extract audio from There are millions of educational materials that are found in dif- it. Their model was able to achieve a 13.3% word error rate on an ferent formats – courses, video lectures, podcasts, simple text doc- English test set. These kind of systems are useful for extracting uments, etc. Because of its vast variety and multimodality it is text from audio and video but would need to have a model for each difficult for both students and teachers to find the right materi- language. als that will match their learning preferences. Some like to read a Recommendation models. These models are broadly used in short scientific papers while others just like to sit back and watch many fields – from recommending videos based on what the user a lecture that can last for hours. Additionally, materials are written viewed in the past, to providing news articles that the user might in different languages, which is a barrier for people who are not be interested in. One of the most used approaches is based on fluent in the language the material is written in. Finding a good collaborative filtering [16], which finds users that have similar approach of providing educational material would help improving preferences with the target user and recommends items based on their learning experience. their ratings. Recommender systems now do not contain only one In this paper we present a preprocessing pipeline which is able algorithm but multiple which return different recommendations. to process multi-modal data and input it in a common semantic Authors of [10] discuss about the various algorithms that are used space. The semantic space is based on Wikipedia concepts extracted in the Netflix recommender system (top-n video ranker, trending from the content of the materials. Additionally, we developed a con- now, continue watching, and video-video similarity), as well as the tent based recommendation model which uses Wikipedia concepts methods they use to evaluate their system. A high level description of the Youtube recommender system is found in [3]. They developed Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed a candidate generation model and a ranking model using deep for profit or commercial advantage and that copies bear this notice and the full citation learning. Both Netflix and Youtube recommend videos based on on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). users’ interaction with them and the users history. 
To some extent SiKDD’18, October 2018, Ljubljana, Slovenia this can be used for educational resources but cannot be generalized © 2018 Copyright held by the owner/author(s). on the whole multi-modal data set since we cannot acquire data ACM ISBN 123-4567-24-567/08/06. about users’ interaction with, for instance, text. https://doi.org/10.475/123_4 5 SiKDD’18, October 2018, Ljubljana, Slovenia Erik Novak, Jasna Urbančič, and Miha Jenko A collaborative filtering based recommendation system for the Crawling. The first step is to acquire the educational materials. We educational sector is presented in [8]. They evaluated educational have targeted four different OER repositories (MIT OpenCourse- content using big data analysis techniques and recommended courses Ware, Università di Bologna, Université de Nantes and Videolec- to students by using their grades obtained in other subjects. This tures.NET), for which we used their designated APIs or developed gives us insight into how recommendations can be used in educa- custom crawlers to acquire their resources. For each material we tion but our focus is to recommend educational materials rather acquired its metadata, such as the materials title, url, type, language than courses. In a sense courses can be viewed as bundles of ed- in which it is written and its provider. These values are used in the ucational material; thus, our interest is recommending “parts of following steps of the pipeline as well as to represent the material courses” to the user. in the recommendations. Formatting. Next, we format the acquired material metadata. We designate which attributes every material needs to have as well as 3 DATA PREPROCESSING set placeholders for the features extracted in the following steps In this paper we focus on open educational resources (OER), which of the pipeline. By formatting the data we set a schema which are freely accessible, openly licensed text, media, and other digi- makes checking which attributes are missing easy. We do not have tal assets that are useful for teaching, learning and assessing [21]. a mechanism for handling missing attributes in the current pipeline These are found in different OER repositories maintained by univer- iteration but we will dedicate time to solve this problem in the sities, such as MIT OpenCourseWare [12], Università di Bologna [7], future. Université de Nantes [4] and Universitat Politècnica de València [5], Text Extraction. The third step, we extract the content of each as well as independent repositories such as Videolectures.NET [20], material in text form. Since the material can be a text, video or a United Nations award-winning free and open access educational audio file to handled each file type separately. video lectures repository. For text we employed textract [1] to extract raw text from the For processing the different OER we developed a preprocessing given text documents. The module omits figures and returns the pipeline that can handle each resource type and output metadata content as text. The extracted text is not perfect - in the case of used for comparing text, audio and video materials. The pipeline is materials for mathematics it does not know how to represent mathe- an extension of the one described in [11]; its architecture is shown matical equations and symbols. In that case, it replaces the equations in figure 1. What follows are the descriptions of each component with textual noise. Currently we do nothing to handle this problem in the preprocessing pipeline. 
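The text-extraction step just described can be summarised with a small dispatcher that routes each material to the appropriate extractor by file type. This is only an illustrative sketch, not the authors' implementation: the paper uses the Node.js textract module and the transLectures service, so the two helper functions below (extract_with_textract, transcribe_with_translectures) are hypothetical stand-ins for those external tools.

```python
TEXT_TYPES = {"pdf", "docx", "pptx", "txt"}
AV_TYPES = {"mp4", "wmv", "mp3"}

def extract_with_textract(url):
    # Placeholder for the Node.js textract module used in the paper;
    # here it only simulates returning the document's raw text.
    return f"<raw text of {url}>"

def transcribe_with_translectures(url):
    # Placeholder for the transLectures transcription/translation service.
    return f"<transcript of {url}>"

def extract_content(material):
    """Return the raw-text content of a material, choosing the extractor
    by file type, as in the pipeline's text-extraction step."""
    ext = material["url"].rsplit(".", 1)[-1].lower()
    if ext in TEXT_TYPES:
        # text documents: extract raw text (figures are omitted)
        return extract_with_textract(material["url"])
    if ext in AV_TYPES:
        # video/audio: use subtitles or an automatically generated transcript
        return transcribe_with_translectures(material["url"])
    raise ValueError(f"unsupported file type: {ext}")
```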
For now we simply use the extracted output as is. For video and audio we use the subtitles and/or transcriptions to represent the material's content. To do this, we use transLectures [18], which generates transcriptions and translations of a given video or audio file. The languages it supports are English, Spanish, German and Slovene. The output of the service is in the dfxp format [17], a standard for xml captions and subtitles based on the timed text markup language, from which we extract the raw text.

[Figure 1: The preprocessing pipeline architecture (crawling, formatting, text extraction with textract and transLectures, wikification, storing). It is designed to handle each data type as well as extract features to support multi- and cross-linguality.]

Wikification. Next, we send the material through wikification - a process which identifies and links the material's textual components to the corresponding Wikipedia pages [15]. This is done using Wikifier [2], which returns a list of Wikipedia concepts that are most likely related to the textual input. The web service also supports cross- and multi-linguality, which enables extracting and annotating materials in different languages.

Wikifier's input text is limited to 20k characters, because of which longer text cannot be processed as a whole. We split longer text into chunks of at most 10k characters and pass them to Wikifier. Here we are careful not to split the text in the middle of a sentence and, if that is not possible, to at least not split any words. We split the text as follows. First we take a 10k-character substring of the text. Next, we identify the last character in the substring that signifies the end of a sentence (a period, a question mark, or an exclamation point) and split at that character. If there is no such character we find the last whitespace in the substring and split there. In the extreme case where no whitespace is found we take the substring as is. The substring becomes one chunk of the original text. We repeat the process on the remaining text until it is fully split into chunks; a sketch of this splitting procedure is given below.
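The following is a minimal sketch of the splitting rule described above. The 10k-character limit and the sentence-ending characters come from the text; the function and variable names are hypothetical and do not reflect the authors' actual implementation.

```python
def split_into_chunks(text, max_len=10_000):
    """Split text into chunks of at most max_len characters,
    preferring sentence boundaries, then spaces, then a hard cut."""
    sentence_ends = ".?!"
    chunks = []
    while text:
        if len(text) <= max_len:
            chunks.append(text)
            break
        window = text[:max_len]
        # prefer the last sentence-ending character in the window
        cut = max(window.rfind(c) for c in sentence_ends)
        if cut == -1:
            # otherwise split at the last space
            cut = window.rfind(" ")
        if cut == -1:
            # extreme case: no space found, take the window as is
            cut = max_len - 1
        chunks.append(text[:cut + 1])
        text = text[cut + 1:]
    return chunks
```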
When we pass these chunks into Wikifier, it returns the Wikipedia concepts related to the given chunk. These concepts also contain the Cosine similarity between the Wikipedia concept page and the given input text. To calculate the similarity between the concept and the whole material we aggregate the concepts by calculating the weighted sum

S_k = \sum_{i=1}^{n} \frac{L_i}{L} \, s_{k,i},

where S_k is the aggregated Cosine similarity of concept k, n is the number of chunks for which Wikifier returned concept k, L_i is the length of chunk i, L is the length of the material's raw text, and s_{k,i} is the Cosine similarity of concept k to chunk i. The weight L_i / L represents the presence of concept k, found in chunk i, in the whole material. The aggregated Wikipedia concepts are stored in the material's metadata attribute.

Data Set Statistics. In the final step, we validate the material attributes and store them in a database. The OER material data set consists of approximately 90k items. The distribution of materials over the four repositories is shown in figure 2.

[Figure 2: Number of materials per repository crawled, in logarithmic scale. Most materials come from MIT OpenCourseWare, followed by Videolectures.NET.]

Some of the repositories offer material in different languages. All repositories together cover 103 languages; however, for only 8 languages the count of available materials is larger than 100. The distribution of items over languages is shown in figure 3, where we only show languages with more than 100 items available. Most of the material is in English, followed by Italian and Slovene. The "Unknown" column shows that for about 6k materials we were not able to extract the language. To acquire this information, we will improve the language extraction method in our preprocessing pipeline.

[Figure 3: Number of materials per language, in logarithmic scale. Most of the material is in English, followed by Italian and Slovenian.]

As shown before, the preprocessing pipeline is designed to handle different types of material - text, video and audio. Each type can be represented in various file formats, such as pdf and docx for text, wmv and mp4 for video, and mp3 for audio. We visualized the distribution of materials over file types in figure 4, but we only show types with more than 100 items available.

[Figure 4: Number of items per file type, in logarithmic scale. The dominant file type is text (pdf, pptx and docx), followed by video (mp4).]

As seen from the figure, the dominant file type is text (pdf, pptx and docx), followed by video (mp4). The msi file type is an installer package file format used by Windows, but it can also be a textual document or a presentation. If we generalize the file type distribution over all OER repositories we can conclude that the dominant file type is text. This will be taken into account when improving the preprocessing pipeline and recommendation engine.

4 RECOMMENDER ENGINE

There are different ways of creating recommendations. Some employ users' interests while others are based on collaborative filtering. In this section we present our content based recommendation engine, which uses the k-nearest neighbor algorithm [13]. What follows are descriptions of how the model generates recommendations based on the user's input, which can be either the identifier of the OER in the database or a query text.

Material identifier. When the engine receives the material identifier (in our case the url of the material) we first check if the material is in our database. If present, we search for the k most similar materials to the one with the given identifier based on the Wikipedia concepts. Each material is represented by a vector of its Wikipedia concepts, where each value is the aggregated Cosine similarity of the corresponding Wikipedia concept page to the material. By calculating the Cosine similarity between the materials the engine then selects the k materials with the highest similarity score and returns them to the user. Because of the nature of Wikipedia concepts this approach returns materials written in different languages, which helps overcome the language barrier.

Query text. When the engine receives a query text we search for materials with the most similar raw text using the bag-of-words model. Each material is represented as a bag-of-words vector where each value of the vector is the tf-idf of the corresponding word. The materials are then compared using the Cosine similarity and the engine again returns the k materials that have the highest similarity score. This approach is simple but it is unable to handle multilingual documents. This might be overcome by first sending the query text to Wikifier to get its associated Wikipedia concepts and using them in a similar way as described in the Material identifier approach. A sketch of the concept-based nearest-neighbour search is given below.
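As a concrete illustration of the concept-based nearest-neighbour search described above, the sketch below compares materials through the cosine similarity of their Wikipedia-concept vectors. It is a minimal, self-contained approximation under the assumption that each material is stored as a dict mapping concept names to aggregated scores; the actual engine is built on the QMiner platform, and all names here (recommend, concept_vectors, etc.) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(material_id, concept_vectors, k=10):
    """Return the k materials whose Wikipedia-concept vectors are most
    similar to the vector of the given material (excluding itself)."""
    query = concept_vectors[material_id]
    scored = [
        (cosine(query, vec), other)
        for other, vec in concept_vectors.items()
        if other != material_id
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Toy collection of materials annotated with aggregated concept scores.
concept_vectors = {
    "lecture/deep-learning": {"Deep learning": 0.9, "Neural network": 0.7},
    "pdf/backprop-notes":    {"Neural network": 0.8, "Gradient descent": 0.6},
    "video/intro-biology":   {"Cell (biology)": 0.9},
}
print(recommend("lecture/deep-learning", concept_vectors, k=2))
```

Because the concept vectors are language-independent, the same comparison works across materials written in different languages, which is the property the engine relies on.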
7 SiKDD’18, October 2018, Ljubljana, Slovenia Erik Novak, Jasna Urbančič, and Miha Jenko 4.1 Recommendation Results In the future we will evaluate the current recommendation en- The described recommender engine is developed using the QMiner gine and use it to compare it with other state-of-the-art. We intend platform [9] and is available at [14]. When the user inputs a text to use A/B testing to optimize the models based on the user’s inter-query the system returns recommendations similar to the given action with them. We wish to improve the engine by collecting user text. These are shown as a list where each item contains the title, url, activity data to determine what materials are liked by the users, description, provider, language and type of the material. Clicking explore different deep learning methods to improve results, and on an item redirects the user to the selected OER. develop new representations and embeddings of the materials. We have also discussed with different OER repository owners We also aim to improve the preprocessing pipeline by improving and found that they would be interested in having the recommen- text extraction methods, handle missing material attributes, and dations in their portal. To this end, we have developed a compact adding new feature extraction methods to determine the topic and recommendation list which can be embedded in a website. The rec- scientific field of the educational material as well as their quality. ommendations are generated by providing the material identifier or raw text as query parameters in the embedding url. Figure 5 shows ACKNOWLEDGMENTS the embed-ready recommendation list. This work was supported by the Slovenian Research Agency and X5GON European Unions Horizon 2020 project under grant agree- ment No 761758. REFERENCES [1] David Bashford. 2018. GitHub - dbashford/textract: node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more! https://github.com/dbashford/textract. Accessed: 2018-09-03. [2] Janez Brank, Gregor Leban, and Marko Grobelnik. 2017. Annotating documents with relevant Wikipedia concepts. Proceedings of SiKDD. [3] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198. [4] Université de Nantes. 2018. Plate-forme d’Enseignement de l’Université de Nantes. http://madoc.univ-nantes.fr/. Accessed: 2018-09-03. [5] Universitat Politècnica de València. 2016. media UPV. https://media.upv.es/#/ portal. Accessed: 2018-09-03. [6] Miguel Ángel del Agua, Adrià Martínez-Villaronga, Santiago Piqueras, Adrià Giménez, Alberto Sanchis, Jorge Civera, and Alfons Juan. 2015. The MLLP ASR Systems for IWSLT 2015. In Proc. of 12th Intl. Workshop on Spoken Language Translation (IWSLT 2015). Da Nang (Vietnam), 39–44. http://workshop2015.iwslt. org/64.php [7] Università di Bologna. 2018. Universita di Bologna. https://www.unibo.it/it. Accessed: 2018-09-03. [8] Surabhi Dwivedi and VS Kumari Roshni. 2017. Recommender system for big data in education. In E-Learning & E-Learning Technologies (ELELTECH), 2017 5th National Conference on. IEEE, 1–4. [9] Blaz Fortuna, J Rupnik, J Brank, C Fortuna, V Jovanoski, M Karlovcec, B Kazic, K Kenda, G Leban, A Muhic, et al. 2014. » QMiner: Data Analytics Platform for Processing Streams of Structured and Unstructured Data «, Software Engineering for Machine Learning Workshop. In Neural Information Processing Systems. 
[10] Carlos A Gomez-Uribe and Neil Hunt. 2016. The netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Figure 5: An example of recommended materials for the lec- Information Systems (TMIS) 6, 4 (2016), 13. ture with the title “Is Deep Learning the New 42?” published [11] Erik Novak and Inna Novalija. 2017. Connecting Professional Skill Demand with Supply. Proceedings of SiKDD. on Videolectures.NET [19]. The figure shows cross-lingual, [12] Massachusetts Institute of Technology. 2018. MIT OpenCourseWare | Free Online cross-modal, and cross-site recommendations. Course Materials. https://ocw.mit.edu/index.htm. Accessed: 2018-09-03. [13] Leif E Peterson. 2009. K-nearest neighbor. Scholarpedia 4, 2 (2009), 1883. [14] X5GON Project. 2018. X5GON Platform. https://platform.x5gon.org/search. The recommendation list consists of the top 100 materials based Accessed: 2018-09-04. [15] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and on the query input. As shown in the figure the recommendation global algorithms for disambiguation to wikipedia. In Proceedings of the 49th contain materials of different types, are provided by different reposi- Annual Meeting of the Association for Computational Linguistics: Human Language tories and written in different languages. We have not yet evaluated Technologies-Volume 1. Association for Computational Linguistics, 1375–1384. [16] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based the recommendation engine but we intend to do it in the future. collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. ACM, 285–295. 5 FUTURE WORK AND CONCLUSION [17] Speechpad. 2018. DFXP (Distribution Format Exchange Profile) | Speechpad. https://www.speechpad.com/captions/dfxp. Accessed: 2018-09-04. In this paper we present the methodology for processing multi- [18] transLectures. 2018. transLectures | transcription and translation of video lectures. modal items and creating a semantic space in which we can compare http://www.translectures.eu/. Accessed: 2018-09-03. [19] VideoLectures.NET. 2018. Is Deep Learning the New 42? - Videolectures.NET. these items. We acquired a moderately large open educational re- http://videolectures.net/kdd2016_broder_deep_learning/. Accessed: 2018-09-03. sources data set, created a semantic space with the use of Wikipedia [20] VideoLectures.NET. 2018. VideoLectures.NET - VideoLectures.NET. http:// videolectures.net/. Accessed: 2018-09-03. concepts and developed a basic content based recommendation en- [21] Wikipedia. 2018. Open educational resources - Wikipedia. https://en.wikipedia. gine. org/wiki/Open_educational_resources. Accessed: 2018-09-03. 8 TOWARDS SMART STATISTICS IN LABOUR MARKET DOMAIN Inna Novalija Marko Grobelnik Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39, Ljubljana, Slovenia Jamova cesta 39, Ljubljana, Slovenia inna.koval@ijs.si marko.grobelnik@ijs.si ABSTRACT respect to defined scenarios – demand analysis, skills ontology development and skills ontology evolution. In this paper, we present a proposal for developing smart labour market statistics based on streams of enriched textual data and illustrate its application on job vacancies from European 2. BACKGROUND countries. 
We define smart statistics scenarios including demand The development of smart labour market statistics touches a analysis scenario, skills ontology development scenario and skills number of issues from labour market policies area and would ontology evolution scenario. We identify stakeholders – provide contributions to questions related to: consumers for smart statistics and define the initial set of smart - job creation, labour market statistical indicators. - education and training systems, - labour market segmentation, KEYWORDS - improving skill supply and productivity. Smart statistics, labour market, demand analysis. For instance, the analysis of the available job vacancies could offer an insight into what skills are required in the particular area. Effective trainings based on skills demand could be organized and 1. INTRODUCTION that would lead into better labour market integration. A number of stakeholder types will benefit from the development An essential feature of modern economy is the appearance of new of smart labour market statistics. In particular, the targeted skills, such as digital skills. For instance, e-skills lead to the stakeholders are: exponential increases in production and consumption of data. - Statisticians from National and European statistical offices While job profiles vary and are still in the process of being who are interested in the application of new technologies for defined, organizations agree that they need the new breed of production of the official statistics. workers. - Individual persons who are searching for new employment Accordingly, the European institutions take major initiatives opportunities. In particular, individuals are interested in the related to digitalization of labor market, training of new skills and job vacancies that are compatible with their current skills and meeting the labour demand. in the methods (like trainings) providing the possibilities to Historically, the labour market statisticians use standard measures obtain new skills in demand. of the labour demand and labour supply based on traditional - Public and private employment agencies interested in up-to- surveys – job vacancy surveys, wage survey, labour force surveys. date employees profiles. The unemployment rate provides information on the supply of - persons looking for work in excess of those who are currently Education and training institutions from different levels and employed. Data on employment provide information on the forms of education - general/vocational education, higher demand for workers that is already met by employers. education, public/private, initial/ adult education. Educational institutions are interested in relevant skills and The data-driven smart labour market statistics intends to: topics that should be part of the curriculum programs. - use the available historical job vacancies data, - Ministries of labour/manpower, economy/industry/trade, - use the available real-time job vacancies data, education, finance, etc. The policy makers, such as - use the available real-time and historical dataset of additional ministries, are interested in the overall labour market data (described below), situation, with respect to location and time, in the labour - align data sources, market segmentation and in the processes of improving - construct models and obtain novel smart labour market supply and productivity. indicators that will complement existing labour market - Standards development organizations. 
National or statistics, International organizations whose primary activities are - provide a system for delivering results to the users. developing, coordinating, promulgating, revising, amending, reissuing, interpreting, or otherwise producing technical The smart labour market statistics approach will combine standards that are intended to address the needs of some advanced data processing, modelling and visualization methods in order to develop trusted techniques for job vacancies analysis with 9 relatively wide base of affected adopters. Interested in new - Social media data, such as news, Twitter data that might be technologies developed in relation to labour market. relevant for labour market. - Academic and research institutes. Public and private entities - Labour supply data (based on user profile analysis). who conduct research in relevant areas. Research institutions Open job vacancies can be found using job search services. These are interested in the development of novel methodologies services aggregate job vacancies by location, sector, applicant and usage of appearing new data sources. qualifications and skill set or type. One such service is Adzuna [4], a search engine for job ads, which mostly covers English- 3. RELATED WORK speaking countries. The European Data Science Academy (EDSA) [1] was an H2020 For data acquisition and enrichment, dedicated APIs, including EU project that ran between February 2015 and January 2018. Adzuna API, are used, as well as custom web crawlers are The objective of the EDSA project was to deliver the learning developed. The data is formatted to JSON to aid further tools that are crucially needed to close the skill gap in Data processing and enrichment. The job vacancy dataset is obtained Science in the EU. The EDSA project has developed a virtuous with respect to trust and privacy regulations, the personal data is learning production cycle for Data Science, and has: not collected. - Analyzed the sector specific skillsets for data analysts across Job vacancies usually contain the information, such as job Europe with results reflected at EDSA demand and supply position title, job description, company and job location. In such dashboard; way, job vacations that are constantly crawled/web-scraped - Developed modular and adaptable curricula to meet these present a data stream. The job title and job description are textual data science needs; and data that contain information about skills that employee should - have. Delivered training supported by multiplatform resources, introducing Learning pathway mechanism that enables On the obtained data wikification - identifying and linking textual effective online training. components (including skills) to the corresponding Wikipedia EDSA project established a pipeline for job vacancy collecting pages [5] is performed. This is done using Wikifier [6], which and analysis that will be reused for the purpose of smart statistics. also supports cross and multi-linguality enabling extraction and annotation of relevant information from job vacancies in different An ontology called SARO (Skills and Recruitment Ontology) [2] languages. The data is tagged with concepts from GeoNames has been developed to capture important terms and relationships ontology [7]. To job postings where latitude and longitude have to facilitate the skills analysis. SARO ontology concepts included been available, GeoNames location uri and location name are relevant classes to job vacancy datasets, such as Skill and added. 
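To make the enrichment step above concrete, the sketch below annotates a single job-posting description with the JSI Wikifier web service mentioned in the text. It assumes the publicly documented annotate-article endpoint, a valid userKey, and that the JSON response contains an annotations list with title and url fields; treat these details as assumptions rather than a definitive client.

```python
import requests

WIKIFIER_URL = "http://www.wikifier.org/annotate-article"  # assumed public endpoint

def wikify(text, lang="en", user_key="YOUR_USER_KEY"):
    """Send text to JSI Wikifier and return the linked Wikipedia concepts."""
    response = requests.post(WIKIFIER_URL, data={
        "userKey": user_key,  # obtained by registering at wikifier.org
        "text": text,
        "lang": lang,
    })
    response.raise_for_status()
    annotations = response.json().get("annotations", [])
    # keep only the concept titles and their Wikipedia URLs
    return [(a.get("title"), a.get("url")) for a in annotations]

vacancy = "We are looking for a data scientist with Python and TensorFlow experience."
print(wikify(vacancy))
```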
To the postings where only location name has been JobPosting. Examples of instances of class Skill would be skills, available, the coordinates and location uri are added. such as “Data analysis”, “Java programming language” et al. The job vacancy data representation level depends on the specific ESCO [3] is the multilingual classification of European Skills, country. For the United Kingdom, France, Germany and the Competences, Qualifications and Occupations. It identifies and Netherlands there is a substantial collection of job vacancies in categorizes skills/competences, qualifications and occupations the area of digital technologies. relevant for the EU labour market and education and training, in 25 European languages. The system provides occupational 4.2 CONCEPTUAL ARCHITECTURE profiles showing the relationships between occupations, The labour market statistics conceptual structure is built upon the skills/competences and qualifications. For instance, one example following major blocks: of existing ESCO skill is “JavaScript” (with alternative labels “Client-side JavaScript”, "JavaScript 1.7" et al.). 1. Data sources related to different aspects of smart labour market. The main data source aggregates historical and current job Both SARO and ESCO ontologies are useful for the aim of smart vacancies in the area of digital technologies and data science statistics, in particular for skills ontology development and skills around Europe. ontology evolution scenarios. However, the ontologies usually are manually manipulated, and the methods developed for smart 2. Modelling smart labour market statistics takes central part of labour market statistics should overcome the difficulties related to the smart labour market statistics approach, where the goal is to this issue. The ontology evolution scenario of smart labour market construct models based on different data sources, updated in statistics envisions automatic identification of emerging and business-real-time (as needed or as data sources allow). Models decreasing skills from the data perspective. shall bring understanding of the smart labour market statistics domain and shall be used for aggregation, ontology development and ontology evolution. 4. PROBLEM DEFINITION 3. Targeted users are smart statistics consumers. There are several 4.1 DATA SOURCES major groups of users (described above). The example users might include statisticians, policy makers, individual users (residents The main data sources available for the development of smart and non-residents), training and educational organizations and labour market statistics are historical and current data about job other. vacancies in the area of digital technologies and data science around Europe (~5.000.000 job vacancies 2015-2018). 4. Finally, applications of smart labour market statistics are multiscale - they can be presented at cross-country level (around Additional data sources may include: 10 Europe) country level (UK, France, the Netherlands etc.), relationships, and other distinctions that are relevant for modeling city/area level and conceptual level (ontology). a domain. The specification takes the form of the definitions of Figure 1 illustrates the conceptual architecture diagram for smart representational vocabulary (classes, relations, and so forth), labour market statistics. which provide meanings for the vocabulary and formal constraints on its coherent use. 
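As a concrete illustration of the demand analysis scenario described above, the following sketch aggregates wikified and geotagged job vacancies into a simple skill-demand indicator per country and month. It is a toy example over an assumed record layout (skills, country and date fields), not the project's implementation.

```python
from collections import Counter
from datetime import date

# Toy records in the shape produced by crawling + wikification + geotagging.
vacancies = [
    {"skills": ["Python", "TensorFlow"], "country": "UK", "date": date(2018, 6, 3)},
    {"skills": ["Java"],                 "country": "FR", "date": date(2018, 6, 12)},
    {"skills": ["Python"],               "country": "UK", "date": date(2018, 7, 1)},
]

def skill_demand(vacancies):
    """Count skill mentions per (skill, country, year-month) bucket."""
    counts = Counter()
    for v in vacancies:
        month = v["date"].strftime("%Y-%m")
        for skill in v["skills"]:
            counts[(skill, v["country"], month)] += 1
    return counts

for (skill, country, month), n in sorted(skill_demand(vacancies).items()):
    print(f"{month} {country}: {skill} appears in {n} vacancies")
```

The same aggregation generalises to the multiscale indicators listed later (cross-country, country and city level) simply by changing the grouping key.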
Figure 1: Conceptual Architecture The key characteristics of the development techniques will include: - Interpretability and transparency of the models – the aim is, for a model to be able to explain its decision in a human readable manner (vs. black box models, which provide results without explanation). - Non-stationary modelling techniques are required due to changing data and its statistical properties in time. For instance, the ontology evolution process will be modeled taking to the account the incremental data arriving to the system. - Multi-resolution nature of the models, having the property to observe the structure of a model on multiple levels of granularity, depending on the application needs. - Scalability for building models is required due to the nature of incoming data streams. 4.3 SCENARIOS The smart labour market statistics proposal includes three scenarios - demand analysis scenario, ontology development Figure 2: Example of Job Vacancies Crawled and scenario and ontology evolution scenario described below. Processed 4.3.1 DEMAND ANALYSIS Ontologies are often manually developed and maintained, what Demand analysis scenario suggests production of statistical requires a sufficient user efforts. indicators based on the available job vacancies using techniques In the ontology development scenario an automatic (or semi- for data preprocessing, semantic annotation, cross-linguality, automatic) bottom-up process of creating ontology from available location identification and aggregation. job vacancies will be suggested. Job vacancies in structural and semi-structural form are the input The relevant skills (extracted from the job vacancies) will be to into the system, while statistics related to overall job demand, defined and formalized. Using semantic annotation and cross- job demand with respect to particular location, job demand with linguality techniques for skills extraction based on JSI Wikifier respect to particular skill (skill demand) and time frame are the tool [6] will enable the possibility of including the newest outputs of the system. available skills “on the market” that are not yet captured in the Figure 2 presents an example of crawled and processed job ontologies, taxonomies and classifications that are manually vacancies. developed. The input to the ontology development scenario is a set of job vacancies and the output is ontology of skills presenting 4.3.2 SKILLS ONTOLOGY DEVELOPMENT the domain structure that can be compared to or used for official Ontologies reduce the amount of information overload in the classifications. working process by encoding the structure of a specific domain and offering easier access to the information for the users. Gruber 4.3.3 SKILLS ONTOLOGY EVOLUTION [8] states that an ontology defines (specifies) the concepts, Ontology Evolution is the timely adaptation of an ontology to the arisen changes and the consistent propagation of these 11 changes to dependent artefacts [9]. Ontology evolution is a - Ontology evolution statistics. Example: emerging skills in process that combines a set of technical and managerial activities the ontology in the last 3 months and ensures that the ontology continues to meet organizational Since the data has a streaming nature, different kinds of multiscale objectives and users’ needs in an efficient and effective way. and aggregation options can be handled with respect to time Ontology management is the whole set of methods and techniques parameters. 
that is necessary to efficiently use multiple variants of ontologies from possibly different sources for different tasks [10]. Scenario 3 will suggest an automatic (or semi-automatic) ontology 6. CONCLUSION AND FUTURE WORK evolution process based on the real-time job vacancy stream. With In this paper, we presented a proposal for developing smart labour respect to the nature of job vacancy data stream and skills market statistics based on streams of enriched textual data, such as extracted from job it will be possible to see the dynamics of job vacancies from European countries. We define smart statistics evolving skills – when the new skills (not included into the scenarios, such as demand analysis scenario, skills ontology current ontology versions appear) and how the skills ontology is development scenario and skills ontology evolution scenario. The changing with time. future work would include the implementation of the smart labour In particular, it could be possible to observe appearing new skills market scenarios, quality assessment and evaluation of the and suggest them for inclusion into official skills classifications. produced statistical outcomes. In addition, it could be visible how fast the ontology changes, which could be the indicator of the technological progress on the 7. ACKNOWLEDGMENTS relevant market. This work was supported by the Slovenian Research For instance, the current version of ESCO classification does not Agency and EDSA European Union Horizon 2020 project contain “TensorFlow” skill (TensorFlow [11] is an open-source under grant agreement No 64393. software library for dataflow programming across a range of tasks, appeared in 2015). TensorFlow, which is already present in job vacancies, could be captured during ontology evolution process 8. REFERENCES and suggested as a new concept for official classifications. [1] EDSA, http://edsa-project.eu (accessed in August, 2018). [2] Sibarani, Elisa & Scerri, Simon & Mousavi, Najmeh & Auer, Sören. (2016). Ontology-based Skills Demand and Trend 5. STATISTICAL INDICATORS Analysis. 10.13140/RG.2.1.3452.8249. Traditionally the indicators related to labour market have been [3] ESCO taxonomy, https://ec.europa.eu/esco/portal (accessed in based on survey responses. The smart labour market statistics August 2018). proposal introduces a possibility to complement standard statistical indicators, such as job vacancy rate with novel “data [4] Adzuna developer page, inspired” knowledge. https://developer.adzuna.com/overview (accessed in August, 2018). The smart labour market statistics indicators use data sources, previously not covered by official statistics, and in such way [5] Ratinov, L., Roth, D., Downey, D. and Anderson, M. Local complementary to traditional data sources. The smart labour and global algorithms for disambiguation to wikipedia. In market statistics indicators are based on real-time data streams, Proceedings of the 49th Annual Meeting of the Association for which makes possible to obtain not only historical, but also Computational Linguistics: Human Language Technologies- current values for job vacancies that could be used for different Volume 1, pages 1375–1384. Association for Computational purposes, such as nowcasting. In addition, the smart labour Linguistics, 2011. market statistics indicators take into the account data cross-lingual [6] JSI Wikifier, http://wikifier.org (accessed in May, 2018). 
and multi-lingual nature of streaming data and can be produced at the multiscale levels – cross-country, country, city (area) levels. [7] GeoNames ontology, http://www.geonames.org/ontology/documentation.html (accessed The scenarios described above would result into a number of in August, 2018). smart labour market indicators with multiscale options. In particular: [8] Ontology (by Tom Gruber), - http://tomgruber.org/writing/ontology-definition-2007.htm Up-to date job vacancies statistics on a cross- (accessed in August, 2018). country/country/city(area) level. Example: job vacancies in UK and France in the last month [9] M. Klein and D. Fensel, Ontology versioning for the Semantic - Web, Proc. International Semantic Web Working Symposium Up-to date skills statistics on a cross- (SWWS), USA, 2001 country/country/city(area) level. Example: top 10 skills in UK in the last month [10] L. Stojanovic, B. Motik, Ontology evolution with ontology, - in: EKAW02 Workshop on Evaluation of Ontology-based Tools Up-to date location statistics. Example: top locations for (EON2002), CEUR Workshop Proceedings, Sigüenza, vol. 62, specific skill 2002, pp. 53–62 - Ontology development statistics. Example: number of [11] TensorFlow, https://en.wikipedia.org/wiki/TensorFlow concepts in the ontology (accessed in August, 2018). 12 Relation Tracker - tracking the main entities and their relations through time M. Besher Massri Inna Novalija Marko Grobelnik Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39, Ljubljana, Slovenia Jamova cesta 39, Ljubljana, Slovenia Jamova cesta 39, Ljubljana, Slovenia besher.massri@ijs.si inna.koval@ijs.si marko.grobelnik@ijs.si ABSTRACT contextual information provided as characteristic keywords, for a In this paper, we present Relation Tracker, a tool that tracks main quick detection of information from the original articles. entities [people and organizations] within each topic through time. The main types of relations between the entities are detected Regarding classifying news, we observe in [3] a new technique and observed in time. The tool provides multiple ways of that uses Deep Learning to increase the accuracy of prediction of visualizing this information with different scales and durations. online news popularity. The tool uses events data from Event Registry as a source of In the paper explaining Event Registry [1], we see how articles information, with the aim of getting holistic insights about the from different languages are grouped into events and the main searched topic. information and characteristics about them are extracted. Additionally, a graphical interface is implemented which allows search for events and visualize the results in multiple ways that KEYWORDS together give a holistic view about events. Information Retrieval, Visualization, Event Registry, Wikifier, Dmoz Taxonomy This work begins with the events as a starting point, and it is one more step on the same path; it groups events further into topics 1. INTRODUCTION and trends, then it focuses on tracking how some entities are Every day, tremendous amounts of news and information are appearing as main entities regarding the selected topic, and how being streamed throughout the Internet, which is requiring the the relationship between them is changing through time. implementation of more tools to aggregate this information. With technology advancement, those tools have been increasing in complexity and options provided. However, there has been a 3. 
DESCRIPTION OF DATA demand for tools that give simple yet holistic summary of the We used part of the events from Event Registry as our main searched topic in order to acquire general insights about it. source of data. We obtained a dataset of ~ 1.8 million events as a list of JSON files, with event’s dates between Jan 2015 and July Hence, we provide the Relation Tracker tool that tries to achieve 2016. Each event consists of general information like title, event this goal; it is based on the data from Event Registry [1], which is date, total article count, etc., and a list of concepts that a system for real-time collection, annotation and analysis of characterize the event, which is split into entity concepts and non- content published by global news outlets. The tool presented in entity concepts. Entity concepts are people, organizations, and this paper takes the events and groups them into topics, and locations related to the event. Whereas non-entity concepts within each topic, it provides an interactive graph that shows the represent abstract terms that define the topic of the event, like main entities of each topic at each time and the main topic of technology, education, and investment. Those concepts were relations between those entities. In addition, a summary extracted using JSI Wikifier [4] which is a service that enables information about entities and their relationship is visualized semantic annotation of the textual data in different languages. In through different graphs to help understand more about the topic. addition, each concept has a score that represents the relevancy of that concept to the event. The remainder of this paper is structured as follows. In section 2, we show the related work done in this area. In section 3, we provide a description of the used data. Section 4 explains the 4. METHODOLOGY methodology and main challenges that were involved in this work. Next, we explain the visualization features of the tool in section 5. Finally, we conclude the paper and discuss potential future work. 4.1 Clustering and Formatting Data To group the events into topics, we used K-Means clustering 2. RELATED WORK algorithm, where each event is represented as a sparse vector of the non-entity concepts it has, with the weights equal to their Similar works have been done in the area of visualizing scores in that event. The constant number of topics is set information extracted from news. We see in [2] a tool for efficient experimentally to be 100 clusters, in a balance between mixed visualization of large amount of articles as a graph of connected clusters and repeated clusters. Each cluster describes a set of entities extracted from articles, enriched with additional events that fall under the same topic, whereas the centroid vector of each cluster represents the main characteristics of it. To name 13 the clusters, we used category classifier service from Event 4.3 Detecting the Characteristics of Registry, which uses Dmoz Taxonomy [5], a multilingual open- content directory of World Wide Web links, that is used to Relationship classify texts and webpages into different categories; for each The main goal was to model the relationship between any two cluster, we formed a text consisting of the components of its entities through a vector of words where two entities are centroid vector, taking into account their weights within the collocated. Since the relationship between two entities at any vector. 
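A minimal sketch of this clustering and naming step is given below. It assumes each event is available as a dictionary mapping its non-entity concepts to their relevance scores; the field layout and the hand-off to the Event Registry category classifier are assumptions for illustration, not the authors' implementation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

def cluster_events(events, n_topics=100, top_k=10):
    """Group events into topics and build a naming text per cluster.

    `events` is assumed to be a list of dicts mapping non-entity concept
    labels to their scores, e.g. {"Technology": 0.8, "Investment": 0.3}.
    """
    vectorizer = DictVectorizer(sparse=True)
    X = vectorizer.fit_transform(events)          # sparse event-by-concept matrix

    km = KMeans(n_clusters=n_topics, random_state=0)
    labels = km.fit_predict(X)

    concept_names = vectorizer.get_feature_names_out()
    naming_texts = []
    for centroid in km.cluster_centers_:
        # take the highest-weighted centroid components as the cluster description
        top = centroid.argsort()[::-1][:top_k]
        words = []
        for idx in top:
            # repeat a concept proportionally to its weight so the downstream
            # classifier sees it more often (one way to "take weights into account")
            words.extend([concept_names[idx]] * max(1, int(round(10 * centroid[idx]))))
        naming_texts.append(" ".join(words))
    return labels, naming_texts

# Each naming text would then be sent to the Event Registry category classifier
# (Dmoz-based) to obtain a human-readable topic name for the cluster.
```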
The resulted cluster names were ranged from technology given time is based on the shared events between them, and each and business to refugees and society, and clusters were exported event is characterized by a set of concepts, we decided on using as a JSON file for processing them in the visualization part. those concepts - specifically the abstract or the non-entity concepts - to characterize such relationships. For each pair, we 4.2 Choosing the Main Entities aggregated all the non-entity concepts from the shared events between them, and each one of them was assigned a value based Under any topic, the top entities at each duration of time has to be on the number of events it is mentioned in and its score in those chosen. At first, the concepts were filtered from outliers like events. Those concepts were sorted and ranked depending on their publishers and news agencies. Then, an initial importance value values, and the top ones were chosen as the main features of the has been set for each concept based on two parameters: the TF- relationship. In addition, these values of the concepts were used to IDF score of concept with respect to each event, and the number rank the shared events and extract the top ones; by giving each of articles each event contains. If we denote the set of events that event a value equal to the aggregated values (the ones calculated occur in the interval of time D by ED, the number of articles that in previous step) of all non-entity concepts it has. To summarize event e contains is Ae, the TF-IDF score of concept c at event e by the set of characteristics, we classified them using Dmoz category Sc,e, then the importance value of each item with respect to the classifier in a similar way to what we have done in determining interval D is calculated by the formula: the names of the clusters. These categories were used to label the relationship between the entities, indicating the main topic of the shared events between them. 5. VISUALIZING THE RESULTS To access a topic, a search bar is provided to select among the list The TF-IDF function is used to give importance to the concept of extracted topics from clustering step. Once the user selects a based on its relevance to the events, and the number of articles is topic, a default date is chosen and a network graph is shown used to give more importance to the events that have more articles explaining the topic. talking about it, and hence, more importance to the concepts that it has. We decided on using the product of summation rather than 5.1 Characteristics of the Main Graph summation of product because of its computation efficiency while Since the tool’s main goal is to show the top entities and their still producing good results. However, to prevent the case where relations, the network graph is the best choice for this matter. all the chosen entities get nominated because of one or two big Following that, we have built an interactive network graph that events (which results in a bias towards those few events), a has the following features: modification to the importance value formula has been made by - The main entities within that topic at the selected interval introducing another parameter, which is the links between of time are represented by the vertices of the graph. concepts (whenever two concepts occur in the same event, there is - The size of the vertices reflects the importance value of a link between them). 
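The relationship characterization described in Section 4.3 above could be sketched roughly as follows; the exact way a concept's value combines the number of mentioning events with its scores is only described qualitatively, so the weighting below is an assumption.

```python
from collections import defaultdict

def relationship_features(shared_events, top_k=10):
    """Characterize the relation between two entities from their shared events.

    `shared_events` is assumed to be a list of events, each a dict with a
    "concepts" entry mapping non-entity concept labels to relevance scores.
    """
    concept_score = defaultdict(float)   # summed scores over shared events
    concept_events = defaultdict(int)    # number of shared events mentioning it
    for event in shared_events:
        for concept, score in event["concepts"].items():
            concept_score[concept] += score
            concept_events[concept] += 1

    # concept value = summed score weighted by event count (assumed combination)
    value = {c: concept_score[c] * concept_events[c] for c in concept_score}
    top_concepts = sorted(value, key=value.get, reverse=True)[:top_k]

    # rank the shared events by the aggregated values of the concepts they contain
    def event_value(event):
        return sum(value[c] for c in event["concepts"])

    top_events = sorted(shared_events, key=event_value, reverse=True)
    return top_concepts, top_events
```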
Each concept now affects negatively the each entity, scaled to a suitable ratio to fit in the canvas. other concepts it is linked to by an amount equal to the initial - The colors represent the type of the entity, whether it is a importance value divided by the number of neighbors. If we person [red] or an organization [blue]. denote the set of neighbors of concept c during the interval of - The links between the entities represent the existence of time D by Nc,D, then the negative importance value is defined by: shared events in that interval of time between them under that topic, and hence indicating some form of relations. The thickness of the links is proportional to the number of shared events, whereas the labels are the ones calculated in previous section. Figure 1 presents top companies with relevant relations in July The final score is just the initial importance value minus the 2015 found among business news. negative importance value, which is then used to sort and nominate the top entities. 14 Figure 1: Top companies in July 2015 and their relations Figure 3: The changes in top entities under the same topic under the business topic. after moving the interval for 15 days. 5.2 Main Functionality 5.3 Displaying Relation Information As the tool is concerned about tracking the changes with time. Whenever the user selects a pair of entities, detailed information The graph is supported with a slide bar that allows the user to about their relationship in the selected interval of time is given, choose from the dates where there is at least one event occurred such as the number of shared events and articles, along with the with respect to the selected topic. Different scales for moving top events both concepts were mentioned in. Also, the top shared dates are also provided; the user can choose to move day by day, characteristics that shape the relationship between them at this week by week, or month by month and see the changes period is shown and sorted by percentage of importance. As seen accordingly. In addition, the user can choose a specific interval of in Figure 4; when selecting Jeff Bezos and Elon Musk under the time, and track how the entities and their relations are changing space topic between January and September 2015, we see a list of when the interval moves slightly with respect to its length. An the top events that involve both of them during this period. We interval magnifier is also given if the user wants to get a closer see also that the relationship between them is mainly about look at the changes that happen in a small interval. sending astronauts by rockets to the international space station, as it can be understood from the top shared characteristics. An example illustrating that can be seen in Figures 2 and 3. In Figure 2, we see the top 10 entities under the refugee topic in the last two months of 2015. When the interval is moved by 15 days, we notice that some of the entities disappear, like European Commission, indicating that they are no longer among the top 10 entities, whereas “United States House of Representitive” entity emerges and connects to “Barack Obama” and “Repulican Party”. The change in size indicates the change in the importance value of each one, while Society is the general theme among all labels. Figure 4: Relationship summary about Jeff Bezos and Elon Musk between January and September 2015 under the Space Figure 2: Top entities for the last two months of 2015 under topic. the refugee topic. 
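The two scoring formulas referenced in Section 4.2 did not survive the text extraction, so the sketch below encodes one plausible reading of the surrounding description: the initial importance of a concept is the product of its summed TF-IDF scores and the summed article counts over the events in the interval ("product of summation"), and each concept then penalizes every linked neighbour by its own initial value divided by its number of neighbours; the final score is the initial value minus the accumulated penalty. Treat the exact arithmetic as an assumption.

```python
from collections import defaultdict

def entity_scores(events_in_interval):
    """Score candidate entities for one time interval (hypothetical reconstruction).

    Each event is assumed to look like:
      {"articles": 12, "entities": {"Elon Musk": 0.7, ...}}
    where the entity values are TF-IDF scores of the concept in that event.
    """
    tfidf_sum = defaultdict(float)
    article_sum = defaultdict(float)
    neighbours = defaultdict(set)

    for event in events_in_interval:
        ents = event["entities"]
        for c, score in ents.items():
            tfidf_sum[c] += score
            article_sum[c] += event["articles"]
        # concepts co-occurring in the same event are linked
        for c in ents:
            neighbours[c].update(e for e in ents if e != c)

    # initial importance: "product of summation"
    initial = {c: tfidf_sum[c] * article_sum[c] for c in tfidf_sum}

    # negative importance: each concept spreads its initial value
    # equally over its neighbours as a penalty
    negative = defaultdict(float)
    for c, nbrs in neighbours.items():
        if nbrs:
            share = initial[c] / len(nbrs)
            for n in nbrs:
                negative[n] += share

    return {c: initial[c] - negative[c] for c in initial}
```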
15 To illustrate how the importance of those top features with respect 6. CONCLUSION AND FUTURE WORK to the relationship is changing through time, a stream graph is In this paper, we provide a tool that uses events data from Event used as shown in Figure 5. Registry to show the main entities within each topic, and how the characteristics of relationship among them is changing through time. However, there are a couple of limitation to the tool that we want to improve in the future. Although we were able to detect the characterestics of the relationship between entities and how they are changing through time, the main type of relation that we used to label the links were very broad and hence rarely changing- improving the methodology for relation extraction and observation of relations in time will be the subject of future work. In addition, we limited the search space for topics for the 100 topics we obtained from clustering, we would like to generalize the search by enabling searches for any concept or keyword with different options to filter the search. 7. ACKNOWLEDGMENTS Figure 5: Stream graph showing how the effect of the main features on the relationship between Jeff Bezos and Elon Musk This work was supported by the euBusinessGraph (ICT-732003- is changing through time. IA) project [6]. 8. REFERENCES Finally, the set of all characteristics that affect the relationship is visualized in a tag cloud to give a big picture about it. Figure 6 shows the tag cloud of the same relationship mentioned above. [1] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14 Companion). ACM, New York, NY, USA, 107-110. DOI: https://doi.org/10.1145/2567948.2577024 [2] Marko Grobelnik and Dunja Mladenić. 2004. Visualization of news articles. Informatica 28. [3] Sandeep Kaur and Navdeep Kaur Khiva. 2016. Online news classification using Deep Learning Technique. IRJET 03/10 (Oct 2016). [4] Janez Brank, Gregor Leban and Marko Grobelnik. 2017. Annotating documents with relevant Wikipedia concepts. In Figure 6: Tag cloud illustrating a general view about all the Proceedings of siKDD2017. Ljubljana, Slovenia. characteristics that affects the relationship between Jeff Bezos and Elon Musk under the space topic. [5] Dmoz, open directory project, http://dmoz-odp.org/ (accessed in July, 2018) [6] euBusinessGraph project, http://eubusinessgraph.eu/ (accessed in July, 2018). 16 Cross-lingual categorization of news articles Blaž Novak Jožef Stefan Institute Jamova 39 Ljubljana, Slovenia +386 1 477 3778 blaz.novak@ijs.si ABSTRACT categories. We consider each document belonging to all In this paper we describe the experiments and their results categories that are explicitly stated, and all of their parents. We performed with the purpose of creating a model for automatic will compare the performance of model predictions on the same categorization of news articles into the IPTC taxonomy. We show language and in the cross-lingual setting, where we train the that cross-lingual categorization is possible using no training data model on the entire dataset available for one language, and from the target language. We find that both logistic regression and measure its performance on the other language. support vector machines are good candidate models, while Basic features of the dataset can be seen in the following 2 random forests do not perform acceptably. 
Furthermore, we show figures. Figure 1 shows the distribution of number of articles in that using Wikipedia-derived annotations provides more each category, and Figure 2 shows that most categories contain a information about the target class than using generic word roughly even number of articles in both languages, but there are features. some outliers. We ignored categories with less than 15 examples per language, which resulted in 308 categories. General Terms Algorithms, Experimentation Keywords News, articles, categorization, IPTC, Wikifier, SVM, Logistic regression, Random forests. 1. INTRODUCTION The JSI Newsfeed [1] system ingests and processes approximately 350.000 news articles published daily around the world, in over 100 languages. The articles are automatically cleaned up and semantically annotated, and finally stored and made available for downstream consumers. One of the annotation tasks that we would like to perform in the future is to automatically categorize articles into the IPTC “Media Topics” subject taxonomy [2]. IPTC – the International Press Figure 1. Number of articles in each category. Discrete Telecommunications Council – provides a standardized taxonomy categories on x axis are ordered by descending number of articles. of roughly 1100 terms, arranged into a 5 level taxonomy, describing subject matters relating to daily news. The vocabulary is accessible in a machine readable format – RDF/XML and RDF/Turtle – at http://cv.iptc.org/newscodes/mediatopic. There are two relations linking concepts in the vocabulary – the ‘broader concept’ taxonomical relation, and a ‘related concept’ sibling relation. The ‘related concept’ links concepts both to other concepts from the same taxonomy, and directly to external Wikidata [3] entities. The purpose of this work is to evaluate multiple machine learning algorithms and multiple sets of features with which we could automatically perform the categorization. As we would like to categorize articles in all the languages the Newsfeed system supports, but we only have example articles in English and Figure 2. Language imbalance for each category. Discrete French, the method needs to be language independent. categories on x axis are ordered from “mostly English” to “mostly French”. 2. EXPERIMENTAL SETUP We compare three different machine learning models – random The dataset that we have access consists of 30364 English and forests, logistic regression (LR), and Support Vector Machines 29440 French articles, each of which is tagged with 1 to 10 (SVM). 17 We try two different types of features, and their combinations. significantly worse. “Wiki-W” denotes the weighted version of Wikifier annotations, and “Wiki-K” the combination of KCCA- The first kind of a feature set we use is a projection of the bag-of- derived features and Wikifier annotations. Every second line in words representation of the document text into a 500 dimensional vector space. The KCCA [4] method uses an aligned multi-lingual the table is the standard deviation of the result when averaged corpus to find such a mapping, that words with similar meanings across all categories. map to a similar vector, regardless of their language. We represent a document as a sum of all word vectors. Table 1. ROC scores by model and feature type, cross- The second set of features we use is the output of the JSI Wikifier validation [5] system. The Wikifier links each word in a document to a set of Rand. Forest Log. Reg. SVM Wikipedia pages that might represent the meaning of that word. 
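The first, KCCA-based feature set described above can be sketched as follows, assuming a dictionary of pre-trained 500-dimensional cross-lingual word vectors (learning the KCCA projection from the aligned corpus is done offline and is not shown).

```python
import numpy as np

def document_embedding(tokens, word_vectors, dim=500):
    """Represent a document as the sum of its word vectors.

    `word_vectors` is assumed to map a token (in any supported language)
    to its 500-dimensional KCCA-projected vector; unknown tokens are skipped.
    """
    vec = np.zeros(dim)
    for token in tokens:
        v = word_vectors.get(token)
        if v is not None:
            vec += v
    return vec
```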
For each such annotation, we also get a confidence weight. We consider these annotations as a classical vector space model -- as a bag-of-entities. We use two versions of the TF-IDF [7] scheme: in the first case, we use the number of times an entity annotation is present for any word in a document as the TF (term frequency) factor, and in the second version, we use the sum of annotation weights of an entity across the document. In both cases, we perform L1 normalization of the vector containing TF terms. For IDF terms, we use log(1 + N/n), where N is the number of all documents and n the number of documents in which an annotation was present at least once.

Finally, we use a combination of both KCCA-derived and Wikifier-derived features as the last feature set option.

For model training, we use Python's scikit-learn [6] software package. In the case of logistic regression, we use the L2 penalty, with automatic decision threshold fitting, using the liblinear library backend.

For the SVM model, we use a stochastic gradient descent optimizer. We performed a grid search for the optimal regularization constant C, but since there were no significant accuracy changes, we used the default of 1.0 in all other experiments.

For the random forest model, we used 4 different parameter combinations:
• default – 10 trees, splitting until only one class is in the leaf
• 30 trees, maximum tree depth of 10
• 50 trees, maximum tree depth of 10
• 30 trees, maximum tree depth of 20
In all cases, the GINI index was used as the node splitting criterion. Since the majority of categories only have a small number of documents, we automatically weighted training examples by the inverse of their class frequency. We also performed some experiments without this weighting scheme, but got useless models in all cases except for the couple of largest categories.

All reported results are the average of a 3-fold cross-validation. So far, we only created one-versus-all models for each category independently, and only used the taxonomy information of all categories to select all examples from sub-categories when training the more general category.

3. RESULTS
Table 1 shows ROC scores for cross-validation of all three models on four sets of feature combinations, for English and French separately. SVM and logistic regression are comparable in behavior and promising, while the random forest model performs significantly worse.

            Rand. Forest    Log. Reg.      SVM
            EN     FR       EN     FR      EN     FR
KCCA        0.75   0.71     0.96   0.95    0.95   0.94
(stdev)     0.11   0.11     0.04   0.04    0.05   0.04
Wiki        0.70   0.70     0.95   0.95    0.94   0.94
(stdev)     0.12   0.12     0.04   0.04    0.05   0.04
Wiki-W      0.71   0.71     0.95   0.95    0.94   0.94
(stdev)     0.12   0.11     0.04   0.04    0.05   0.04
Wiki+K      0.71   0.69     0.97   0.96    0.96   0.95
(stdev)     0.12   0.11     0.03   0.03    0.03   0.04

Looking at the feature selections, we see almost no significant difference -- both kinds of features, KCCA and Wikipedia annotations, have useful predictive value. The combination of both feature types slightly improves the ROC score.

Table 2 shows F1 cross-validation scores of all three models. Logistic regression scores much higher than SVM here, possibly indicating that the SVM model would benefit from a post-processing step of optimizing the decision threshold on a separate training set.

Table 2. F1 scores by model and feature type, cross-validation
            Rand. Forest    Log. Reg.      SVM
            EN     FR       EN     FR      EN     FR
KCCA        0.16   0.12     0.30   0.25    0.20   0.18
(stdev)     0.21   0.18     0.21   0.20    0.21   0.19
Wiki        0.07   0.07     0.41   0.44    0.25   0.29
(stdev)     0.15   0.15     0.21   0.21    0.22   0.22
Wiki-W      0.08   0.08     0.40   0.43    0.24   0.28
(stdev)     0.17   0.17     0.21   0.21    0.21   0.22
Wiki+K      0.09   0.07     0.44   0.46    0.27   0.30
(stdev)     0.16   0.15     0.21   0.21    0.22   0.22

The combination of both feature sets performs significantly better than either alone, with generic word-based features providing the least amount of information.

The feature usefulness changes when looking at cross-lingual classification performance. Table 3 shows the ROC score for all three models, when the model trained on English is used to predict categories of French articles, and vice versa. Decision trees give essentially a random result, and SVM scores somewhat higher than logistic regression.

Table 3. ROC scores - cross-lingual classification
            Rand. Forest    Log. Reg.      SVM
            EN     FR       EN     FR      EN     FR
KCCA        0.50   0.50     0.50   0.50    0.50   0.51
(stdev)     0.00   0.00     0.01   0.03    0.04   0.08
Wiki        0.51   0.51     0.76   0.80    0.81   0.84
(stdev)     0.04   0.04     0.12   0.11    0.11   0.10
Wiki-W      0.51   0.52     0.78   0.82    0.82   0.84
(stdev)     0.04   0.05     0.11   0.10    0.10   0.10
Wiki+K      0.50   0.50     0.57   0.70    0.66   0.81
(stdev)     0.01   0.01     0.10   0.13    0.14   0.12

The biggest change here is the influence of the KCCA cross-lingual word embedding: by itself it provides no informative value, as indicated by a ROC value of 0.5 in all cases, and it even reduces the performance of the combined Wikifier + KCCA model.

In Table 4, F1 scores from the same experiment are shown. Logistic regression still has a big advantage over SVM, as in the same-language categorization setting. The change from the previous experiments is the influence of the weighting of Wikipedia features -- it increases the performance of all models.

Table 4. F1 scores - cross-lingual classification
            Rand. Forest    Log. Reg.      SVM
            EN     FR       EN     FR      EN     FR
KCCA        0.00   0.00     0.00   0.01    0.00   0.02
(stdev)     0.02   0.02     0.02   0.06    0.01   0.06
Wiki        0.03   0.04     0.48   0.44    0.30   0.26
(stdev)     0.10   0.11     0.21   0.20    0.22   0.22
Wiki-W      0.03   0.05     0.49   0.44    0.29   0.26
(stdev)     0.11   0.13     0.20   0.21    0.22   0.22
Wiki+K      0.00   0.00     0.18   0.40    0.20   0.23
(stdev)     0.04   0.04     0.22   0.22    0.19   0.21

An interesting observation is that the performance of the cross-lingual model is occasionally higher than that of the baseline cross-validation experiment. This anomaly however disappears for categories with a large amount of positive training examples. It also disappears if we reduce the amount of training examples in the cross-lingual experiment by 1/3 – the effect seems to be caused by cross-validation reducing the training dataset size.

The KCCA cross-lingual word embedding feature generation used here was tested in other experiments and systems and gives a useful feature set for comparison of documents across languages, so its negative impact on the performance of these models needs to be investigated in the future.

As the weighted Wikipedia feature set appears to be the best for the stated goal of cross-lingual article categorization, the results of the next experiments are shown only for it; we performed the same experiments on all other combinations, and the results broadly follow the conclusions from the previous section.

The following figures show the correlation of cross-validation and cross-lingual performance of the logistic regression and SVM models. Both the F1 score and the area under the ROC curve are shown for each of the 308 categories in the experiment, since they provide complementary information.

Figure 3. F1 score correlation for logistic regression
Figure 4. F1 score correlation for SVM
Figure 5. ROC score correlation for logistic regression
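The experimental setup described above (bag-of-entities TF-IDF weighting with idf = log(1 + N/n), and one-versus-all scikit-learn models per category) could look roughly like the sketch below; the annotation format, the use of class_weight="balanced" as a stand-in for the inverse-class-frequency example weighting, and the concrete hyperparameters are assumptions.

```python
import numpy as np
from math import log
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import normalize

def entity_idf(all_doc_annotations):
    """idf(e) = log(1 + N / n_e), as in the setup above."""
    N = len(all_doc_annotations)
    counts = {}
    for ann in all_doc_annotations:
        for e in ann:
            counts[e] = counts.get(e, 0) + 1
    return {e: log(1 + N / n) for e, n in counts.items()}

def entity_features(doc_annotations, idf, entities):
    """Bag-of-entities vector: L1-normalised TF (counts or summed weights) times IDF."""
    tf = np.array([doc_annotations.get(e, 0.0) for e in entities], dtype=float)
    if tf.sum() > 0:
        tf = normalize(tf.reshape(1, -1), norm="l1")[0]
    return tf * np.array([idf[e] for e in entities])

def one_vs_all_models(X, y):
    """One binary model per category, trained on the same feature matrix."""
    return {
        "logreg": LogisticRegression(penalty="l2", solver="liblinear",
                                     class_weight="balanced").fit(X, y),
        "svm": SGDClassifier(loss="hinge", class_weight="balanced").fit(X, y),
        "rf": RandomForestClassifier(n_estimators=30, max_depth=10,
                                     criterion="gini",
                                     class_weight="balanced").fit(X, y),
    }
```

In this sketch, per-category decision-threshold tuning, as discussed above for the SVM model, would be an additional post-processing step on a held-out set rather than part of model fitting.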
As the figures show, there is a good agreement between the cross-validation and the cross-lingual classification performance, giving us an ability to estimate cross-lingual performance based on the cross-validation score in the production environment. The difference between distributions for French and Figure 6. ROC score correlation for SVM English language models is consistent with the class imbalance for each of the categories. 19 The SVM model seems to have a more consistent behavior, so we will use it in the final application instead of logistic regression. Figures 7 through 10 show the F1 and ROC score behavior of logistic regression and SVM models for cross-validation and cross-lingual classification with regard to the number of positive examples in the category, separately for English and French language. While the SVM model underperforms on the F1 metric on average, it produces a better ranking of documents with respect to a category, as seen on ROC plots, especially for smaller categories. This further indicates the need for decision threshold tuning in the SVM model before we use its predictions. Figure 10 ROC score with respect to category size, cross- lingual prediction As expected, classification performance of all models improves with the number of training examples, but in cases of small categories, it appears that some are much easier to learn than others. 4. CONCLUSIONS AND FUTURE WORK We found that using a logistic regression model with weighted Wikifier annotations gives us a good enough result to use IPTC category tags as inputs for further machine processing in the Figure 7. F1 score with respect to category size, cross- Newsfeed pipeline. Before we can use this categorization for validation human consumption, we need to investigate automatic tuning of SVM decision thresholds on this problem, and add an additional filtering layer that takes into consideration interactions between categories beyond the sub/super-class relation. Additionally, the negative effect of KCCA-derived features for cross-lingual annotation needs to be examined. 5. ACKNOWLEDGEMENTS This work was supported by the Slovenian Research Agency as well as the euBusinessGraph (ICT-732003-IA) and EW-Shopp (ICT-732590-IA) projects. 6. REFERENCES [1] Trampuš M., Novak B., “The Internals Of An Aggregated Web News Feed” Proceedings of 15th Multiconference on Information Society 2012 (IS-2012). Figure 8. ROC score with respect to category size, cross- [2] https://iptc.org/standards/media-topics/ validation [3] https://www.wikidata.org/wiki/Wikidata:Main_Page [4] Rupnik, J., Muhič, A., Škraba, P. “Cross-lingual document retrieval through hub languages”. NIPS 2012, Neural Information Processing Systems Worshop, 2012 [5] Brank J., Leban G. and Grobelnik M. “Semantic Annotation of Documents Based on Wikipedia Concepts”. Informatica, 42(1): 2018. [6] Pedregosa, F., Varoquaux, G., Gramfort, A. et al. “Scikit- learn: Machine Learning in Python”. Journal of Machine Learning Research, 12. 2011, pp. 2825-2830. [7] K. Sparck Jones. "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation, 28 (1). 1972 Figure 9. 
F1 score with respect to category size, cross-lingual prediction 20 Transporation mode detection using random forest Jasna Urbančič Veljko Pejović Dunja Mladenić Artificial Intelligence Faculty of Computer and Artificial Intelligence Laboratory, Information science, Laboratory, Jožef Stefan Institute University of Ljubljana Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Večna pot 113, 1000 Ljubljana Jamova 39, 1000 Ljubljana, Slovenia Slovenia Slovenia jasna.urbancic@ijs.si veljko.pejovic@fri.uni-lj.si dunja.mladenic@ijs.si ABSTRACT While the first attempts to recognize user activity were ini- This paper addresses transportation mode detection for a tiated before smart phones, the real effort in that direc- mobile phone user using machine learning and based on mo- tion begun with the development of mobile phones having bile phone sensor data. We describe our approach to data built-in sensors [10], including GPS and accelerometer sen- collection, preprocessing and feature extraction. We eval- sors. There are still some studies that use custom loggers uate our approach using random forest classification with to collect the data [11, 17] or use dedicated devices as well focus on feature selection. We show that with feature selec- as smart phones [5]. Although GSM triangulation and local tion we can significantly improve classification scores. area wireless technology (Wi-Fi) can be employed for the purpose of transportation mode detection, their accuracy is 1. INTRODUCTION relatively low compared to GPS [11], so latest state of the art In the recent years we have witnessed a drastic increase in research is focused on transportation mode detection based sensing and computational resources that are built in mo- on GPS tracks and/or accelerometer data. bile phones. Most of modern cell phones are equipped with a Machine learning approaches for transportation mode detec- set of sensors containing triaxial accelerometer, magnetome- tion often rely on statistical, time-based, frequency-based, ter, and gyroscope, in addition to having a Global Position- peak-based and segment-based [8] features, however in most ing System (GPS). Smart phone operating system APIs of- cases statistical features and features based in frequency are fer activity detection modules that can distinguish between used [10, 11, 16]. Feature domains are described in Table different human activities, for example: being still, walk- 1. Statistical, time-based, and spectral attributes are com- ing, running, cycling or driving in a vehicle [2, 3]. However, puted on a level of a time frame that usually covers a few sec- APIs cannot distinguish between driving in different kind onds, whereas peak-based features are calculated from peaks of vehicles, for example driving a car or traveling by bus or in acceleration or deceleration. On the other hand, segment- by train. Recognizing different kind of transportation, also based features are computed on the recordings of the whole known as transportation mode detection, is crucial for mo- trip, which means that they cover much larger scale. Statis- bility studies, for routing purposes in urban areas where pub- tical, time-based, and spectral features are able to capture lic transportation is often available, for facilitating the users the characteristics of high-frequency motion caused by user’s to move towards more environmentally sustainable forms of physical movement, vehicle’s engine and contact between transportation [1], or to inspire them to exercise more. wheels and surface. 
Peak-based features capture movement In this paper we discuss the use of random forest in trans- with lower frequencies, such as acceleration and breaking portation mode detection based on accelerometer signal. We periods, which are essential for distinguishing different mo- focus on torized modalities. Additionally, segment-based features de- 1. feature extraction, and scribe patterns of such acceleration and deceleration periods [8]. 2. feature analysis to determine the most meaningful fea- tures for this specific problem and the choice of classi- Machine learning methods that are most commonly used fier. in accelerometer based modality detection include support vector machines, decision trees and random forests, how- Our main contribution is feature analysis, which revealed ever naive Bayes, Bayesian networks and neural networks the impact of each feature to the classification scores. have been used as well [11, 12]. Often these classifiers are 2. RELATED WORK used in an ensemble [16]. The majority of algorithms addi- tionally use Adaptive Boosting or Hidden Markov Model to improve the performance of the methods mentioned above [16, 8, 11, 10]. In the last years, deep learning has also been used [6, 14]. Accelerometer-only approach where only statistical features have been used reported 99.8% classification accuracy, how- ever users were instructed to keep the devices fixed position during a trip. Furthermore, only 0.7% of data was labeled as train [11]. State of the art approach to accelerometer-only 21 Domain Description (1) Data (1a) Mobile Statistical These features are include mean, standard de- acquisition applications viation, variance, median, minimum, maximum, range, interquartile range, skewness, kurtosis, root (2b) mean square. (2) Pre- (2a) (2b) Gravity Resampling Filtering Time processing Time-based features include integral and double estimation integral of signal over time, which corresponds to speed gained and distance traveled during that (3) Feature recording. Other time-based features are for ex- extraction ample auto-correlation, zero crossings and mean crossings rate. (4a) (4b) Frequency Frequency-based features include spectral energy, (4) Feature Correlation Statistical spectral entropy, spectrum peak position, wavelet analysis analysis analysis entropy and wavelet coefficients. These can be computed on whole spectrum or only on spe- (5) Clas- (5a) (5b) cific parts, for example spectral energy bellow Defining Choosing sification 50Hz. Spectrum is usually computed using fast feature sets classifiers Fourier transform, whereas wavelet is a result of the Wavelet transformation. Entropy measures are Figure 1: Detailed work flow diagram of the based on the information entropy of the spectrum proposed approach. We stacked general, high- or wavelet [7]. level tasks common in other approaches vertically, Peak Peak-based features use horizontal acceleration whereas subtasks specific to our approach are pic- projection to characterize acceleration and decel- tured horizontally. eration periods. These features include volume, intensity, length, skewness and kurtosis. Split the signal Convolute with Segment Segment-based include peak frequency, stationary Signal on acceleration Find peaks a box window and deceleration duration, variance of peak features, and station- ary frequency. The latter two are similar to ve- locity change rate and stopping rate used by [17]. 
Count or compute Segment-based features are computed on a larger scale than statistical, time-based or frequency- Number of peaks based features. Table 1: Feature domains and their descriptions Mean Peak height adopted from [8]. Peak-based Standard Peak width features deviation transportation mode detection relies on long accelerometer Skewness Peak width height samples. It uses features from all five domains for classifica- tion with AdaBoost with decision trees as a weak classifier Peak area and achieves 80.1% precision and 82.1% recall [8]. Figure 2: Work flows for extraction of peak-based The performance of transportation mode detection systems features. depends on the effectiveness of handcrafted features designed by the researchers, researcher’s experience in the field af- We collect five second samples of sensor data and resam- fects the results. Thus, there have been approaches that use ple them to sampling frequency 100 Hz in the preprocessing deep learning methods, such as autoencoder or convolutional phase. Resampling ensures us that our samples all contain neural network, to learn the features used for classification. 500 measurements. The most prominent problem we face in Using a combination of handcrafted and deep features for preprocessing concerns the correlation of acceleration mea- classification with deep neural network resulted in 74.1% surements with the orientation of the phone in the three classification accuracy [15]. dimensional space. Practically this means that gravity is measured together with the dynamic acceleration caused by 3. PROPOSED APPROACH phone movements. To eliminate gravity from the samples we perform gravity estimation on raw accelerometer signal. Work flow of the proposed approach is visualized in Figure We follow an approach proposed by Mizell [9]. Gravity es- 1. The first task is data collection. To collect data we use timation splits the acceleration to static and dynamic com- NextPin mobile library [4] developed by the Artificial In- ponent. Static component represents the constant accelera- telligence Laboratory at Jožef Stefan Institute. Library is tion, caused by the natural force of gravity, whereas dynamic embedded into two free mobile applications. The first one is component is a result of user’s motion. Furthermore, using OPTIMUM Intelligent Mobility [1]. OPTIMUM Intelligent this approach we are able to calculate vertical and horizontal Mobility is a multimodal routing application for three Eu- components of acceleration. ropean cities — Birmingham, Ljubljana, and Vienna. The second one is Mobility patterns [4]. Mobility patterns is es- We use preprocessed signal to extract features for classifica- sentially a travel journal. Both applications send five second tion. Features are divided into five domains, based on their long accelerometer samples every time OS’s native activity meaning and method of computation. We have listed the do- recognition modules, Google’s ActivityRecognition API [2] mains in Table 1. We extract features from three domains — for Android and Apple’s CMMotionActivity API [3], de- statistical, frequency, and peak. We extract statistical fea- tect that the user is traveling in a vehicle. We use that tures (maximal absolute value, mean, standard deviation, accelerometer samples for fine-grained classification of mo- skewness, 5th percentile, and 95th percentile) from dynamic torized means of transportation. acceleration and its amplitude, horizontal acceleration and 22 Set Accele. 
Features Size Feature set CA RE PR F1 D-S Dynamic Statistical 54 D-S 0.48 0.41 0.39 0.37 D-SF Dynamic Statistical, Frequency 94 D-SF 0.60 0.41 0.41 0.39 D-SFP Dynamic Statistical, Frequency, Peak 172 D-SFP 0.46 0.39 0.40 0.35 H-S Horizontal Statistical 54 H-S 0.64 0.40 0.43 0.41 H-SF Horizontal Statistical, Frequency 94 H-SF 0.53 0.39 0.43 0.36 H-SFP Horizontal Statistical, Frequency, Peak 172 H-SFP 0.50 0.37 0.40 0.34 ALL 376 ALL 0.47 0.35 0.40 0.33 Table 2: Predefined feature sets used for classifica- Table 3: Classification metrics for classification with tion. random forest on predefined feature sets. Change model parameters the training set we use the data from [13], whereas validation and test sets were obtained during Optimum pilot testing in 2018. During validation step we are trying to maximize F1 (2) (1) score as our data set is imbalanced. We visualized the evalu- Validate Evaluate Train ation scenario in Figure 3, while the composition of the sets Join datasets in pictured in Figure 4. (3) (4) Test Use best parameters Join datasets Train + and 4. RESULTS Validate evaluate We trained random forest classifier on the predefined fea- ture sets from Table 2. Classification metrics we report on Figure 3: Schema of evaluation scenario. include classification accuracy (CA), recall (RE), precision its amplitude, amplitude of raw acceleration, and amplitude (PS) and F1 score (F1) Results are listed in Table 3. Ta- of vertical acceleration. From the same signals we extract ble 3 shows that we achieved the highest F1 score of 0.41 frequency-based features using fast Fourier transformation. using H-S feature set. This feature set consists of statisti- As frequency-based features we use the power spectrum of cal features calculated on the horizontal acceleration vector. the signal aggregated in 5 Hz bins. Pipeline for extraction of Classification accuracy in that case is also high, compared to peak-based features from dynamic and horizontal in acceler- other feature sets. The peak features seems to be the source ation is pictured in Figure 2. To extract peak-based features of noise in the data, as using peak features in combination we first smooth out the signal with convolution with a box with the other features sets decreases the performance, for window and split it into moments of acceleration and mo- example F1 drops from 0.39 for D-SF to 0.35 for D-SFP. ments of deceleration. Then we find peaks and compute F1 score and classification for dynamic acceleration increase peak heights, peak widths, peak width heights, and peak when we add frequency-based features, whereas these two areas. As there is usually more than one peak we aggregate measures decrease in case of similar action for horizontal ac- these values using mean, standard deviation, and skewness. celeration. This offers two possible interpretations. One is All together we extract 376 features. We organize features that frequency-based features of dynamic acceleration carry into seven predefined feature sets we use for classification. more information compared to frequency-based features of Feature sets are listed in Table 2. horizontal acceleration. The second one is that statistical To evaluate the capabilities and performance of the pro- features of horizontal acceleration are much better than sta- posed approach, we divide our dataset in 3 subsets — train, tistical features from dynamic acceleration. 
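The preprocessing and feature-extraction steps described above could be sketched as follows. The gravity estimation follows Mizell's mean-based approach, while the smoothing width and the exact set of peak aggregates are assumptions; this is an illustration, not the authors' code.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import skew

FS = 100  # resampled sampling frequency in Hz (5 s samples -> 500 measurements)

def split_gravity(acc):
    """Mizell-style gravity estimation on one sample of shape (n, 3): the per-axis
    mean is the static (gravity) component, the remainder is dynamic acceleration,
    and the vertical part is the projection onto the gravity direction."""
    g = acc.mean(axis=0)                      # static component estimate
    dyn = acc - g                             # dynamic acceleration
    g_dir = g / np.linalg.norm(g)
    vertical = np.outer(dyn @ g_dir, g_dir)   # projection onto gravity direction
    horizontal = dyn - vertical
    return dyn, vertical, horizontal

def statistical_features(x):
    return [np.abs(x).max(), x.mean(), x.std(), skew(x),
            np.percentile(x, 5), np.percentile(x, 95)]

def frequency_features(x, bin_hz=5):
    """Power spectrum aggregated into 5 Hz bins."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
    return [power[(freqs >= lo) & (freqs < lo + bin_hz)].sum()
            for lo in range(0, FS // 2, bin_hz)]

def peak_features(x, box=10):
    """Smooth with a box window, find peaks, aggregate their heights/widths/areas."""
    smooth = np.convolve(x, np.ones(box) / box, mode="same")
    peaks, props = find_peaks(smooth, height=0, width=1)
    if len(peaks) == 0:
        return [0.0] * 7
    heights, widths = props["peak_heights"], props["widths"]
    areas = heights * widths
    return [len(peaks), heights.mean(), heights.std(),
            widths.mean(), widths.std(), areas.mean(), skew(areas)]
```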
validation, and test set — based on the date the samples We noticed that smaller feature sets generally perform better were recorded on. By doing so we avoided using in this than larger so we focused on feature selection. We initially domain methodologically questionable random assignment train the model with all features and evaluate it on valida- of samples collected during the same trip to different sub- tion set. Then we remove each feature one by one, train the sets. The reason why we did not apply cross-validation is model, evaluate it on the validation set and compare all F1 similar. Using samples from the same trip in train and test scores. We eliminate the feature with the highest F1 score, set would result in significantly higher evaluation scores. For as this means that when the model was trained without that feature if performed better than when the eliminated feature was included. We repeat this procedure until the feature set consists of one feature. Similarly, we do feature addition — we start with an empty feature set and gradually add features one by one. Using the described process of forward feature selection and backward feature elimination we selected two feature sets that performed the best — in case of forward selection the best feature set has 10 features, whereas feature set pro- duced with backward elimination has 28 features. Feature set obtained by forward selection mostly contains statisti- cal features, followed by peak-based. Only one frequency- based features appears in that set. Additionally, features Figure 4: Distribution of modes in train, validation, are in vast majority extracted from dynamic acceleration. and test set. We also added joint train and valida- On the other hand feature set obtained by backward elim- tion set, which we use to train the final model. 23 Feature set CA RE PR F1 J. Urbančič. Optimum project: Geospatial data Forward selection (10) 0.70 0.50 0.47 0.48 analysis for sustainable mobility. In 24th ACM Backward elimination (28) 0.73 0.50 0.48 0.49 SIGKDD International Conference on Knowledge Table 4: Classification metrics for classification with Discovery & Data Mining Project Showcase Track. the selected features in feature selection. ACM, 2018. http://www.kdd.org/kdd2018/files/ project-showcase/KDD18_paper_1797.pdf. Forward selection Backward elimination [5] K.-Y. Chen, R. C. Shah, J. Huang, and L. Nachman. T \P Car Bus Train T \P Car Bus Train Mago: Mode of transport inference using the Car 0.78 0.27 0.05 Car 0.83 0.12 0.05 hall-effect magnetic sensor and accelerometer. Bus 0.51 0.40 0.09 Bus 0.55 0.35 0.10 Proceedings of the ACM on Interactive, Mobile, Train 0.47 0.21 0.32 Train 0.45 0.23 0.32 Wearable and Ubiquitous Technologies, 1(2):8, 2017. Table 5: Confusion matrix for classification with the [6] S.-H. Fang, Y.-X. Fei, Z. Xu, and Y. Tsao. Learning selected features in feature selection. transportation modes from smartphone sensors based ination contains more peak-based features than statistical, on deep neural network. IEEE Sensors Journal, again only one frequency-based feature appears. Dynamic 17(18):6111–6118, 2017. acceleration and horizontal acceleration appear in similar [7] D. Figo, P. C. Diniz, D. R. Ferreira, and J. M. proportions. We evaluated the models trained with that Cardoso. Preprocessing techniques for context feature sets against the test set. Results are listed in Ta- recognition from accelerometer data. Personal and ble 4. 
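The backward feature elimination described above can be sketched as follows (forward selection proceeds symmetrically, starting from an empty set); the classifier settings and the macro-averaged F1 used here are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def backward_elimination(X_train, y_train, X_val, y_val):
    """Greedy backward elimination driven by validation F1."""
    remaining = list(range(X_train.shape[1]))
    history = []

    def score(features):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train[:, features], y_train)
        pred = clf.predict(X_val[:, features])
        return f1_score(y_val, pred, average="macro")

    while len(remaining) > 1:
        # try dropping each feature in turn and keep the drop that scores best,
        # i.e. remove the feature whose absence gives the highest validation F1
        trials = [(score([f for f in remaining if f != cand]), cand)
                  for cand in remaining]
        best_f1, to_drop = max(trials)
        remaining.remove(to_drop)
        history.append((list(remaining), best_f1))

    # return the feature subset with the highest validation F1 along the way
    return max(history, key=lambda item: item[1])
```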
Both feature sets achieve better F1 scores than any Ubiquitous Computing, 14(7):645–662, 2010. previous feature sets. Confusion matrix in Table 5 reveals [8] S. Hemminki, P. Nurmi, and S. Tarkoma. what are the differences between these two feature sets. We Accelerometer-based transportation mode detection can see that in case of eliminating features there is less cars on smartphones. In Proceedings of the 11th ACM missclassified as buses and more buses missclassified as cars. Conference on Embedded Networked Sensor Systems, Classification of trains is fairly consistent. For buses and page 13. ACM, 2013. trains the largest part of samples is still missclassified as [9] D. Mizell. Using gravity to estimate accelerometer cars. orientation. In Proc. 7th IEEE Int. Symposium on 5. CONCLUSIONS Wearable Computers (ISWC 2003), page 252. Citeseer, 2003. We showed that while transportation mode with random for- est is possible, careful feature selection is necessary. Using [10] S. Reddy, M. Mun, J. Burke, D. Estrin, M. Hansen, feature selection we were able to improve classification scores and M. Srivastava. Using mobile phones to determine for at least 0.04, in some cases even over 0.10. Although clas- transportation modes. ACM Transactions on Sensor sification scores improved, most of non-car samples were still Networks (TOSN), 6(2):13, 2010. misclassified as cars. We observed that even though peak- [11] M. A. Shafique and E. Hato. Use of acceleration data based features did not perform as well in predefined feature for transportation mode prediction. Transportation, sets, they appeared consistently among selected features in 42(1):163–188, 2015. feature selection. That does not hold for frequency-based [12] L. Stenneth, O. Wolfson, P. S. Yu, and B. Xu. feature only one feature appeared among selected features. Transportation mode detection using mobile phones For the future work we suggest maximization of another clas- and gis information. In Proceedings of the 19th ACM sification score as we focused on maximization of F1 score. SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 54–63. 6. ACKNOWLEDGMENTS ACM, 2011. This work was supported by the Slovenian Research Agency [13] J. Urbančič, L. Bradeško, and M. Senožetnik. Near under project Integration of mobile devices into survey re- real-time transportation mode detection based on search in social sciences: Development of a comprehensive accelerometer readings. In Information Society, Data methodological approach (J5-8233), and the ICT program of Mining and Data Warehouses SiKDD, 2016. the EC under project OPTIMUM (H2020-MG-636160). [14] T. H. Vu, L. Dung, and J.-C. Wang. Transportation 7. REFERENCES mode detection on mobile devices using recurrent nets. [1] Optimum project - European Union’s Horizon 2020 In Proceedings of the 2016 ACM on Multimedia research and innovation programme under grant Conference, pages 392–396. ACM, 2016. agreement No 636160-2. [15] H. Wang, G. Liu, J. Duan, and L. Zhang. Detecting http://www.optimumproject.eu/, 2017. [Online; transportation modes using deep neural network. accessed 4-November-2017]. IEICE TRANSACTIONS on Information and [2] ActivityRecognition. https://developers.google. Systems, 100(5):1132–1135, 2017. com/android/reference/com/google/android/gms/ [16] P. Widhalm, P. Nitsche, and N. Brändie. Transport location/ActivityRecognition, 2018. [Online; mode detection with realistic smartphone sensor data. accessed 31-August-2018]. 
In Pattern Recognition (ICPR), 2012 21st [3] CMMotionActivity. https://developer.apple.com/ International Conference on, pages 573–576. IEEE, library/ios/documentation/CoreMotion/ 2012. Reference/CMMotionActivity_class/index.html#// [17] Y. Zheng, Q. Li, Y. Chen, X. Xie, and W.-Y. Ma. apple_ref/occ/cl/CMMotionActivity, 2018. [Online; Understanding mobility based on gps data. In accessed 31-August-2018]. Proceedings of the 10th international conference on [4] L. Bradeško, Z. Herga, M. Senožetnik, T. Šubic, and Ubiquitous computing, pages 312–321. ACM, 2008. 24 FSADA, an anomaly detection approach A modern, cloud-based approach to anomaly-detection, capable of monitoring complex IT systems Viktor Jovanoski Jan Rupnik Jozef Stefan International Postgraduate School Jozef Stefan Institute Jamova 39 Jamova 39 Ljubljana, Slovenia Ljubljana, Slovenia viktor@carvic.si jan.rupnik@ijs.si ABSTRACT huge volumes or just a few data points per day. Designing Modern IT systems are becoming increasingly complex and a system that can cope with such diverse situations can be inter-connected, spanning over a range of computing de- challenging. vices. As software systems are being split into modules and services, coupled with an increasing parallelization, de- Another important aspect is ”actionability” of the reported tecting and managing anomalies in such environments is anomalies. When human operator is presented with a new hard. In practice, certain localized areas and subsystems alert, the message as to what is wrong needs to be clear. The provide strong monitoring support, but cross-system error- operator must be able to immediately start addressing the correlation, root-cause analysis and prediction are an elusive problem. Sometimes all we need is a different presentation target. of the result, but most often the easy-to-describe algorithms and models are used - e.g. linear regression or nearest neigh- We propose a general approach to what we call Full-spectrum bour. anomaly detection - an architecture that is able to detect lo- cal anomalies on data from various sources as well as creating This high velocity of data (volume and rate) makes some high-level alerts utilizing background knowledge, historical of the algorithms less usable in such scenarios - specifically data and forecast models. The methodology can be imple- batch processing that requires random access to all past mented either completely or partially. data is not desired. Ideally, we would only use streaming algorithms - algorithms that live on the stream of incoming Keywords data, where each data point is processed only once and then discarded. Anomaly detection, Outlier detection, Infrastructure moni- toring, Cloud The contribution of this paper is a hollistic approach to anomaly detection system that clearly defines different parts 1. INTRODUCTION and stages of the processing, including active learning as a Modern IT systems need several key capabilities, apart from crucial part of the processing loop. The design addresses tracking and directing the underlying businesses. They need modern challenges in IT system monitoring and is suitable to manage errors and failures - predict them in advance, for cloud deployment. detect them in their early stages, help limit the scope of the damage and mitigate their consequences. All this is achieved by analyzing past and current data and detecting outliers in 2. ANOMALY-DETECTION it. 
Anomaly detection must happen in near-real time, while simultaneously analyzing potentially thousands of data points per second. Incoming data that such a system can monitor is very diverse. Data can come in different shapes (numeric, discrete or text), in regular time intervals or sporadically, in huge volumes or just a few data points per day.

The most common definition of an anomaly is a data point that is significantly different from the majority of other data points. See [2] for a detailed explanation. This definition is strictly analytical. But most often the users define it within the scope of their operation, such as finding abnormal engine performance in order to prevent catastrophic failure, flagging unexpected delays in a manufacturing pipeline in order to prevent shipment bottlenecks, detecting unusual user behavior that indicates intrusion, and identifying market sectors that exhibit unusual trends to detect fraud and tax evasion.

The anomaly-detection process is thus heavily influenced by the target domain. It also needs a process-specific way of measuring the detection efficiency. For instance, in classification problems we can use several established measures such as accuracy, recall, precision or F1. In anomaly detection, on the other hand, we often don't have classes to work with
2.2 Modern challenges In the era of big data there are many systems that produce 3.1 Terminology data and a lot of the generated data can be used to monitor, From now on we will be using the following terminology: maintain and improve the target system. The data volumes are staggering and need to be addressed properly within the anomalies - any kind of abnormal behavior inside the sys- system implementation. tem, regardless of the effect on the system performance. Users expect results to be available as soon as possible - signals - low-level anomalies that have been detected on within hours, sometimes even minutes or seconds. In cases single data-stream. where automated response in possible, this time-frame short- ens to miliseconds (e.g. stock trading, network intrusion). incidents - complex anomaly or a group of them with major impact on the system. Its time duration is usually limited Current systems for anomaly detection are developed as add- to several minutes or hours. They are closely related to the ons to the existing systems for collecting and processing way users perceive the system problems and outages. data. This makes sense, since they developed organically, during the usage by the competent users, which identified alerts - an anomaly that is reported to the user, self-contained areas that require advanced monitoring. We belive this pro- with explanation and basic context. vides necessary business validation of anomaly detection sys- tems. However, there are limitations of such approach. 3.2 Storage module The system needs to store several types of data that per- • Data that is available in one part of the system might form different functions. Each part of the storage layer can not be available in another part, where anomaly- be located in separate system that best matches the require- detection could greatly benefit from it. ments. • Data volume could prove to be too big for effective Measurement data represents raw values that were ob- anomaly detection analysis, because needed resources served and processed in order to monitor the system. This might not be available (e.g. computing power is needed data is strictly speaking not necessary when our algorithms for main processing and anomaly detection should not are designed to work on a stream, but they are required interfere with it). for batch algorithms, for model retraining, active learning • Anomaly detection has local scope as it only pro- and for ad-hoc analytics. Generated signals and inci- cesses data from one part of the system. The alerts dents are stored, additionally processed and viewed by the are thus not aware of the potential problems in other user. The storage needs to support flexible format of alerts, parts of the system, so resolving issues takes longer since each one of them is ideally an independent chunk of and involves more people from several departments to data that can be visualized without additional data retrieval. coordinate during problem escalations. The algorithms can use domain knowledge to guide their execution. To facilitate this, the data needs to be stored in • There is no systematic way of collecting the user feed- a storage system that provides fast searching, in order to be back that would guide and improve the anomaly de- used in stream processing steps for enrichment, routing and tection process. aggregation. 
The algorithms inside the system create and update their models all the time, so this part of the storage needs to support reliable and robust storing of possibly large binary files.

Figure 1: The big picture - displays the main building layers, such as stream processing and storage, as well as the flow of the data between the different components of the system.

3.3 Stream-processing module
This module contains the most important part of the system - the components that transport the data, run the processing and generate alerts.

3.3.1 Incoming-data pre-processing
Incoming data that the system analyses arrives at different volumes and speeds (high velocity), as well as in many different types and formats. This data needs to be pre-processed before it reaches any anomaly-detection algorithms.

Coping with such a high-volume data stream requires special technologies. Streaming solutions such as Apache Kafka [4] have been developed and battle-tested for processing millions of data records per second in a distributed manner. This step needs to perform several functions.

Data formatting and enrichment - transform messages from the input format into a common format that is accepted by the internal algorithms. Also, additional data fields can be attached, based on background knowledge.

Data aggregation - sometimes we want to measure characteristics of all the data within some time interval (e.g. average speed in the last 10 minutes).

Data routing - send the transformed and aggregated data to the relevant receivers. There may be several listeners for the same type of input data.

3.3.2 Signal detectors
When data is ready for processing, it is routed to signal detectors. They operate on a single data stream, often only on a small partition of it - e.g. a single stock or a group of related stocks. They handle huge data volumes, so they need to be fast, using very little resources. To achieve great flexibility regarding input data, a dynamic allocation of such processors is required. This enables handling of previously unseen data partitions as well as scalability in parallel processing.

These anomalies (signals) have simple models and consequently simple alert explanations. But they are local in nature - their scope is most often very limited. They also operate on a single data stream, so they don't take into account the anomalies in "the neighbourhood". To overcome these deficiencies, we propose a third step of stream processing, to which signals should be sent.

3.3.3 Incident detectors
This stage of the processing receives signals from different parts of the system, performing scoring of their importance and combining them into incidents, thus achieving several advantages.

The scoring algorithm provides the option to assign user-guided subjective importance to signals - e.g. two statistically equally important anomalies can have completely different perceived value to the user. This step can also correlate data across data streams, a step that is hard to achieve and that proves to be very valuable. Given data from different parts of the system, it can create more complex constructs that better evaluate the impact of the current problem on the whole system.

This level of abstraction is the main access point for end-users - it more closely follows their way of addressing system malfunctions (e.g. "if module A breaks, it will have impact on modules B and C, but module D should remain unaffected").
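As a rough illustration of the scoring step in Section 3.3.3, the sketch below combines a statistical signal score with a user-assigned subjective weight per stream and only promotes a group of signals to an incident above a threshold. The weighting scheme, stream names and threshold are assumptions made for illustration, not the authors' actual algorithm.

from typing import Dict, List, Optional

# Hypothetical per-stream importance weights collected from user feedback.
USER_WEIGHTS: Dict[str, float] = {"orders": 2.0, "payments": 1.5, "logging": 0.3}

def score_signal(stream_id: str, statistical_score: float) -> float:
    # Two statistically equal signals can end up with a different perceived value.
    return statistical_score * USER_WEIGHTS.get(stream_id, 1.0)

def combine_into_incident(signals: List[dict], threshold: float = 3.0) -> Optional[dict]:
    # Sum the weighted scores of temporally close signals from different streams;
    # only report an incident if the combined impact is large enough.
    total = sum(score_signal(s["stream_id"], s["score"]) for s in signals)
    if total < threshold:
        return None
    return {
        "signals": signals,
        "impact": total,
        "streams": sorted({s["stream_id"] for s in signals}),
    }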
3.3.4 Background knowledge
To help guide the algorithms during signal detection, we can provide additional background knowledge in different forms, such as metadata, manual thresholds and rules, graphs and other structures. All this data can be used to perform various enhancements of the basic algorithms, such as creation of additional features in data pre-processing, up- and down-voting of results (e.g. estimated impact of a detected anomaly), pruning of the search space in optimization steps, estimation of affected entities for a given anomaly, or support for complex simulations where current performance is measured against historical values. These rules and metadata can be acquired by analyzing historical data as well as by collecting knowledge from end-users, e.g. through manual feedback/sign-off and active learning.

3.4 Improving actionability
The system modules presented so far are mostly established components that are used also in normal processing steps of modern, cloud-based systems (see [1]). We propose that they should be upgraded with the following functionalities in order to achieve the goal of high-quality actionable alerts, empowering users to manage their complex systems in the most efficient way.

3.4.1 Feedback
Historical incidents are very valuable for learning informative features that aid the detection of anomalies. They are also used for calibrating the scoring algorithm that assigns relevance scores to generated signals and incidents. It is common for the organization to require every major detected incident to be manually signed off - a relevance tag (e.g. high-relevant, semi-relevant, not-relevant, noise, new-normal) has to be assigned to it by the operators. These tags are used for training incident-classification algorithms, but we can also construct a more complex setting where a form of backtracking is used to calibrate the signal detectors.

3.4.2 Active learning
The active-learning approach [3] can be used to make the manual classification effort more efficient. The system provides untagged examples/incidents for which the criteria function returns the value that is closest to the decision boundary. The user then manually classifies the incident and the classification model is re-trained with this new data. By guiding users in this way, the system requires a relatively small number of steps to cover the search space and obtain good learning examples.

Our proposed approach incorporates this continuous activity in several areas. The GUI module should contain appropriate pages where the user can enter feedback and active-learning input. The storage module contains historical alert data that can be used for re-training of incident detectors. The storage module also contains old and new incident-detector models that can be picked up automatically by the processing module.
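A minimal sketch of the uncertainty-sampling loop described in Section 3.4.2, using scikit-learn as a stand-in. The choice of logistic regression and the binary relevant/not-relevant labels are illustrative assumptions, not the system's actual classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pick_next_incident(model, untagged_features: np.ndarray) -> int:
    # Select the untagged incident whose predicted probability is closest to
    # the decision boundary (0.5), i.e. the one the model is least sure about.
    proba = model.predict_proba(untagged_features)[:, 1]
    return int(np.argmin(np.abs(proba - 0.5)))

def active_learning_round(tagged_X, tagged_y, untagged_X, ask_operator):
    # Re-train on what is already tagged, ask the operator about the most
    # uncertain incident, and grow the labeled set by one example.
    model = LogisticRegression(max_iter=1000).fit(tagged_X, tagged_y)
    idx = pick_next_incident(model, untagged_X)
    label = ask_operator(untagged_X[idx])            # manual sign-off by the user
    tagged_X = np.vstack([tagged_X, untagged_X[idx]])
    tagged_y = np.append(tagged_y, label)
    untagged_X = np.delete(untagged_X, idx, axis=0)
    return model, tagged_X, tagged_y, untagged_X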
4. VALIDATION AND DISCUSSION
Based on our extensive experience with practical anomaly-detection implementation, we identified several new requirements for these systems. The presented approach satisfies them by supporting big-data real-time analytics on one side and actionability via active-learning support on the other.

The system architecture is deployable to a cloud environment by design. We also employ modern streaming and storage technologies for transporting and storing the different input data and alerts.

We observed that users appreciate our notion of incidents - a grouping of alerts that occur in a certain small time interval. Users feel comfortable with seeing the big picture (an incident) and then drilling down into specific data (individual signals). They reported that this feature enables them to cut down the time for understanding a problem by an order of magnitude (from hours to minutes).

The active-learning component was well received, as it made manual work more efficient. The users noticed how the algorithm was choosing more and more complex learning examples for manual classification. This helped them feel productive and engaged. They also reported a positive impact of active learning on their problem understanding, as they were presented with some potentially problematic situations that had gone unnoticed in the past.

Based on the above observations we conclude that our proposed approach has a positive impact on the organization, both for the technologies as well as the human operators. Additional ideas that were collected from users are listed under future work.

5. CONCLUSIONS AND FUTURE WORK
The focus of our future work is on several advanced scenarios where a lot of added value for users is expected, mixing anomaly detection, optimization and simulation. Main gains are expected to come from feedback collection and active learning. Apart from monitoring IT systems, the target domains are also manufacturing and smart cities. We also collected some features that users commonly inquired about:

• Root-cause analysis - when a major incident occurs, many parts of the system get affected. To resolve issues as quickly as possible, the operators should be pointed to the origin of the problem. The anomaly-detection system should thus have the capability to point to the first signal with high impact on the final relevance score.
• Predictions - the goal of all monitoring systems is to detect problems as soon as possible. The system must thus not only be able to detect signals, but also to forecast them, based on past behavior. In order to do that, the algorithms require more metadata and structure to properly model the inter-dependencies between signals. Mere observation is much easier than simulation of a complex system with many moving parts. But it is possible, and it is what users expect from a modern AI-based system.

Our future research will be oriented towards providing and efficiently integrating these functionalities into our anomaly-detection approach.

6. REFERENCES
[1] Anodot anomaly detection system. http://www.anodot.com, 2018.
[2] C. C. Aggarwal. Outlier Analysis. Springer New York, New York, 2013.
[3] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. CoRR, cs.AI/9603104, 1996.
[4] N. Garg. Apache Kafka. Packt Publishing, 2013.
[5] J. Lin. The lambda and the kappa. IEEE Internet Computing, 21(5):60-66, 2017.

Predicting customers at risk with machine learning

David Gojo, Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia, david.gojo@ijs.si
Darko Dujič, Ceneje d.o.o. and Jožef Stefan International Postgraduate School, Štukljeva cesta 40, 1000 Ljubljana, Slovenia, darko.dujic@ceneje.si
ABSTRACT
Today's market landscape is becoming increasingly competitive as more advanced methods are used to understand customer behavior. One of the key techniques are churn-mitigation tactics, which are aimed at understanding which customers are at risk of leaving the service provider and how to prevent this departure. This paper analyzes account renewal rates and uses easily applicable models to predict which accounts will be decreasing spend at the time when they are due to renew their existing contract, based on a number of attributes. The key question it tries to explore is whether customer behavioral data or customer characteristic data (or a combination of both) is better at predicting accounts that will renew at lower than the renewal target amount (churn rate).

Categories and Subject Descriptors
F.2.1 [Numerical Algorithms and Problems]: Data mining, Structured prediction

General Terms
Algorithms, Management, Measurement, Documentation, Performance

Keywords
Data Mining, Analysis, Churn prediction.

1. INTRODUCTION
The main issue of business is how to make educated decisions with the support of analysis that dissects complex decisions into addressable problems using measurements and algorithms. While many disciplines research the methodological and operational aspects of decision making, at the main level we distinguish between decision sciences and decision systems [1]. With an increasing number of companies trying to use machine learning to assist in their decision-making process, we examined how decision science can be supplemented by applying machine learning models to a company's customer data. We partnered with a medium-sized B2B business operating in Europe and Africa with the aim to help them better understand their 'customers at risk' segment of clients.

To this end we developed two easily applicable algorithms designed to highlight customers at risk, which the company can then address to mitigate their risk of leaving as clients.

The paper has the following structure: in section 2 we present related work in the area. Next, data acquisition is explained in section 3, followed by the results acquired from the tested algorithms in section 4. We then conclude our observations in section 5.

2. RELATED WORK
Improvements in tracking technology have enabled data-driven industries to analyze data and create insights previously unavailable to the business. Data mining techniques have evolved to now support the prediction of customer behavior, such as the risk of leaving, based on the attributes that are trackable [2]. The use of data mining methods has been widely advocated, as machine learning algorithms such as random-forest approaches have several advantages over traditional explanatory statistical modeling [3].

The lack of a predefined hypothesis makes algorithms excel in these tasks, as it makes it less likely to overlook predictor variables or potential interactions that would otherwise be labelled unexpected [4]. Such models are often labelled as business intelligence models aimed at finding customers that are about to switch to competitors or leave the business [5].

Key classifications are observed in work related to churn that we will use in our data set for review [6]:
- Behavioral data - attributes that we have historically observed to play a role in whether the account will renew or not: product utilization, activity levels of the account, number of successful actions in the account, and number of upsells done ahead of renewal.
- Characteristic attributes - size of the account in terms of spend, number of members in the company, number of active users of the products in the company, payment method and how they renew the contract, geography, and what level of support the product is given (number of sales visits and interactions with the customer).
3. DATA ACQUISITION
3.1 Data understanding
Working with the customer, we arranged a set of interviews with the leadership to better understand their business and the challenges they are experiencing, together with the ambitions they have in the business. After the interview rounds we focused on reviewing 2 hypotheses flagged in the examination process:
- What is driving churn numbers: behavior of the customers or better structure of the base?
- Does acquisition of new accounts represent a risk in churn numbers, with historic observation of accounts renewing lower / not renewing in their first-year renewal?

3.2 Data pre-processing
The data we used derives from the company's internal customer bookings and customer databases, which we consolidated. As customers are on yearly renewals, we have taken the renewal and the data on the account before the renewal as the key building block for the analysis.

3.3 Feature engineering
We enriched the data using SQL joins on the customer numbers to include key characteristics of accounts, tenure of the client, product utilization information, behavior of the customer before the renewal, and their usage of the product.

In terms of regional split of the market, the dataset consists of 4 key geo and segment regions in Europe and Africa:
- Medium-business segment
- UK & Ireland market
- Europe Enterprise segment
- Eastern Europe, Middle-East and Africa

Through feature engineering and reviewing descriptive statistics, we nominalized 11 key attributes. For the machine learning purposes we selected 3 possible outcomes related to customer spend after the renewal:
- Customer_Renew (Not renew, Partial renew, Full renew)

3.4 Data Set Statistics
We selected the bookings period from 2016 to the end of 2017, including 23,043 instances in the above selected renewal of 12,872 accounts. The attributes that were nominalized are listed below:
- (nom) Has_main_product - has product 1
- (nom) Has_assisting_product - has product 2
- (nom) Has_media_product - has product 3
- (nom) Account_potential - size and potential of the account
- (nom) Is_Auto_Renew - auto renewal option enabled
- (nom) First_renewal - is the client renewing for the first time
- (nom) Upsold_Before_renewal - was there an upsell before the renewal
- (nom) JS_Utilization - utilization of product 2 - indicator
- (nom) Score_Engagement - engagement of the recruiter
- (nom) LRI_Score - savviness of the user of the product

4. RESULTS
4.1 Brief description of the methods used
While multiple algorithms were used during the testing, an important requirement was that the result needed to include at least one interpretable model, so we went in the direction of nominalizing attributes and decided to use the J48 model and the Random forest classifier on the data set.

J48. The decision tree C4.5 (J48 in Weka) algorithm deals with continuous attributes, as observed in the related work. Where the method is classification-only, the main machine learning method applied is the J48 pruned tree, i.e. the WEKA-J48 machine learning method. The tree tries to partition the data set into subsets by evaluating the normalized information gain from choosing a descriptor for splitting the data. The training process stops when the resulting nodes contain instances of single classes, or if no descriptor can be found that would result in information gain.

Random Forest. We assume that the reader knows about the construction of single classification trees. Random Forest grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest) [7]. Both methods were applied to the imported dataset numerous times, with continuous testing of parameters to improve performance.

4.2 Application of J48
Working with Weka on the customer's dataset, we tried to tune the model parameters. Key modifications:
- "10-fold cross validation" used to improve accuracy
- Minimum number of objects moved to 100

As Figure 2 shows, this reduced the number of leaves to 16, which was comprehensible enough. A summary of the results is given below.

Figure 1: J-48 model error estimates
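The paper runs this workflow in Weka; the snippet below is a rough scikit-learn equivalent of the same setup (a pruned decision tree with a minimum leaf size of 100, a shallow random forest, and 10-fold cross-validation), shown only to make the procedure concrete. The feature matrix X and labels y are random placeholders standing in for the nominalized attributes and the Customer_Renew outcome.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the renewal instances.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 10))          # nominalized attributes
y = rng.choice(["NOT_RENEWED", "PARTIAL_RENEW", "FULL_RENEW"], size=1000)

# Roughly analogous to Weka J48 with minNumObj=100 (limits leaf size, i.e. pruning).
j48_like = DecisionTreeClassifier(min_samples_leaf=100)
print("Decision tree, 10-fold CV accuracy:",
      cross_val_score(j48_like, X, y, cv=10).mean())

# Roughly analogous to the random forest run with a maximum depth of 3.
rf = RandomForestClassifier(max_depth=3, random_state=0)
print("Random forest, 10-fold CV accuracy:",
      cross_val_score(rf, X, y, cv=10).mean())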
4.3 Application of Random forest
We ran several tests on Random forest vs Random trees. When tuning the parameters, the Random tree tended not to respond well to pruning, so Random forest was the preferred option. Like for J48, the application with key modifications was focused on validation and additionally on setting the maximum depth of the random forest:
- "10-fold cross validation"
- Max. depth set at 3

A summary of the results is given below.

Figure 2: Random forest model error estimates

4.4 Comparisons of models
Overall the J48 model has, surprisingly, a 0.7pp higher Classification Accuracy than the Random forest model.

Table 1. Baseline benchmark validation measures
Validation Measures      J48     Random Forest
Classification Accuracy  72.9%   72.2%
Mean absolute error      0.276   0.280

J48 provided significantly better interpretability and classification accuracy than the Random forest or any test on the Random tree model. Some additional tests were done on a Naïve Bayes model, and J48 was superior in the results. The key area where it excelled was in predicting accounts that will not renew: while the precision is just above 38%, this is almost double compared to the Random forest model.

A key observation when analyzing the data was that neither model was predicting any partially churned accounts after we forced their trees to be pruned.

J48 predictions:
   a      b      c   <-- classified as
   0   2745    285   |  a = PARTIAL_RENEW
   0   1528    789   |  b = FULL_RENEW
   0   2434   1504   |  c = NOT_RENEWED

Figure 3: The J48 decision tree

Random forest predictions:
   a      b      c   <-- classified as
   0   2857    173   |  a = PARTIAL_RENEW
   0  15591    483   |  b = FULL_RENEW
   0   2894   1044   |  c = NOT_RENEWED

The 3 key takeaways that the company found the most insightful were:
- One of the new features designed by the product team, which encouraged the auto-renew of their clients, played the most important role in predicting the renewal rate
- Customer behavior is a better signal for the probability of renewal than general account characteristics
- Account potential, which is the predictor of account potential and size, plays a role only after product utilization and engagement of the account with the products

5. CONCLUSION AND FUTURE WORK
For the acceleration of performance, the decision tree is of paramount importance and value. The insight that Auto renew as a feature is one of the key predictors of whether the account will renew fully is truly remarkable, given the simplicity of the models and how easily applicable they are.

Applications of these models will be a great foundation for driving the discussion on different account features and metrics. This is especially true as it tackles one of the key challenges observed in the hypotheses, namely how important 'account potential' is for the account ahead of the renewal.
Even though Random forest has a lower classification accuracy, J48 offers significantly higher interpretability, with the pruned tree offering valuable insights, as described briefly above and discussed in the evaluation of the models.

Observing the attributes added to the analysis scope and optimizing them for J48, we were able to gain valuable insight into which account characteristics vs account behaviors ahead of the renewal are the best predictors for the account to renew at the full amount.

6. REFERENCES
[1] M. Bohanec. Decision Making: A Computer-Science and Information-Technology Viewpoint, vol. 7, 2009, pp. 22-37.
[2] A. Rodan, A. Fayyoumi, H. Faris, J. Alsakran and O. Al-Kadi. "Negative correlation learning for customer churn prediction: a comparison study." TheScientificWorldJournal, vol. 2015, p. 473283, 2015.
[3] A. K. Waljee, P. D. R. Higgins and A. G. Singal. "A Primer on Predictive Models." Clinical and Translational Gastroenterology, vol. 5, no. 1, pp. e44-e44, 2014.
[4] Y. Zhao, B. Li, X. Li, W. Liu and S. Ren. "Customer Churn Prediction Using Improved One-Class Support Vector Machine." Springer, Berlin, Heidelberg, 2005, pp. 300-306.
[5] M. Óskarsdóttir, B. Baesens and J. Vanthienen. "Profit-Based Model Selection for Customer Retention Using Individual Customer Lifetime Values." Big Data, vol. 6, no. 1, pp. 53-65, 2018.
[6] S. Kim, D. Choi, E. Lee and W. Rhee. "Churn prediction of mobile and online casual games using play log data." PLOS ONE, vol. 12, no. 7, p. e0180735, 2017.
[7] J. Hadden, A. Tiwari, R. Roy and D. Ruta. "Computer assisted customer churn management: State-of-the-art and future trends." Computers & Operations Research, vol. 34, no. 10, pp. 2902-2917, 2007.
[8] A. K. Meher, J. Wilson and R. Prashanth. "Towards a Large Scale Practical Churn Model for Prepaid Mobile Markets." Springer, Cham, 2017, pp. 93-106.
[9] M. Ballings, D. Van den Poel and E. Verhagen. "Improving Customer Churn Prediction by Data Augmentation Using Pictorial Stimulus-Choice Data." Springer, Berlin, Heidelberg, 2012, pp. 217-226.

Text mining MEDLINE to support public health

João Pita Costa, Luka Stopar, Flavio Fuart, Marko Grobelnik - Jožef Stefan Institute and Quintelligence, Ljubljana, Slovenia
Raghu Santanam, Chenlu Sun - Arizona State University, USA
Paul Carlin - South Eastern Health and Social Care Trust, UK
Michaela Black, Jonathan Wallace - Ulster University, UK

ABSTRACT
Today's society is data rich and information driven, with access to numerous data sources available that have the potential to provide new insights into areas such as disease prevention, personalised medicine and data-driven policy decisions. This paper describes and demonstrates the use of text mining tools developed to support public health institutions to complement their data with other accessible open data sources, optimize analysis and gain insight when examining policy. In particular we focus on the exploration of MEDLINE, the biggest structured open dataset of biomedical knowledge. In MEDLINE we utilize its terminology for indexing and cataloguing biomedical information - MeSH - to maximize the efficacy of the dataset.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous
General Terms
Measurement, Performance, Health.

Keywords
Big Data, Public Health, Healthcare, Text Mining, Machine Learning, MEDLINE, MeSH Headings.

1. MEANINGFUL BIG DATA TOOLS TO SUPPORT PUBLIC HEALTH
The Meaningful Integration of Data, Analytics and Service [MIDAS], Horizon 2020 (H2020) project [1] is developing a big data platform that facilitates the utilisation of healthcare data beyond existing isolated systems, making that data amenable to enrichment with open and social data. This solution aims to enable evidence-based health policy decision-making, leading to significant improvements in healthcare and quality of life for all citizens. Policy makers will have the capability to perform data-driven evaluations of the efficiency and effectiveness of proposed policies in terms of expenditure, delivery, wellbeing, and health and socio-economic inequalities, thus improving current policy risk stratification, formulation, implementation and evaluation. MIDAS enables the integration of heterogeneous data sources, and provides privacy-preserving analytics, forecasting tools and visualisation modules of actionable information (see the dashboard of the first prototype in Figure 1). The integration of open data is fundamental to the participatory nature of the project and its core ideology, that heterogeneity brings insight and value to analysis. This will democratize, to some extent, the contribution to the results of MIDAS. Moreover, it enables the MIDAS user to profit from the often powerful information that exists in these open datasets. A case in point is MEDLINE, the scientific biomedical knowledge base, made publicly available through PubMed. The set of tools described in this demonstration paper focuses on this large open dataset, and the exploration of its structured data.

Figure 1. MIDAS platform dashboard, composed of visualisation modules customized to the public health data sourced in each governmental institution, and combined with open data.

2. THE MEDLINE BIOMEDICAL OPEN DATASET AND ITS CHALLENGES
2.1. MEDLINE DATASET
With the accelerating use of big data, and the analytics and visualization of this information being used to positively affect the daily life of people worldwide, health professionals require ever more efficient and effective technologies to bring added value to the information outputs when planning and delivering care. The day-to-day growth of online knowledge requires that high-quality information sources are complete and accessible. A particular example of this is the PubMed system, which allows access to the state of the art in medical research. This tool is frequently used to gain an overview of a certain topic using several filters, tags and advanced search options. PubMed has been freely available since 1997, providing access to references and abstracts on life sciences and biomedical topics. MEDLINE is the underlying open database [7], maintained by the United States National Library of Medicine (NLM) at the National Institutes of Health (NIH). It includes citations from more than 5,200 journals worldwide in approximately 40 languages (about 60 languages in older journals). It stores structured information on more than 27 million records dating from 1946 to the present. About 500,000 new records are added each year. 17.2 million of these records are listed with their abstracts, and 16.9 million articles have links to full text, of which 5.9 million articles have full text available for free online use. In particular, it includes 443,218 full-text articles with the key-word string "public health".
2.2. MEDLINE STRUCTURE
The MEDLINE dataset includes a comprehensive controlled vocabulary - the Medical Subject Headings (MeSH) - that delivers a functional system for indexing journal articles and books in the life sciences. It has proven very useful in the search of specific topics in medical research, which is particularly valuable for researchers conducting initial literature reviews before engaging in particular research tasks. Humans annotate most of the articles in MEDLINE with MeSH Heading descriptors. These descriptors permit the user to explore a certain biomedical topic, relying on curated information made available by the NIH. MeSH is composed of 16 major categories (covering anatomical terms, diseases, drugs, etc.) that further subdivide from the most general to the most specific in up to 13 hierarchical depth levels.

2.3. MEDLINE INDEX
This paper demonstrates the interactive data visualisation and text-mining tools that enable the user to extract meaningful information from MEDLINE. To do that we are using the underlying ontology-like structure MeSH. MEDLINE data, together with the MeSH annotation, is indexed with ElasticSearch and made available to data analytics and visualisation tools. This will be discussed in more detail in the next section.

The manipulation and visualization of such a complete data source brings challenges, particularly in the efficient search, review and presentation when choosing appropriate scientific knowledge. The manipulation and visualisation of complex text data is an important step in extracting meaningful information from a dataset such as MEDLINE. Although powerful, the online search engine provided by the NLM does not provide suitable tools for in-depth analysis and the emergence of scientific information. As one of the main goals of MIDAS is to experiment with advanced visualisation techniques in support of public health policy making, a suitable MIDAS PubMed repository had to be developed. This repository has to allow exploration of a wide range of different visualisation techniques in order to evaluate their applicability to policy-making tasks within the policy cycle. Therefore, there was a need for the selection of a powerful, semi-structured text index that would allow free-text searches, but also allow the creation of complex queries based on available metadata. An obvious choice is elasticSearch, which combines features provided by NoSQL databases with standard full-text indexes, as it is based on the Apache Lucene index. The main design challenge when choosing this particular toolset was that querying based on arrays or parent-child relations is not supported, meaning that for complex use-cases different indexes based on the same source dataset have to be created. Nevertheless, excellent results, particularly with regard to performance, have been obtained.
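To make the kind of query this index supports more concrete, the sketch below sends a full-text search with a metadata filter and a simple aggregation to Elasticsearch over its standard _search REST endpoint. The index name "medline", the field names and the local URL are assumptions made for illustration; they are not the actual MIDAS index layout.

import requests

ES_URL = "http://localhost:9200"   # assumed local Elasticsearch instance

query = {
    "size": 10,
    "query": {
        "bool": {
            "must": [{"match": {"abstract": "childhood obesity"}}],
            "filter": [{"range": {"publication_year": {"gte": 2010, "lte": 2017}}}],
        }
    },
    # Count matching citations per MeSH heading (field name is illustrative).
    "aggs": {"by_mesh": {"terms": {"field": "mesh_headings.keyword", "size": 20}}},
}

response = requests.post(f"{ES_URL}/medline/_search", json=query).json()
for hit in response["hits"]["hits"]:
    print(hit["_source"].get("title", ""))
for bucket in response["aggregations"]["by_mesh"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])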
2.4. MEDLINE DASHBOARD
One of the identified needs motivating this work is assuring the availability of a dynamic dashboard that permits the user to explore data visualisation modules, representing queries to the MEDLINE dataset through pie charts, bar charts, etc. [5]. The dashboard that we made available (Figure 2) feeds on that dataset through the elasticSearch index discussed earlier. It is composed of several interactive visualisation modules that use mouse hover for interaction and provide information through pop-up messages on several aspects of the data based on particular queries of interest (e.g. a pie chart representing the "public health" citations that talk about "childhood obesity" during a selected period of time; or a bar chart showing different concepts included in the articles related to "mental health" in Finnish scientific journals). The MEDLINE dataset is mostly in the English language, but includes a significant volume of translated abstracts of scientific articles that were written in several other languages.

The open source data visualisation tool Kibana is a plugin to elasticSearch that supports the described dashboard. Thus, it is the data visualisation dashboard of choice for elasticSearch-based indexes, such as the one we present here. It is used in the context of MIDAS for fast prototyping and support of part of the MIDAS use-cases. While the dashboard itself serves the less technical user to explore the data available (over a subset of the data generated by a topic of interest), other options are available that permit more control of the data by the data scientists at a more operational level. These are: (i) the management dashboard, where the technical user can perform the appropriate subsampling based on the topics of interest and the required advanced options over the available data features; and (ii) the visual modules creator, permitting the technical user to easily create new interactive visualisation modules. Moreover, this tool enables one to query large datasets and produce different types of visualisation modules that can be later integrated into customized dashboards. The flexibility of such dashboards permits the user to profit from data visualisations that feed on his/her preferences, previously set up as filters to the dataset.

The MIDAS data visualisation tools permit the user to explore the potential of the MEDLINE dataset, based on pie charts and other representations that are easy to comprehend, interact with, and to communicate. They also enable a public instance based on a particular query to the dataset, which includes different types of data visualisation modules that can later integrate a customised dashboard, designed in agreement with the workflows and preferences of the end-user. This live dashboard can easily be integrated through an iframe in any website, not showing the customization settings but maintaining the interaction capability and the real-time update. It provides a complete base solution to further explore the MEDLINE index and the associated dataset [6].

Figure 2. MEDLINE data visualisation tool enabling exploration of that open dataset in its full potential, based on data representations easy to understand and to communicate. It provides an interactive public instance that can be managed at the dashboard management tool (below), for which the visualisation modules are constructed (in the center) based on the queries made to the MEDLINE dataset (above).
3. MeSH CLASSIFIER
This rich data structure in the MEDLINE open set is annotated by human hand (although assisted by semi-automated NIH tools) and is therefore not available in the most recent citations. However, in the context of MIDAS we made available an automated classifier based on [2] that is able to suggest the categories of any health-related free text. It learns over the part of the MEDLINE dataset that is already annotated with MeSH, and is able to suggest categories for submitted text snippets. These snippets can be abstracts that do not yet include MeSH classification, medical summary records or even health-related news articles. To do that we use a nearest centroid classifier [3] constructed from the abstracts from the MEDLINE dataset and their associated MeSH headings. Each document is embedded in a vector space as a feature vector of TF-IDF weights. For each category, a centroid is computed by averaging the embeddings of all the documents in that category. For higher levels of the MeSH structure, we also include all the documents from descendant nodes when computing the centroid. To classify a document, the classifier first computes its embedding and then assigns the document to one or more categories whose centroids are most similar to the document's embedding. We measure the similarity as the cosine of the angle between the embeddings.

Preliminary analysis shows promising results. For instance, when classifying the first paragraph of the Wikipedia page for "childhood obesity", excluding the keyword "childhood obesity" from the text, the classifier returns the following MeSH headings:

"Diseases/Pathological Conditions, Signs and Symptoms/Signs and Symptoms/Body Weight/Overweight",
"Diseases/Pathological Conditions, Signs and Symptoms/Signs and Symptoms/Body Weight/Overweight/Obesity".

The demonstrator version of the MeSH classifier is already available through a web app, as well as through a REST API using a POST call, and is at the moment under qualitative evaluation. This is being done together with health professionals with years of practical experience in using MeSH themselves through PubMed. In addition, we aim to further explore the potential of the developed classifier in several public health related contexts, including non-classified scientific articles of three types: (i) review articles; (ii) clinical studies; and (iii) standard medical articles. The potential impact of this technology will also include electronic health records and the monitoring of health-related news sources. We believe that this approach will address an identified recurrent need of health departments to enhance the biomedical knowledge, and motivate a step change in health monitoring.

Figure 3. A screenshot of the web app of the MEDLINE classifier, when requesting the automated MeSH annotation of a scientific review article abstract extracted from PubMed (in the body of text above), and the results as MeSH headings descriptors, including their tree path in the MeSH ontology-like structure (center), their similarity measure (right) and their positioning in the classification (left).
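The following is a minimal sketch of the nearest-centroid scheme described above (TF-IDF vectors, per-category centroids, cosine similarity), using scikit-learn and NumPy. The tiny in-line corpus and the category names are made up for illustration; the real classifier is trained on MeSH-annotated MEDLINE abstracts.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy training data: (abstract text, MeSH-like category) pairs.
docs = ["obesity in children and weight gain",
        "weight loss and body mass index in adults",
        "influenza vaccination coverage in primary care",
        "seasonal flu vaccine effectiveness"]
labels = ["Body Weight/Overweight", "Body Weight/Overweight",
          "Immunization", "Immunization"]

vectorizer = TfidfVectorizer()
X = normalize(vectorizer.fit_transform(docs).toarray())   # unit-length TF-IDF rows

# One centroid per category: the average of its (normalized) document vectors.
categories = sorted(set(labels))
centroids = normalize(np.vstack(
    [X[[i for i, l in enumerate(labels) if l == c]].mean(axis=0) for c in categories]))

def classify(text: str, top_k: int = 2):
    v = normalize(vectorizer.transform([text]).toarray())
    sims = (centroids @ v.T).ravel()          # cosine similarity to each centroid
    order = np.argsort(sims)[::-1][:top_k]
    return [(categories[i], float(sims[i])) for i in order]

print(classify("childhood weight problems and diet"))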
4. MEDLINE SEARCHPOINT
The efficient visualisation of complex data is today an important step in obtaining the research questions that describe the problem extracted from that data. The MEDLINE SearchPoint is an interactive exploratory tool refocused from the open source technology SearchPoint [8] (available at searchpoint.ijs.si) to support health professionals in the search for appropriate biomedical knowledge. It exhibits the clustered keywords of a query, after searching for a topic. When we use indexing services (such as standard search engines) to search for information across a huge amount of text documents - the MEDLINE index described in Section 2 being an example - we usually receive the answer as a list sorted by a relevance criterion defined by the search engine. The answer we get is biased by definition. Even by refining the query further, a time-consuming process, we can never be confident about the quality of the result. This interactive visual tool helps us in identifying the information we are looking for by reordering the positioning of the search results according to subtopics extracted from the results of the original search by the user.

For example, when we enter the search term 'childhood obesity', the system performs an elasticSearch search over the MEDLINE dataset and extracts groups of keywords that best describe different subgroups of results (these are the most relevant, not the most frequent, terms). This tool gives us an overview of the content of the retrieved documents (e.g. we see groups of results about prevention, pregnancy, treatments, etc.) represented by: (i) a numbered list of 10 MEDLINE articles with a short description extracted from the first part of the abstract; (ii) a word-cloud representing the k-means clusters of topics in the articles that include the searched keywords; (iii) a pointer that can be moved through the word-cloud and that will change the priority of the listed articles.

The word-cloud in (ii) is built by taking a set of MEDLINE documents S and transforming them into vectors using TF-IDF, where each dimension represents the "frequency" of one particular word. For example, let's say that we have document D1: "psoriasis is bad" and document D2: "psoriasis is good". These could be transformed as D1 = (1, 1, 1, 0) and D2 = (1, 1, 0, 1). The documents are then clustered into k groups G1, G2, ..., Gk using the k-means algorithm. For each group we compute the "average" document (centroid), which is the representative of the group. The most frequent words in the "average" document are drawn in the word cloud - the central grey word cloud is the "average" of all the documents in S. We can calculate how similar a particular document d is to a group Gi by calculating the cosine of the angle between the vector representation of d and the "average" document (centroid) of the group Gi.

By dragging the red SearchPoint ball over the word-groups, we provide the relevance criteria to the search result, thus bringing to the top the articles we are most interested in (see Figure 4). When the ball is moved, we calculate, for each document, the similarity to each of the word-groups and combine it with the distance between the ball and the group. The result is used as the ranking weight, where the document with the highest cumulative weight is ranked first. When hovering the mouse over the word-clouds we get a combination of the most frequent words shown in the tag clouds, which change based on how close the ball is to a particular group. After moving the SearchPoint over the word cloud highlighting "primary care", a qualitative study in primary care on childhood obesity that occupied position 188 is now in the first position. The user can read its title and the first lines of the abstract, and when clicking on it, the system opens the article in the browser at its PubMed URL location.

Figure 4. A screenshot of the MEDLINE SearchPoint, with groups of keywords (on the right) extracted from the search results, represented by different colors, and on the left the re-indexed search results themselves, with the position at which they appear in the original index [6].
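A compact sketch of the mechanism just described - TF-IDF vectors, k-means topic groups, and a ranking weight that mixes document-to-group similarity with the distance of the draggable pointer ("ball") to each group. The combination formula and the 2-D group positions are illustrative assumptions; the actual SearchPoint weighting is described in [8].

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

docs = ["childhood obesity prevention in schools",
        "obesity and pregnancy outcomes",
        "primary care interventions for obese children",
        "treatment of obesity with lifestyle changes"]

X = normalize(TfidfVectorizer().fit_transform(docs).toarray())
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
centroids = normalize(km.cluster_centers_)

# Cosine similarity of every document to every topic group.
doc_group_sim = X @ centroids.T                     # shape (n_docs, k)

# Hypothetical 2-D screen positions of the groups and of the draggable ball.
group_xy = np.array([[0.2, 0.8], [0.8, 0.2]])
ball_xy = np.array([0.75, 0.25])                    # user dragged it near group 1

# Closer groups get more influence on the ranking (simple inverse-distance mix).
ball_weight = 1.0 / (np.linalg.norm(group_xy - ball_xy, axis=1) + 1e-6)
ball_weight /= ball_weight.sum()

ranking_weight = doc_group_sim @ ball_weight        # cumulative weight per document
for i in np.argsort(ranking_weight)[::-1]:
    print(f"{ranking_weight[i]:.3f}  {docs[i]}")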
5. CONCLUSION AND FUTURE WORK
To further extend the usability of the MEDLINE SearchPoint, we are developing a data visualisation tool that permits the user to plot the top results most related to a topic of interest, as explored with SearchPoint. Based on the choice of a time window and a certain topic, such as "mental health", the user is able to view the clustered MEDLINE documents, roll over the plot or click to view the plotted points, each of which represents an article in PubMed. This will be done through multidimensional scaling, plotting the articles in the subsample using cosine text similarity. The difficulty of plotting large datasets using these methods, and the lack of potential in the outcomes of that heavy computation, led the team to plot only the first hundred results of the explorations done within MEDLINE SearchPoint. With this extended tool the healthcare professional will be able to: (i) explore a certain area of research with the aim of a more accessible scientific review, identifying the evidence base for a medical study or a diagnostic decision; (ii) identify areas of dense scientific research corresponding to searchable topics (e.g. the evaluation of the coverage of certain rare diseases that need more biomedical research, or the identification of alternative research paths to overpopulated but inefficient research); and (iii) explore the research topic over time windows that enable filtering to avoid known unreliable results.

In line with this work we have been developing research to contribute to the smart automation of the production of biomedical review articles. This collaborative research, led by Raghu T. Santanam at Arizona State University, aims to provide a wide knowledge over a restricted topic across the wider knowledge available in MEDLINE. We utilize the deep learning algorithm Doc2vec [4] to create similarity measures between articles in our corpus. For that we built a balanced test dataset and three different representations of the corpus, and compared the performance between them. The implementation currently builds a matrix of similarity scores for each article in the corpus. In the next steps, we will compare the similarity of documents from our implementation against the baseline for a randomly chosen set of articles in the corpus.

The further development of the MeSH classifier will consider the usability feedback of health professionals working in partner institutions, profiting from their years of experience with MEDLINE and MeSH itself, to tune the system to ensure the best usability in the domain. Furthermore, we will use the outcomes of the final version of this classifier to label health-related news with the MeSH Headings descriptors, enabling a new approach to the processing and monitoring of population health, population awareness of certain diseases, and the general public acceptance of public health decisions through news.

ACKNOWLEDGMENTS
We thank the support of the European Commission on the H2020 project MIDAS (G.A. nr. 727721).

REFERENCES
[1] B. Cleland et al. (2018). Insights into Antidepressant Prescribing Using Open Health Data. Big Data Research, doi.org/10.1016/j.bdr.2018.02.002
[2] L. Henderson (2009). Automated text classification in the dmoz hierarchy. TR.
[3] C. Manning et al. (2008). Introduction to Information Retrieval. Cambridge Univ. Press, pp. 269-273.
[4] T. Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
[5] J. Pita Costa et al. (2017). Text mining open datasets to support public health. In Conf. Proceedings of WITS 2017.
[6] J. Pita Costa et al. (2018). MIDAS MEDLINE Toolset Demo. http://midas.quintelligence.com (accessed 28-8-2018).
[7] F. B. Rogers (1963). Medical subject headings. Bull Med Libr Assoc. 51: 114-6.
[8] L. Stopar, B. Fortuna and M. Grobelnik (2012). Newssearch: Search and dynamic re-ranking over news corpora. In Conf. Proceedings of SiKDD 2012.
Crop classification using PerceptiveSentinel

Filip Koprivec, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia, filip.koprivec@ijs.si
Matej Čerin, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia, matej.cerin@ijs.si
Klemen Kenda, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia, klemen.kenda@ijs.si

ABSTRACT
Efficient and accurate classification of land cover and land usage can be utilized in many different ways, ranging from natural resource management and agriculture support to legal and economic process support. In this article, we present an implementation of land cover classification using the PerceptiveSentinel platform. Apart from using the 13 base bands, only minor feature engineering was performed, and different classification methods were explored. We report F1 and accuracy scores (80-90%) in the range of the state of the art for pixel-wise classification, and even comparable to time-series based land cover classification.

Keywords
remote sensing, earth observation, machine learning, classification

1. INTRODUCTION
Specific aspects of earth observation (EO) data (huge amount of data, widespread usage, many different problem settings etc.), coupled with the recent launch of the ESA Sentinel mission that provides a huge volume of data relatively frequently (every 5-10 days), present an environment that is suitable for current machine learning approaches.

Efficient and accurate land cover classification can provide an important tool for coping with current climate change trends. Classification of crops, their location and potentially their yield prediction provide various interested parties with information on crop resistance and adaptation to changes in temperature and water levels. Along with direct help, accurate crop classification tools can be used in a variety of other programs, from government-based subsidies to various insurance schemes.

Along with the previously mentioned promising features of EO data, data acquisition and processing pose some specific challenges. Satellite-acquired data is highly prone to missing data for various reasons: mostly cloud coverage, (cloud) shadows, and atmospheric refraction due to changes in atmospheric conditions. Additionally, correct training data, either for classification or regression, is hard to come by, must be relatively recent, and must be regularly updated due to changes in land use. Furthermore, correct labels and crop values are almost impossible to verify and are usually self-reported, which often means that the quality of training data is not perfect. Valero et al. [13] raise the problem of incorrect (or partially correct) data labels, which will become apparent when interpreting results.

Another class of problems is posed by the spatial resolution of images. Since satellite images provided by the ESA Sentinel-2 mission have a resolution of 10 m × 10 m on the most granular bands and 60 m × 60 m on bands used for atmospheric correction, land cover irregularities falling in this order of magnitude might not be detected and correctly learned on. This problem is especially prevalent in smaller and more diverse regions, where individual fields are smaller and land usage is more fragmented.

The current state of the art in land classification focuses heavily on the temporal dimension of acquired data [1], [13], [14]. Time-based analysis offers clear advantages since it considers the growth cycles of sample crops, enables continuous classification etc., and generally produces better results, with reported F1 scores for crop/no-crop classification in a range from 0.80-0.93 [14]. One major drawback of time-based classification is the problem of missing data. In our test drive scenario, 70% of images are heavily obscured by clouds [9], a problem which removes a lot of the advantages of time-based classification and demands major compensation with missing-data imputation.

In this paper, we present a possible approach to land cover classification on a single-time image acquired using the PerceptiveSentinel platform (http://www.perceptivesentinel.eu/), using multiple classification methods for tulip field classification in Den Helder, Netherlands.

2. PERCEPTIVESENTINEL PLATFORM
2.1 Data
Data used in this article is provided by the ESA Sentinel-2 mission. The Sentinel-2 mission comprises a constellation of two polar-orbiting satellites placed in the same orbit, phased at 180° to each other [2]. Sentinel-2A was launched on 23rd June 2015, while the second satellite was launched on 7th March 2017. The revisit time for the equator is 10 days for each satellite, so since the launch of the second satellite, each data point is sampled at least every 5 days (a bit more frequently away from the equator). Each satellite collects data in 13 different wavelength bands, presented in Figure 1, with varying granularity.
Classification of crops, their location and potentially ods for tulip field classification in Den Helder, Netherlands. their yield prediction provide various interested parties with information on crop resistance, adapting to changes in tem- perature and water level changes. Along with direct help, 2. PERCEPTIVESENTINEL PLATFORM accurate crop classification tools can be used in a variety of 2.1 Data other programs, from government based subsidies to various Data used in this article is provided by ESA Sentinel-2 mis- insurance schemes. sion. The Sentinel-2 mission comprises a constellation of two polar-orbiting satellites placed in the same orbit, phased at Along with previously highly promising features of EO data, 180◦ to each other [2]. Sentinel-2A was launched on 23rd data acquisition and processing pose some specific challenges. June 2015, while the second satellite was launched on 7th Satellite acquired data is highly prone to missing data due March 2017. Revisit time for equator is 10 days for each to various reasons; mostly cloud coverage, (cloud) shadows, satellite, so since the launch of the second satellite, each atmospheric refraction due to changes in atmospheric con- data point is sampled at least every 5 days (a bit more fre- ditions. Additionally, correct training data, either for clas- quently when away from the equator). sification or regression, is hard to come by, must be rela- tively recent, and regularly updated due to changes in land Each satellite collects data in 13 different wavelength bands use. Furthermore, correct labels and crop values are almost presented in figure 1, with varying granularity. Data ob- impossible to verify and usually self-reported, which often tained for each pixel is firstly preprocessed by ESA where means that quality of training data is not perfect. Valero et al. [13] raise the problem of incorrect (or partially correct) 1http://www.perceptivesentinel.eu/ 37 atmospheric reflectance and earth surface shadows are cor- 3. METHODOLOGY rected [4]. 3.1 Sample Data For purpose of this article, a sample patch of fields in Den Helder, Netherlands, with coordinates: (4.7104, 52.8991), (4.7983, 52.9521) was analyzed. Three different datasets were considered: tulip fields in year 2016 and 2017 and arable land in 2017. For each of these datasets, the first ob- served date with no detected clouds was selected and binary classification (tulips vs no-tulips and arable vs non-arable land) was performed on the image from that date. The date selection was based on the fact that tulips’ blooms are most apparent during late April and beginning of May [9]. 3.2 Feature Vectors Three additional earth observation indices that were used as features are presented in Table 1 as suggested by [8]. Figure 1: Sentinel 2 spectral bands [12] Name Formula 2.2 Data Acquisition Satellites provide around 1TB of raw data per day, which B08 − B04 NDVI is provided for free on Amazon AWS. Images are then pro- B08 + B04 cessed and indexed by Sinergise and subsequently provided for free along with their SentinelHub [11] library. As part 2.5(B08 − B04) EVI of the PerceptiveSentinel project, a sample platform was de- (B08 + 6B04 − 7.5B02 + 1) veloped on top of SH library, which eases data acquisition, cloud detection and further data analysis on acquired data. B08 − B04 SAVI (1 + 0.5) B08 + B04 + 0.5 The whole dataset currently consists of images captured from the end of June 2015 till August 2018. Images are avail- able for use in a few hours after being taken. 
3.3 Experiment
The experiment was conducted in the Den Helder region to assess the effectiveness of the classification and the improvement with additional features. The same region is also used as a test drive location for the PerceptiveSentinel project. One important characteristic to keep in mind is the fact that the classification classes are not uniformly distributed. Tulip fields constitute 17.1% and 17.7% of all pixels in the years 2016 and 2017 respectively, while arable land accounts for 64% of pixels in the 2017 data set. Care must therefore be taken when assessing the predictive power of a model.

For each dataset, multiple classification algorithms were tested on the base band feature vectors and on feature vectors enriched with the calculated indices from Table 1. Experiments were carried out using the Python library scikit-learn, and default parameters were used for each type of classifier. For each data set and each classifier (Ada Boost, Logistic regression, Random Forest, Multilayer perceptron, Gradient Boosting, Nearest neighbors and Naive Bayes), 3-fold cross-validation was performed. Folds were generated on a non-shuffled dataset with balanced class ratios.
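A condensed sketch of this experimental loop with scikit-learn (per-pixel feature matrix X, binary labels y). The stratified, non-shuffled 3-fold splits and the metric set follow the description above, while the placeholder data and variable names are of course illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB

# Placeholder pixel table: rows are pixels, columns are bands (+ optional indices).
rng = np.random.default_rng(0)
X = rng.random((5000, 16))
y = rng.random(5000) < 0.17          # ~17% positive class, as for tulip fields

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Naive Bayes": GaussianNB(),
}

cv = StratifiedKFold(n_splits=3, shuffle=False)   # balanced, non-shuffled folds
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=cv,
                            scoring=["precision", "recall", "accuracy", "f1"])
    print(name,
          round(scores["test_precision"].mean(), 3),
          round(scores["test_recall"].mean(), 3),
          round(scores["test_accuracy"].mean(), 3),
          round(scores["test_f1"].mean(), 3))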
4. RESULTS
The results of the selected classifiers are presented in Tables 2-4 (the Ind column indicates whether the additional indices were used as features) and are comparable with results from related works [5], [6], which report accuracy results from 60-80%, although our experimental dataset was quite small and homogeneous, which might offer some advantage over larger plots of land.

Table 2: Tulip fields in 2016 results
Alg.                  Ind   Prec    Rec     Acc     F1      T
Logistic Regression   No    0.895   0.551   0.912   0.682   2.8
                      Yes   0.877   0.564   0.912   0.686   3.6
Decision Tree         No    0.640   0.697   0.881   0.667   7.9
                      Yes   0.629   0.698   0.878   0.662   11.3
Random Forest         No    0.870   0.675   0.927   0.760   15.0
                      Yes   0.867   0.680   0.927   0.762   21.7
ML Perceptron         No    0.875   0.720   0.935   0.790   184.2
                      Yes   0.835   0.740   0.931   0.784   241.3
Gradient Boosting     No    0.878   0.657   0.926   0.751   84.8
                      Yes   0.856   0.657   0.923   0.743   120.6
Naive Bayes           No    0.343   0.800   0.704   0.480   0.4
                      Yes   0.316   0.808   0.669   0.454   0.6

Table 3: Tulip fields in 2017 results
Alg.                  Ind   Prec    Rec     Acc     F1      T
Logistic Regression   No    0.514   0.561   0.829   0.537   2.8
                      Yes   0.545   0.615   0.841   0.578   4.0
Decision Tree         No    0.574   0.633   0.852   0.602   7.3
                      Yes   0.565   0.634   0.849   0.598   11.2
Random Forest         No    0.786   0.599   0.900   0.680   13.8
                      Yes   0.788   0.604   0.901   0.683   20.5
ML Perceptron         No    0.790   0.673   0.911   0.727   375.9
                      Yes   0.780   0.693   0.911   0.734   419.8
Gradient Boosting     No    0.786   0.613   0.902   0.689   84.4
                      Yes   0.785   0.614   0.902   0.689   120.3
Naive Bayes           No    0.330   0.861   0.666   0.477   0.4
                      Yes   0.318   0.858   0.649   0.464   0.6

Table 4: Arable land in 2017 results
Alg.                  Ind   Prec    Rec     Acc     F1      T
Logistic Regression   No    0.853   0.913   0.843   0.882   2.7
                      Yes   0.854   0.908   0.841   0.880   3.2
Decision Tree         No    0.878   0.868   0.837   0.873   9.6
                      Yes   0.885   0.868   0.842   0.876   14.5
Random Forest         No    0.928   0.889   0.884   0.908   17.3
                      Yes   0.934   0.891   0.889   0.912   26.3
ML Perceptron         No    0.929   0.932   0.911   0.931   522.4
                      Yes   0.926   0.940   0.913   0.933   586.2
Gradient Boosting     No    0.899   0.921   0.883   0.910   82.6
                      Yes   0.905   0.926   0.890   0.915   118.4
Naive Bayes           No    0.823   0.830   0.776   0.827   0.4
                      Yes   0.814   0.806   0.757   0.810   0.6

For each test, precision, recall, accuracy, and F1 score were reported, along with the timing of the whole process. As seen from the tables, the multilayer perceptron achieved the best results when comparing the F1 score across all data sets, but its training was considerably slower than all other classification methods (in fact, its training time was comparable to all other classification times combined). The multilayer perceptron was followed closely by random forest, which achieved just marginally worse results, but was far less expensive to train and evaluate, while still retaining a score that was higher than or comparable with related works.

Adding additional features to the feature vector did not significantly improve the classification score and in some cases even hampered performance, while having a significant impact on the training time.

Using a classifier trained on the 2016 tulip data to predict the data in 2017 yielded an F1 score of 0.57, while a classifier trained on the 2017 data yielded an F1 score of 0.67 on the 2016 data, indicating the robustness of the classifier.

A graphical representation of the classification errors can be seen in Figures 2 and 3, which show true positive (TP) pixels in purple color, false positive (FP) in blue color and false negative (FN) in red. Looking at the images, it can easily be seen that the classification produced quite satisfactory results.

Figure 2: Graphical representation of errors in ML perceptron classification of tulip fields in 2017 (TP in purple, FP in blue, FN in red)
A graphical representation of the classification errors can be seen in Figures 2 and 3, which show true positive (TP) pixels in purple, false positive (FP) pixels in blue and false negative (FN) pixels in red. Looking at the images, it can easily be seen that the classification produced quite satisfactory results.

Figure 2: Graphical representation of errors in ML perceptron classification of tulip fields in 2017 (TP in purple, FP in blue, FN in red)

An important thing to notice when inspecting Figure 2 is that the true positive pixels were classified in densely packed groups with clear and sharp edges, which correspond nicely to field boundaries seen with the naked eye (both random forest and gradient-boosted decision trees produced visually very similar results). This might suggest that the algorithms detected another culture similar to tulips and classified it as tulips (or, conversely, that the ground truth might not be that accurate). A lot of FN pixels can also be spotted on field boundaries, which may correspond to different quality or to the mixing of different plant cultures near field boundaries.

Similarly, observing the results of the arable land classification, one immediately notices small (false positive) blue patches and some red patches. Most notably, a long blue line is spotted in the left part of the image (near the sea), which may indicate some sort of wild culture near the sea that was not included in the original mask. Further manual observation of a misclassified red patch in the middle of arable land suggests that this field is barren (easily seen in Figure 2) and possibly wrongly classified as arable land in the training data.

Figure 3: Graphical representation of errors in ML perceptron classification of arable land in 2017 (TP in purple, FP in blue, FN in red)
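Error maps such as Figures 2 and 3 can be produced directly from a predicted mask and a reference mask. A minimal sketch with hypothetical input files, using the same TP/FP/FN colour coding as described above.

    import numpy as np
    import matplotlib.pyplot as plt

    # y_true, y_pred: binary masks of shape (H, W) for one class (e.g. tulip fields);
    # hypothetical inputs, reshaped from the per-pixel predictions.
    y_true = np.load("mask_true.npy").astype(bool)
    y_pred = np.load("mask_pred.npy").astype(bool)

    rgb = np.zeros(y_true.shape + (3,))
    rgb[y_true & y_pred] = (0.5, 0.0, 0.5)    # true positives in purple
    rgb[~y_true & y_pred] = (0.0, 0.0, 1.0)   # false positives in blue
    rgb[y_true & ~y_pred] = (1.0, 0.0, 0.0)   # false negatives in red

    plt.imshow(rgb)
    plt.axis("off")
    plt.show()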
5. CONCLUSIONS
In our work, we have examined the use of different classification methods and additional features for the land cover classification problem on a single image. Our results are comparable with results from the related literature. We propose that classification strength and adaptability be further improved by considering time series and stream aggregates for each pixel, as researched in [14], [7]. Additionally, pixels might be grouped together into logical objects to enable object (field) level classification, as proposed by [13].

Furthermore, the results have shown that a correct ground-truth mask is essential for good classification performance. As seen from our results, even seemingly correct labels might miss some cultures or classify empty stretches of land as crops.

6. ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the ICT program of the EC under the project PerceptiveSentinel (H2020-EO-776115). The authors would like to thank Sinergise for its contribution to the sentinelhub and cloudless libraries, along with all the help with data analysis.

7. REFERENCES
[1] Belgiu, M., and Csillik, O. Sentinel-2 cropland mapping using pixel-based and object-based time-weighted dynamic time warping analysis. Remote Sensing of Environment 204 (2018), 509-523.
[2] ESA. Satellite constellation / Sentinel-2 / Copernicus / Observing the Earth / Our Activities / ESA. https://www.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Sentinel-2/Satellite_constellation. Accessed 13 August 2018.
[3] ESA. Sentinel-2 - Missions - Resolution and Swath - Sentinel Handbook. https://sentinel.esa.int/web/sentinel/missions/sentinel-2/instrument-payload/resolution-and-swath. Accessed 13 August 2018.
[4] ESA. User Guides - Sentinel-2 MSI - Level-2 Processing - Sentinel Online. https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/processing-levels/level-2. Accessed 13 August 2018.
[5] Guida-Johnson, B., and Zuleta, G. A. Land-use land-cover change and ecosystem loss in the Espinal ecoregion, Argentina. Agriculture, Ecosystems & Environment 181 (2013), 31-40.
[6] Gutiérrez-Vélez, V. H., and DeFries, R. Annual multi-resolution detection of land cover conversion to oil palm in the Peruvian Amazon. Remote Sensing of Environment 129 (2013), 154-167.
[7] Gómez, C., White, J. C., and Wulder, M. A. Optical remotely sensed time series data for land cover classification: A review. ISPRS Journal of Photogrammetry and Remote Sensing 116 (2016), 55-72.
[8] Jiang, Z., Huete, A. R., Didan, K., and Miura, T. Development of a two-band enhanced vegetation index without a blue band. Remote Sensing of Environment 112, 10 (2008), 3833-3845.
[9] Kenda, K., Kažič, B., Čerin, M., Koprivec, F., Bogataj, M., and Mladenić, D. D4.1 Stream Learning Baseline Document. Reported 30 April 2018.
[10] Sinergise. sentinel-hub/sentinel2-cloud-detector: Sentinel Hub Cloud Detector for Sentinel-2 images in Python. https://github.com/sentinel-hub/sentinel2-cloud-detector. Accessed 14 August 2018.
[11] Sinergise. sentinel-hub/sentinelhub-py: Download and process satellite imagery in Python scripts using Sentinel Hub services. https://github.com/sentinel-hub/sentinelhub-py. Accessed 14 August 2018.
[12] Spaceflight 101. Sentinel-2 Spacecraft Overview. http://spaceflight101.com/copernicus/wp-content/uploads/sites/35/2015/09/8723482_orig-1024x538.jpg. Accessed 14 August 2018.
[13] Valero, S., Morin, D., Inglada, J., Sepulcre, G., Arias, M., Hagolle, O., Dedieu, G., Bontemps, S., Defourny, P., and Koetz, B. Production of a dynamic cropland mask by processing remote sensing image series at high temporal and spatial resolutions. Remote Sensing 8(1) (2016), 55.
[14] Waldner, F., Canto, G. S., and Defourny, P. Automated annual cropland mapping using knowledge-based temporal features. ISPRS Journal of Photogrammetry and Remote Sensing 110 (2015), 1-13.

Towards a semantic repository of data mining and machine learning datasets

Ana Kostovska (Jožef Stefan IPS & Jožef Stefan Institute, Ljubljana, Slovenia, ana.kostovska@ijs.si), Sašo Džeroski (Jožef Stefan Institute & Jožef Stefan IPS, Ljubljana, Slovenia, saso.dzeroski@ijs.si), Panče Panov (Jožef Stefan Institute & Jožef Stefan IPS, Ljubljana, Slovenia, pance.panov@ijs.si)

ABSTRACT
With the exponential growth of data in all areas of our lives, there is an increasing need to develop new approaches for effective data management. Namely, in the field of Data Mining (DM) and Knowledge Discovery in Databases (KDD), scientists often invest a lot of time and resources in collecting data that has already been acquired. In that context, by publishing open and FAIR (Findable, Accessible, Interoperable, Reusable) data, researchers could reuse data that was previously collected, preprocessed and stored. Motivated by this, we conducted an extensive review of current approaches, data repositories and semantic technologies used for annotation, storage and querying of datasets in the domain of machine learning (ML) and data mining. Finally, we identify the limitations of the existing repositories of datasets and propose a design of a semantic data repository that adheres to the FAIR principles for data management and stewardship.

The benefits of publishing FAIR data are manifold. It speeds up the process of knowledge discovery and reduces the consumption of resources. When the FAIR-compliant data at hand does not contain all the information needed, it can easily be integrated with data from external sources and boost the overall KDD performance [12].

Semantic data annotation, being a very powerful technique, is massively used in some domains, e.g., medicine, but it is still in the early phases in the domain of data mining and machine learning. To the best of our knowledge, there are no semantic data repositories that adhere to the FAIR principles. We recognize the ultimate benefits of having one, and we are going in depth into the research covering semantic data annotation, ontology usage, and the storing and querying of data.

2. BACKGROUND AND RELATED WORK
The Semantic Web (Web 3.0) is an extension of the World
1.
INTRODUCTION Wide Web in which information is given semantic meaning, One of the main use of data is in the process of knowledge enabling machines to process that information. The aim of discovery, where scientist employ ML and DM methods and the Semantic Web initiative is to enhance web resources with try to solve various real-life problems from diverse fields, highly structured metadata, known as semantic annotations. from systems biology and medicine, to ecology and enviro- When one resource is semantically annotated, it becomes a nmental sciences. In order to obtain their objectives, they source of information that is easy to interpret, combine and need high-quality data. The quality of the data is crucial to a reuse by the computers [13]. In order to achieve this, the DM project’s success. Ultimately, no level of algorithmic so- Semantic Web uses the concept of Linked Data. Linked data phistication can make up for low-quality data. On the other is build upon standard web technologies [7] including HTTP, hand, progress in science is best achieved by reproducing, RDF, RDFS, URIs, Ontologies, etc. reusing and improving someone else’s work. Unfortunately, datasets are not easily obtained, and even if they are, they For uniquely identifying resources across the whole Linked come with limited reusability and interoprability. Data, each resource is given a Unified Resource Iden- tifier (URI). The resources are then enriched with terms A key-aspect in advancing research is making data open from controlled vocabularies, taxonomies, thesauruses, and and FAIR. FAIR are four principles that have been recen- ontologies. The standard metadata model used for logical tly introduced to support and promote good data manage- organization of data is called Resource Description Fra- ment and stewardship [17]. Data must be easily findable mework (RDF). Its basic unit of information is the triplet (Findability) by both humans and machines. This me- compiled from a subject, a predicate, and an object. These ans data should be semantically annotated with rich meta- three components define the concepts and relations, the bu- data and all the resources must be uniquely identified. The ilding blocks of an ontology. metadata should always be accessible (Accessibility) by standardized communication protocols such as HTTP(S) or In the context of computer science, ontology is “an expli- FTP, even when the data itself is not. Data and metadata cit formal specifications of the concepts and relations among from different data sources can be automatically combined them that can exist in a given domain” [3]. As computational (Interoperabiliy). To do so, the benefits of formal voca- artifacts, they provide the basis for sharing meaning both bularies and ontologies should be exploited. Data and me- at machine and human level. When creating an ontology, tadata is released with provenance details and data usage there are multiple languages to choose from. RDF Shema licence, so that humans and machines know whether data (RDFS) is ontology language with small expressive power. can be replicated and reused or not (Reusabiliy). It provides mechanisms for creating simple taxonomies of 41 concepts and relations. Another commonly used ontology There are numerous repositories of ML datasets available language is the Web Ontology Language (OWL). OWL online. The UCI repository3 is the most popular reposi- supports creation of all ontology components: concepts, in- tory of ML datasets. 
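The subject-predicate-object triples and the dataset annotation described above can be made concrete with a small rdflib sketch; the repository namespace, the dataset URI and the choice of Dublin Core properties are assumptions for the example, not the proposed repository's actual schema.

    from rdflib import Graph, URIRef, Literal, Namespace
    from rdflib.namespace import DC, RDF

    REPO = Namespace("http://example.org/datasets/")   # hypothetical repository namespace

    g = Graph()
    dataset = REPO["iris"]

    # each statement is one (subject, predicate, object) triple
    g.add((dataset, RDF.type, URIRef("http://purl.org/dc/dcmitype/Dataset")))
    g.add((dataset, DC.title, Literal("Iris")))
    g.add((dataset, DC.creator, Literal("R. A. Fisher")))
    g.add((dataset, DC.rights, Literal("CC BY 4.0")))

    print(g.serialize(format="turtle"))

The serialized graph (here in Turtle) is what would later be loaded into a triplestore and exposed for querying.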
Each dataset is annotated with several stances, properties (or relations). Finally, SPARQL1 is descriptors such as dataset characteristics, attribute charac- standard, semantic query language used for querying fast- teristics, associated task, number of instances, number of growing private or public collections of structured data on attributes, missing values, area, etc. Similarly, Kaggle Da- the Web or data stored in RDF format. tasets4, Knowledge Extraction based on Evolutionary Le- arning (KEEL), and Penn Machine Learning Benchmarks There are diiferent technologies for storing data and meta- (PMLB)5 are well-known dataset repository that provide data. The most broadly used are relational databases, users with data querying based on the descriptors attached digital databases based on the relational model of data or- to the datasets. OpenML6 is an open source platform desi- ganized in tables, forming entity-relational model. Another gned with the purpose of easing the collaboration of resear- approach that became popular with the appearance of Big chers within the machine learning community [14]. Resear- Data are NoSQL databases [5], which are flexible databases chers can share datasets, workflows and experiments in such that do not use relational model. Triplestores are specific a way that they can be found and reused by others. When type of NoSQL databases, that store triples instead of rela- the data format of the datasets is supported by the platform, tional data. Triplestores use URIs and can be queried over the datasets are annotated with measurable characteristics trillions of records, which makes them very applicable. [15]. These annotations are saved as textual descriptors and are used for searching through the repository. Data in an information system can reside in different hete- rogeneous data sources, both internal and external to the In contrast to the above mentioned repositories, there are organization. In this setting, the relevant data from the frameworks in other domains that offer advanced techniques diverse sources should be integrated. Accessing disparate for describing, storing and querying datasets. One cutting- data sources has been a difficult challenge for data analysts edge framework in the domain of neuroscience is Neurosci- to achieve in modern information systems, and an active re- ence Information Framework(NIF) [4]. Its core objec- search area. OBDA [1, 11] is much longed-for method that tive is to create a semantic search engine that benefits from addresses this problem. It is a new paradigm, based on a semantic indexes when querying distributed resources by three-level architecture constituted of the ontology, the data keywords. The Gene Ontology Annotation (GOA), sources, and the mappings between the two (see Figure 1). is a database that provides high-quality annotations of ge- With this approach, OBDA provides data structure descrip- nome data [2]. The annotations are based on GO, a voca- tion, as well as semantic description of the concepts in the bulary that defines concepts related to gene functions and domain of interest and roles between them. relation among them. Large part of the annotations are ge- nerated electronically by converting existing knowledge from the data to GO terms. Electronic annotations are associated with high-level ontology terms. The process of generating more specific annotations can hardly be automated with the current technologies, therefore it is done manually. 3. CRITICAL ASSESSMENT Figure 1. 
The OBDA architecture In this section, we conduct critical assessment of the cur- rent research based on the review presented in the previous section. In the context of semantic ML data repository, we group ontologies in three categories, i.e., ontologies for describing Semantic Web technologies. The whole stack of seman- machine learning and data mining, ontologies for provenance tic technologies provide ways of making the content readable information, and domain ontologies. OntoDM ontology de- by machines. The metadata that describes the content can scribes the domain of data mining. It is composed of three be used not only to disregard useless information, but also sub-ontologies: OntoDT [10] - generic ontology for repre- for merging results to provide a more constructed answer. sentation of knowledge about datatypes; OntoDM-core [8] - A major drawback of this process of giving data a semantic ontology of core data mining entities (e.g., data, DM task, meaning is that it is time consuming and requires great amo- generalizations, DM algorithms, implementations of algori- unt of resources, thus people sometimes feel unmotivated to thms, DM software); OntoDM-KDD [9] - ontology for repre- do it. Another point to make is that semantic annotations senting the knowledge discovery process following CRISP- cannot solve the ambiguities of the real world. DM process model. The Data Mining OPtimization Ontology (DMOP) [6] has been designed to support au- Technologies for storing data and metadata. The tomation at various choice points of the DM process, i.e., data in relational databases is stored in a very structured choosing algorithms, models, parameters. The PROV On- way, making them a good choose for applications that relay tology (PROV-O)2 and Dublin Core vocabulary [16] 3 facilitate the discovery of electronic resources by providing a https://archive.ics.uci.edu/ml/ 4 base for describing provenance information about resources. https://www.kaggle.com/datasets 5https://github.com/EpistasisLab/ 1https://www.w3.org/TR/rdf-sparql-query/ penn-ml-benchmarks 2https://www.w3.org/TR/prov-o/ 6https://www.openml.org/ 42 on heavy data analysis. Moreover the referential integrity the approaches and technologies. Each of the proposed ar- guarantees that transactions are processed reliably. While chitectures has positive and negative sides, so there will be relational databases are a suitable choice for some applica- trade-off when choosing one. tions, they have difficulties dealing with large amounts of data. On the other hand, NoSQL databases were designed The common part of the three designs is that DM and ML primarily for big data and can be run on cluster architectu- datasets will be annotated through a semantic annotation res. Non-relational databases store unstructured data, with engine. The semantic query engine will receive SPARQL no logical schema. They are flexible, but this comes with query as input, and it will bring back results in form of set the price of potentially inconsistent data. of RDF triples. There will be SPARQL endpoint through which users can specify the query used as input in the se- Describing data and metadata. OntoDM is an ontology mantic query engine. Another open possibility is to enable that describes the domain of DM, ML and KDD with a great users to query data and metadata by simply writing key- level of detail. Because it covers a wide area, some parts words. Later, the system itself generates SPARQL query would be irrelevant for our application. 
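The semantic querying discussed above ultimately comes down to SPARQL over triples of this kind; a minimal self-contained sketch, in which the namespace, the use of dc:subject to hold the data mining task and the literal values are illustrative assumptions rather than the repository's annotation schema.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DC

    REPO = Namespace("http://example.org/datasets/")   # hypothetical namespace
    g = Graph()
    g.add((REPO["iris"], DC.title, Literal("Iris")))
    g.add((REPO["iris"], DC.subject, Literal("classification")))
    g.add((REPO["wine"], DC.title, Literal("Wine")))
    g.add((REPO["wine"], DC.subject, Literal("regression")))

    # find all datasets annotated with the data mining task "classification"
    query = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        SELECT ?dataset ?title WHERE {
            ?dataset dc:subject "classification" ;
                     dc:title   ?title .
        }
    """
    for row in g.query(query):
        print(row.dataset, row.title)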
DMOP is ontology based on those keywords. The anotation schema used by built with the special use case of optimizing the DM process. the semantic annotation engine will be based on three di- Nevertheless, both of them can be used for describing ML fferent types of ontologies such as ontologies for DM and and DM datasets. DC vocabulary and PROV-O define a ML (e.g., OntoDT, OntoDM-core, Onto-KDD, DMOP), do- wide range of provenance terms, therefore both of them can main ontologies, and ontologies and schemes for describing be employed in the provenance metadata generation. provenance information (e.g., Dublin Core ontology, PROV- O). Part of the annotations will be generated automatically, Repositories of machine learning datasets. The UCI e.g., annotations related to datatypes, while others will be repository offers a wide range of datasets, but they are not semi-automatically because they require concept mapping, available through a uniform format or API. Although it also e.g., annotations based on domain ontologies. provides data descriptors for searching the data, a major setback is that none of the descriptors is based on any vo- We plan to build a web-based user interface that will enable cabulary or ontology, which certainly limits interoperabi- users to search and query both datasets and metadata anno- lity. Kaggle Datasets, KEEL, PMLB also provide similar tations. Users will be given a chance of uploading new data- meta annotations, but they all lack semantic interpretabi- sets in CSV or ARFF fromat. Besides the dataset, users will lity. Another shortcoming of the UCI repository, KEEL and be expected to specify some additional information about it PMLB is that they don’t allow uploading new datasets. All such as data mining task they plan to execute on the data, datasets stored in the OpenML repository can be downlo- domain, provenance information, descriptions of the attri- aded in CSV or ARFF format. The annotation are based butes, etc. Since the whole process of semantic annotation on Exposé ontology, and they can be downloaded in JSON, can’t be automatic, when new dataset is uploaded, it won’t XML or RDF format. A major weakness of this repository be immediately available on the site. First it must be cura- is that annotations are not stored, but they are calculated ted, and only when the complete set of metadata annotati- on-the-fly and can not be used for semantic inference. ons is generated, the metadata will be published online. The dataset itself will be released under clear data usage licence. Frameworks for describing, storing and querying do- main datasets. The NIF framework is very progressive in The three architectural designs differ in the way of storing terms of semantic annotation, storing, and querying. Its ad- the datasets. The metadata annotations will be RDF tri- vantages come from providing domain experts with the abi- plets and they will be stored in triplestore that optimizes lity to contribute to the ontology development, by adding physical storage. Next, we briefly explain the differences new terms through the use of Interlex. It has a powerful between storing the datasets and what are the effects on search engine, and it follows the OBDA paradigm. Hetero- querying. geneous data is stored in its original format. The user defi- ned, keyword query is mapped to ontological terms to find Proposal I. 
The simplest approach of storing a dataset synonyms, and then translated to a query relevant to the in- would be to store it in RDF format in the same triplestore dividual data store. With respect to the genomics domain, as the metadata. The datasets from their original format, GOA database is favourable because of its high-quality an- will be converted to RDF triples. Having only one triplestore notations. Curators put extreme efforts in generating ma- will ease querying, but it will require more storage capacity nual annotations. To speed up the query execution it uses (see Figure 2). the Solr document store. Another superiority of GOA data- Proposal II. The second option is to store the datasets in base is that it provides advanced filtering of the annotations, a relational database and the metadata in RDF triplestore. for downloading customized annotation sets. The deficiency Datasets from CSV or ARFF format will be translated into of NIF and GOA database is that they are not able to query a relational database. Here, querying becomes more compli- and access the annotations in RDF format, which is an emer- cated, for which we will need a federated query engine. A ging standard for representing semantic information federated query engine allows simultaneous search on multi- ple data sources. A user makes a single query request, which 4. PROPOSAL FOR SEMANTIC is distributed across the management systems participating REPOSITORY OF DM/ML DATASETS in the federation and translated to a query written in a lan- guage relevant to the individual system. We will have two In this section, we propose three possible architecture desi- data stores, one for the data itself and one for the metadata. gns of the semantic data repository for the domain of ML For querying the two data stores, we will still use the same and DM. The proposals are based on the critical review of 43 querying of ML and DM datasets. We also examined speci- fic implementations of frameworks in the domain of neuro- science and genomics. Taking into consideration the critical assessment of the current state-of-the-art we will construct semantic data repository for ML and DM datasets. The semantic repository would be utilized for easy access of se- mantically rich annotated datasets and semantic inference. This, will improve the reproducibility and reusability in ML and DM research area. Moreover, annotating the datasets with domain ontologies will facilitate the process of under- standing the analyzed data. As of now, we have three pro- posed architectural designs for the semantic data repository Figure 2. Architectural design I that differ in the way of storing the datasets. We will either store both data and metadata in a triplestore, or we will have multiple data stores which will require usage of tools RDF query language, SPARQL. In order to query the rela- and methods from the ontology based data access paradigm. tional database with SPARQL, it will be mapped to virtual RDF graph (see Figure 3). Acknowledgements The authors would like to acknowledge the support of the Slovenian Research Agency through the projects J2-9230, N2-0056 and L2-7509 and the Public Scholarship, Development, Disability and Maintenance Fund of the Republic of Slovenia through its scholarship program. 6. REFERENCES [1] Mihaela A Bornea et al. Building an efficient rdf store over a relational database. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 121–132. ACM, 2013. [2] Gene Ontology Consortium. 
Gene ontology consortium: going forward. Nucleic acids research, 43(D1):D1049–D1056, 2014. [3] Thomas R Gruber. Toward principles for the design of ontologies used for knowledge sharing? International journal of human-computer studies, 43(5-6):907–928, 1995. [4] Amarnath Gupta et al. Federated access to heterogeneous information resources in the neuroscience information framework (nif). Neuroinformatics, 6(3):205–217, 2008. [5] Jing Han et al. Survey on nosql database. In Pervasive Figure 3. Architectural design II computing and applications (ICPCA), 2011 6th international conference on, pages 363–366. IEEE, 2011. [6] C Maria Keet et al. The data mining optimization ontology. Web Semantics: Science, Services and Agents on the World Proposal III. Instead of mapping the relational database Wide Web, 32:43–53, 2015. to virtual RDF graph, we can use the OBDA methodology [7] Brian Matthews. Semantic web technologies. E-learning, 6(6):8, 2005. and federated querying to use a combination of SQL que- [8] Panče et al. Panov. Ontology of core data mining entities. Data ries and SPARQL queries. Metadata will be queried with Mining and Knowledge Discovery, 28(5-6):1222–1265, 2014. SPARQL queries, but for the datasets, they will be mapped [9] Panče Panov et al. Ontodm-kdd: ontology for representing the to SQL queries. The integrated results are brought back to knowledge discovery process. In International Conference on Discovery Science, pages 126–140. Springer, 2013. the user (see Figure 4). [10] Panče Panov et al. Generic ontology of datatypes. Information Sciences, 329:900–920, 2016. [11] Antonella Poggi et al. Linking data to ontologies. In Journal on data semantics X, pages 133–173. Springer, 2008. [12] Petar Ristoski and Heiko Paulheim. Semantic web in data mining and knowledge discovery: A comprehensive survey. Web semantics: science, services and agents on the World Wide Web, 36:1–22, 2016. [13] Gerd Stumme et al. Semantic web mining: State of the art and future directions. Web semantics: Science, services and agents on the world wide web, 4(2):124–143, 2006. [14] Jan N Van Rijn et al. Openml: A collaborative science platform. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 645–649. Springer, 2013. [15] Joaquin Vanschoren et al. Taking machine learning research online with openml. In Proceedings of the 4th International Conference on Big Data, Streams and Heterogeneous Source Figure 4. Architectural design III Mining, pages 1–4. JMLR. org, 2015. [16] Stuart Weibel. The dublin core: a simple content description model for electronic resources. Bulletin of the Association for 5. CONCLUSION Information Science and Technology, 24(1):9–11, 1997. [17] Mark D Wilkinson et al. The fair guiding principles for scientific We have conducted a literature overview of research be- data management and stewardship. Scientific data, 3, 2016. 
ing done in the field of semantic annotation, storage, and 44 Towards a semantic store of data mining models and experiments Ilin Tolovski Sašo Džeroski Panče Panov Jožef Stefan International Jožef Stefan Institute & Jožef Jožef Stefan Institute & Jožef Postgraduate School & Jožef Stefan International Stefan International Stefan Institute Postgraduate School Postgraduate School Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia ilin.tolovski@ijs.si saso.dzeroski@ijs.si pance.panov@ijs.si ABSTRACT sible, Interoperable, Reusable) data principles, introduced Semantic annotation provides machine readable structure to by Wilkinson et al. [9]. Implementing these principles for the stored data. We can use this structure to perform seman- the annotation, storing, and querying of data mining models tic querying, based on explicitly and implicitly derived infor- and experiments will provide a solid ground for researchers mation. In this paper, we focus on the approaches in seman- interested in reproducing and reusing the results from the tic annotation, storage and querying in the context of data previous research on which they can build and improve. mining models and experiments. Having semantically anno- tated data mining models and experiments with terms from In the literature, there exist some approaches that address domain ontologies and vocabularies will enable researchers some of these problems. In both ontology engineering and to verify, reproduce, and reuse the produced artefacts and data mining community, there are approaches that aim to- with that improve the current research. Here, we first pro- wards describing the data mining domain, as described in vide an overview of state-of-the-art approaches in the area of Section 2. Furthermore, Vanschoren et al. [5] developed the semantic web, data mining domain ontologies and vocabu- OpenML system, a machine learning experiment database laries, experiment databases, representation of data mining for storing various segments of a machine learning experi- models and experiments, and annotation frameworks. Next, ment such as datasets, flows (algorithms), runs, and com- we critically discuss the presented state-of-the-art. Further- pleted tasks. more, we sketch our proposal for an ontology-based system for semantic annotation, storage, and querying of data min- In other domains, such as life sciences, storing annotated ing models and experiments. Finally, we conclude the paper data about experiments and their results is a common prac- with a summary and future work. tice. This is mostly due to the fact that the experiments are more expensive to conduct, and require specific prepara- tions. From the perspective of annotation frameworks, there 1. INTRODUCTION are significant advances in these domains, such as The Cen- Storing big amounts of data from a specific domain comes in ter for Expanded Data Annotation and Retrieval (CEDAR) hand with several challenges, one of them being to seman- workbench [8] , and the OpenTox framework [11]. tically represent and describe the stored data. Semantic representation enables us to infer new knowledge based on This paper is organized as follows. First, we make an overview the one that we assert, i.e. the description and annotation of the state-of-the-art approaches in annotating, storing, and of the data. This can be done by providing semantic annota- querying of models and experiments. 
Next, we critically as- tions of the data with terms originating from a vocabulary or sess these approaches and sketch a proposal for a system for ontology describing the domain at hand. In computer and annotating, storing and querying data mining models and information science, ontology is a technical term denoting experiments. Finally, we provide a summary and discuss an artifact that is designed for a purpose, which is to en- the possible approaches for further work. able the modeling of knowledge about some domain, real or imagined [15]. Ontologies provide more detailed description of a domain, first by organizing the classes into a taxonomy, 2. BACKGROUND AND RELATED WORK and further on by defining relations between classes. With The state-of-the-art in semantic annotation of data min- semantic annotation we attach meaning to the data, we can ing models and experiments provides very diverse research, infer new knowledge, and perform queries on the data. ranging from domain-specific data mining ontologies, exper- iment databases, to new languages for deploying annotations Data mining and machine learning experiments are con- in unified format. Here, we provide an introduction to the ducted with faster pace than ever before, in various settings state-of-the-art in semantic web, ontologies and vocabular- and domains. In the usual practice of conducting data min- ies, representations of data mining models and experiments, ing experiments, almost none of the settings are recorded, experiment databases, and annotation frameworks. nor the produced models are stored. These predicaments make for a research that is hard to verify, reproduce and up- Semantic technologies. The Semantic Web is defined grade. This is also in line with the FAIR (Findable, Acces- as an extension of the current web in which information is 45 given well-defined meaning, enabling computers and people for (semi) automatically or manually annotating data, there to work in cooperation [14]. The stack of technologies con- are several solutions that exist outside of the data min- sists of multiple layers, however, in this paper we will focus ing domain, which provide innovative approaches and good on the ones essential for our scope of research. Resource foundation for development in the direction of creating a Description Framework (RDF) represents a metadata data software to enable ontology-based semantic annotation of model for the Semantic Web, where the core unit of informa- models and experiments, their storage and querying. The tion is presented as a triple. A triple describes the subject by CEDAR Workbench [13] provides an intuitive interface for its relationship, which is what the predicate resembles, with creating templates and metadata annotation with concepts the object. RDF files are stored in triple store (typically or- defined in the ontologies available at BioPortal4. On the ganized as relational or NoSQL databases [12]), on which we other hand, OpenTox [11] represents domain specific frame- can perform semantic queries, by using querying languages work that provides unified representation of the predictive such as SPARQL. Finally, ontology languages, such as Re- modelling in the domain of toxicology. source Description Framework Schema (RDFS) and Ontol- ogy Web Language (OWL), are formal languages used to 3. CRITICAL ASSESSMENT construct ontologies. 
RDFS provides the basis for all ontol- ogy languages, defining basic constructs and relations, while In this section, we will critically assess the presented state- OWL is far more expressive enabling us to define classes, of-the-art in Section 2 in the context of semantic annota- properties, and instances. tion, storage and querying of data mining models and ex- periments. Ontologies & vocabularies. Currently, there are several ontologies that describe the data mining domain. These The state-of-the-art in ontology design for data mining pro- include the OntoDM ontology [16], DMOP ontology [7], Ex- vides well documented research with various ontologies that pose [4], KDDOnto [1], and KD ontology [10]. MEX [2] is an thoroughly describe the domain from different aspects and interoperable vocabulary for annotating data mining mod- can be used in various applications. OntoDM provides uni- els and experiments with metadata. In addition there have fied framework of top level data mining entities. Building been developments in formalism for representing scientific on this, it describes the domain in great detail, containing experiments in general, such as the EXPO ontology [6]. definitions for each part of the data mining process. Because of the wide reach, it lacks a particular use case scenario. On Representation of models. With the constant devel- the other hand, this same property makes this ontology suit- opment of new environments for developing data mining able for wide range of applications where there is a need of software, it is necessary to have a unified representation describing a part of the domain. of the constructed data mining models and the conducted experiments. The first open standard was the Predictive Ontologies like EXPO and Exposé have a essential meaning Model Markup Language (PMML). For a period of time it in the research since the first one describes a very wide and provided transparent and intuitive representation of data important interdisciplinary domain, while the latter uses it mining models and experiments. However, due to the as a base for defining a specific sub-domain. DMOP ontol- fast growth in the development of new data mining meth- ogy describes the process of algorithm and model selection in ods, PMML was unable to follow the pace and extend its the context of semantic meta mining. Both the KD ontology more and more complicated specification. Its successor, the and KDDOnto describe the knowledge discovery process in Portable Format for Analytics (PFA), was developed having the context of constructing knowledge discovery workflows. the PMML’s drawbacks as guidelines for improvement. They differ mainly in the key concepts on which they were built. At the same time, the MEX vocabulary provides a Experiment and model databases. Storing already con- lightweight framework for automating the metadata gener- ducted experiments in a well structured and transparent ation. Since it is tied with Java environment, it provides manner is essential for researchers to have available, veri- a library which only uses the MEX API and can also be fiable, and reproducible results. An experiment database is implemented in other programming languages. designed to store large number of experiments, with detailed information on their environmental setup, the datasets, algo- All in all, the current state of the art in ontologies for data rithms and their parameter settings, evaluation procedure, mining provides a good foundation for development of ap- and the obtained results [3]. 
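The information an experiment database is described above as holding (environmental setup, dataset, algorithm and parameter settings, evaluation procedure and the obtained results) can be captured in a simple machine-readable record; in this sketch every field name and value is a placeholder rather than a standardised schema, and the metric values are illustrative.

    import json
    from datetime import datetime

    # Illustrative structure of a single experiment-database entry.
    run = {
        "timestamp": datetime.utcnow().isoformat(),
        "environment": {"python": "3.6", "scikit-learn": "0.19.2"},
        "dataset": {"name": "iris", "source": "UCI repository"},
        "algorithm": {"name": "RandomForestClassifier",
                      "parameters": {"n_estimators": 100, "max_depth": None}},
        "evaluation": {"procedure": "10-fold cross-validation",
                       "results": {"accuracy": 0.95, "f1_macro": 0.95}},   # placeholder numbers
    }

    with open("run-0001.json", "w") as f:
        json.dump(run, f, indent=2)

A record like this is the kind of raw material that a semantic annotation engine could later lift into RDF with ontology terms.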
The state-of-the-art in storing plications which will be based on one or several of these setups and results is abundant with approaches and solu- ontologies. Given the wide of coverage they can be easily be tions in different domains. For example, OpenML1 is the combined in a manner to suit the application at hand. biggest machine learning repository of data mining datasets, tasks, flows, and runs, the BioModels2 repository stores In the area of descriptive languages for data mining models more than 8000 experiments and models from the domains and experiments, one can see the path of progress in re- of systems biology, and ModelDB3 is an online repository search. PMML was the first, ground-breaking, XML-based for storing computational neuroscience models. descriptive language. However, with the expansion of the data mining domain, several weaknesses of PMML emerged. Annotation frameworks. When it comes to frameworks The language was not extensible, users could not create chains of models, and it was not compatible with the dis- tributed data processing platforms. Therefore, the same 1https://www.openml.org/ community started working on a new, extensible, portable 2http://www.ebi.ac.uk/biomodels/ 3https://senselab.med.yale.edu/modeldb/ 4https://bioportal.bioontology.org/ 46 language. Since its inception, the PFA format was intended need to have complete information about the conditions in to fill the small gaps that PMML had. Made up of analytic which that experiment was conducted. Namely, we need to primitives, implemented in Python and Scala, it provides the have an annotated dataset, annotation of the algorithm and users with more customizable framework, where they can its parameter settings for the specific run of the experiment. create custom models, model chains, and implement them Since one experiment usually consists of multiple algorithm in a parallelized setting. runs we annotate each run separately, as well as each of the results from each of them. For annotating the results, we use Storing and annotating experiments is of great significance the definitions of the performance metrics formalized in the in multiple scenarios. First, in domains where conducting data mining ontologies. A sketched example of the proposed the experiment is not a trivial task, i.e. the physical or solution is shown in Figure 1. financial conditions challenge the process, there needs to be a database where the setup and the findings of the experiment The proposed system for ontology-based annotation, stor- will be saved. For example, in BioModels.net we have two age, and querying of data mining experiments and models groups of experiments: Manually curated with structured will consist of several components. The users will interact metadata, and experiments without structure. The main with the system through an user interface enabling them drawback with this type of storage is the need for manual to run experiments on a data mining software, which will curation of the metadata. It is repetitive, time consuming export models and experiment setups to a semantic anno- task for which there is a strong need to be automated. tation engine. For example, for testing purposes we plan to use CLUS5 software for predictive clustering and structured In the domain of neuroscience, ModelDB provides an online output prediction, which generates different types of models service for storing and searching computational neuroscience and addresses different data mining tasks. models. 
In this database, alongside the files that constitute the models, researchers also need to upload the code that In the semantic annotation engine, the data mining mod- defines the complete specification of the attributes of the els and experiments will be annotated with terms from the biological system represented in the model, together with extended OntoDM ontology and then stored in a database. files that describe the purpose and application of the model. Once stored, the users will be able to semantically query Therefore, researchers can search the database for models the models and experiments in order to infer new knowl- with specific applications describing biological systems. edge. This will be done through a querying engine based on the SPARQL language, accessible through a user interface. OpenML provides a good framework for storing and anno- tating data mining datasets, experimental setups and runs, In order to perform annotation, we will extend the exist- as well as algorithms. One particular drawback of OpenML ing OntoDM ontology by adding a number of new terms, is that it does not store the actual models that are produced linking it to other domain ontologies, such as Exposé and from each experimental run, and one can not query the mod- EXPO. Linking OntoDM to these ontologies will extend the els. Furthermore, it’s founded on relational-database which domain of OntoDM towards connecting the data mining en- can not provide execution of semantic queries. tities that it already covers with new entities that describe the experimental setup and principles. With this we will All in all, these three examples show significant advances in obtain a schema for annotation of data mining models and storing and annotating models and experiments. However, experiments. The schema will then be used to annotate the there is also a significant room for improvement in the di- data mining models and experiments through a semantic an- rection of storing the models and experiments into NoSQL notation engine. The engine will have to read the models databases that are better suited for this task. and experiments from a data mining software system, anno- tate them with terms from developed schema and produce Finally, in the context of annotation tools the CEDAR Work- an RDF representation of the annotated data. bench and the OpenTox Framework provide a good insight in annotation frameworks. CEDAR enables the user to ex- Furthermore, the RDF graphs will be stored in a triple store ecute the annotations in modular manner by creating tem- database. Since the data mining models and experiments plates and adding elements to them. After curating the differ a lot in their structure, we have yet to decide on the annotations, they can export the schemas either in JSON, type of database in which we will store them. The data JSON-LD, or RDF file. OpenTox [11] is also based on on- stored in this way is set for performing semantic queries tology terms and represents a complete framework that de- on top of it. Therefore, we will develop a SPARQL-based scribes the predictive process in toxicology, starting with querying enigne so the users can perform predefined or cus- toxicity structures and ending with the predictive modelling. tom semantic queries on top of the storage base. Finally, the format of the results is another point where we 4. 
A PROPOSAL FOR SEMANTIC STORE need to decide whether the results will be presented as RDF OF MODELS AND EXPERIMENTS graphs, or in a different format (such as JSON) that is easier to interpret. This software package along with the storage After analysing the previous and current research, we can will then be added as a module to the CLUS software, de- conclude that despite the great achievements, there is a wide veloped at the Department of Knowledge Technologies. area for improvement in which we will contribute in the up- coming period by developing an ontology-based framework for storage and annotation of data mining models and exper- iments. In order to annotate a data mining experiment, we 5http://sourceforge.net/projects/clus 47 Annotation Schema Semantically RDF Triples Domain Extends Annotated OntoDM Storage of DM Ontology 2 Experiment Ontology experiments SPARQL Query SPARQL Query Domain Extends Storage of Querying Ontology 1 DM models RDF Triples Engine Semantically Experiment Annotated Semantic Model Annotation User defined query Model Engine Runs experiments Results Data Mining User interface Software Figure 1. Schema of the proposed solution 5. CONCLUSION & FURTHER WORK and the Public Scholarship, Development, Disability and Maintenance In this paper, we presented the state-of-the-art in annota- Fund of the Republic of Slovenia through its scholarship program. tion, storage and querying in the light of designing a se- mantic store of data mining models and experiments. We 6. REFERENCES first gave an overview of semantic web technologies, such as [1] Claudia Diamantini et al. KDDOnto: An ontology for discovery RDF, SPARQL, RDFS, and OWL that provide a complete and composition of kdd algorithms. Towards Service-Oriented Knowledge Discovery (SoKD’09), pages 13–24, 2009. foundation for annotation and querying of data. [2] Diego Esteves et al. MEX Vocabulary: a lightweight interchange format for machine learning experiments. In Furthermore, we critically reviewed the state-of-the-art on- Proceedings of the 11th International Conference on tologies and vocabularies for describing the domain of data Semantic Systems, pages 169–176. ACM, 2015. [3] Hendrick Blockheel et al. Experiment databases: Towards an mining provide detailed description of the domain of data improved experimental methodology in machine learning. In mining and machine learning (OntoDM, Expose, KD On- European Conference on Principles of Data Mining and tology, DMOP and KDDOnto, MEX). Next, we focused on Knowledge Discovery, pages 6–17. Springer, 2007. experiment databases as repositories where the experiment [4] Joaqin Vanschoren et al. Exposé: An ontology for data mining experiments. In Towards service-oriented knowledge discovery datasets, setups, algorithm parameter settings, and the re- (SoKD-2010), pages 31–46, 2010. sults are available for the performed experiments in various [5] Joaqin Vanschoren et al. Taking machine learning research domains. Furthermore, we saw that annotation frameworks online with OpenML. In Proceedings of the 4th International provide environments for (semi) automatically or manually Conference on Big Data, Streams and Heterogeneous Source Mining, pages 1–4. JMLR. org, 2015. annotating data, by discussing two frameworks from the do- [6] Larisa N Soldatova et al. An ontology of scientific experiments. mains of biomedicine and toxicology in order to analyze best Journal of the Royal Society Interface, 3(11):795–803, 2006. practices present in those domains. [7] Maria C Keet et al. 
The Data Mining OPtimization Ontology. Web Semantics: Science, Services and Agents on the World Wide Web, 32:43–53, 2015. Finally, given the performed analysis of the state-of-the-art, [8] Mark A Musen et al. The Center for Expanded Data we outlined our proposal for an ontology-based framework Annotation and Retrieval. Journal of the American Medical for annotation, storage, and querying of data mining mod- Informatics Association, 22:1148–1152, 2015. els and experiments. The proposed framework consists of an [9] Mark D. Wilkinson et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, annotation schema, a semantic annotation engine, and stor- 3, 2016. age for data mining models and experiments with a querying [10] Monika Záková et al. Automating knowledge discovery engine, all of which will be controlled from an user interface. workflow composition through ontology-based planning. IEEE Transactions on Automation Science and Engineering, It will allow users to semantically query their data mining 8:253–264, 2011. models and experiments in order to infer new knowledge. [11] Olga Tcheremenskaia et al. OpenTox predictive toxicology framework: toxicological ontology and semantic media In the future, we plan to adapt this framework for the needs wiki-based openToxipedia. In Journal of biomedical semantics, page S7, 2012. of research groups or companies that conduct high volume of [12] Olivier Curé et al. RDF database systems: triples storage and data mining experiments, enabling them to obtain a queryable SPARQL query processing. Morgan Kaufmann, 2014. knowledge base consisting of annotated metadadata for all [13] Rafael S Gonçalves et al. The CEDAR Workbench: An experiments and produced models. This will enable them Ontology-Assisted Environment for Authoring Metadata that Describe Scientific Experiments. In International Semantic to reuse existing models on new data for testing purposes, Web Conference, pages 103–110. Springer, 2017. infer knowledge based on past experimental results, all while [14] Tim Berners-Lee et al. The semantic web. Scientific American, saving time and computational resources. 284:34–43, 2001. [15] Tom Gruber. Ontology. Encyclopedia of database systems, pages 1963–1965, 2009. Acknowledgements [16] Panče Panov. A Modular Ontology of Data Mining. PhD The authors would like to acknowledge the support of the Slovenian thesis, Jožef Stefan IPS, Ljubljana, Slovenia, 2012. Research Agency through the projects J2-9230, N2-0056 and L2-7509 48 A Graph-based prediction model with applications [Extended Abstract] ∗ András London József Németh Miklós Krész University of Szeged, Institute University of Szeged, Institute InnoRenew CoE of Informatics of Informatics University of Primorska, IAM Poznan University of University of Szeged, Institute Economics, Department of of Applied Sciences Operations Research ABSTRACT and later it appeared in many areas from social network We present a new model for probabilistic forecasting using analysis to optimization in technical networks (e.g. road graph-based rating method. We provide a “forward-looking” and electric networks) [16]. type graph-based approach and apply it to predict football game outcomes by simply using the historical game results Making predictions in general, and especially in sports as data of the investigated competition. The assumption of our well, is a difficult task. 
The predictions generally appear in model is that the rating of the teams after a game day cor- the form of betting odds, that, in the case of “fixed odds”, rectly reflects the actual relative performance of them. We provide a fairly acceptable source of expert’s predictions re- consider that the smaller the changing of the rating vector – garding sport games outcomes [21]. Thanks to the increasing contains the ratings of each team – after a certain outcome quantity of available data the statistical ranking, rating and in an upcoming single game, the higher the probability of prediction methods have become more dominant in sports that outcome. Performing experiments on European foot- in the last decade. A key question is that how accurate ball championships data, we can observe that the model per- these evaluations are, more concretely, the outcomes of the forms well in general and outperforms some of the advanced upcoming games how accurately can be predicted based on versions of the widely-used Bradley-Terry model in many the statistics, ratings and forecasting models in hand. cases in terms of predictive accuracy. Although the appli- cation we present here is special, we note that our method Statistics-based forecasting models are used to predict the can be applied to forecast general graph processes. outcome of games based on some relevant information of the competing teams and/or players of the teams. A detailed Categories and Subject Descriptors survey of the scientific literature of rating and forecasting I.6 [Simulation and Modeling]: Applications; I.2 [Artificial methods in sports is beyond the scope of this paper, we Intelligence]: Learning refer only some important and recent results in the topic. For some papers with detailed literature overview and sport applications of the the celebrated Bradley-Terry model [3], 1. INTRODUCTION see e.g. [5, 7, 24]). Other popular approach is the Poisson The problem of assigning scores to a set of individuals based goal-distribution based analysis. For some references, see on their pairwise comparisons appears in many areas and ac- for instance [10, 15, 20]. In these models the goals scored tivities. For example in sports, players or teams are ranked by the playing teams follow a Poisson distribution with pa- according to the outcomes of games that they played; the rameter that is a function of attack and defense “rate” of impact of scientific publications can be measured using the the respective teams. A large family of prediction models relations among their citations. Web search engines rank only consider the game results win, loss (and tie) and usu- websites based on their hyperlink structure. The centrality ally uses some probit regression model, for instance [11] and of individuals in social systems can also be evaluated accord- [13]. More recently, well-known data mining techniques, like ing to their social relations. Ranking of individuals based artificial neural networks, decision trees and support vector on the underlying graph that models their bilateral relations machines have also become very popular; some references - has become the central ingredient of Google’s search engine without being exhaustive - see e.g [8, 9, 14, 18].Based on ∗Corresponding author, email: london@inf.u-szeged.hu the huge literature it can be concluded that the prediction accuracy strongly depends on the investigated sport and the feature set of the machine learning algorithms used. 
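As an illustration of the Poisson goal-distribution approach from the related work cited above (and not of the model proposed in this paper), the outcome probabilities of a single game can be computed from assumed scoring rates; the rates below are placeholders, whereas the cited models estimate attack and defense rates from historical data.

    import numpy as np
    from scipy.stats import poisson

    # Illustrative expected goal counts for the home and the away team.
    lambda_home, lambda_away = 1.6, 1.1
    max_goals = 10

    # P(home scores x) * P(away scores y) for all score lines up to max_goals
    p = np.outer(poisson.pmf(np.arange(max_goals + 1), lambda_home),
                 poisson.pmf(np.arange(max_goals + 1), lambda_away))

    print("home win:", round(np.tril(p, -1).sum(), 3))   # x > y
    print("draw:    ", round(np.trace(p), 3))
    print("away win:", round(np.triu(p, 1).sum(), 3))    # x < y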
A notable part of prediction models based on the historical data of game results uses the methodology of ranking and rating. Some recent articles on the topic are e.g. [2, 6, 12, 17, 23]. Specifically highlighting [2], the authors analyzed the predictive power of eight sports ranking methods using only win-loss and score-difference data of American major sports. They found that the least-squares and random-walker methods have significantly better predictive accuracy than the other methods. Moreover, methods utilizing score-differential data are usually more predictive than those using only win-loss data.

In contrast to those techniques, which use the actual respective strengths of the two competing teams, we provide a graph-based, "forward-looking" type of approach. The assumption of our model is that if a rating of the teams after a game day correctly reflects their actual relative performance, then the smaller the change in that rating after a certain result occurs (in an upcoming single game), the higher the probability of that outcome occurring.

The structure of this paper is as follows. After presenting the classical approaches ("Betting Odds" and "The Bradley-Terry Model"), our new model is introduced. Then in Sec. 3 we present our preliminary experimental results, and finally in Sec. 4 we conclude and discuss some possible research directions.

2. MODELS
Let V = (1, . . . , n) be the set of n teams (or players) and let R be the number of game days in a competition among the teams in V. A rating is a function φ_r : V → R that assigns a score to each team after each game day r (r = 1, . . . , R).

2.1 Betting Odds
... if j wins, then the bettor loses his $1. We can calculate the probabilities of the respective events as

    Pr(i beats j) = (1/odds(i)) / (1/odds(i) + 1/odds(j))

and

    Pr(j beats i) = (1/odds(j)) / (1/odds(i) + 1/odds(j)).

We should note here that the odds provided by betting agencies do not represent the true chances (as imagined by the bookmaker) that the event will or will not occur, but are the amount that the bookmaker will pay out on a winning bet. The odds include a profit margin, meaning that the payout to a successful bettor is less than that represented by the true chance of the event occurring. This means mathematically that 1/odds(i) + 1/odds(j) is more than one. This profit expected by the agency is known as the "over-round on the book".

2.2 The Bradley-Terry Model
The Bradley-Terry model [3] is a widely used method to assign probabilities to the possible outcomes when a set of n individuals are repeatedly compared with each other in pairs. For two elements i and j, the probability that i beats j is defined as

    Pr(i beats j) = π_i / (π_i + π_j),

where π_i > 0 is a parameter associated to each individual i = 1, . . . , n, representing its overall skill, or "intrinsic strength". Equivalently, π_i/π_j represents the odds in favor of i beating j, therefore this is a "proportional-odds model". Suppose that i and j played N_ij games against each other, with i winning W_ij of them, and all games are considered to be independent. The likelihood is given by

    L(π_1, . . . , π_n) = ∏_{i<j} (π_i / (π_i + π_j))^{W_ij} (π_j / (π_i + π_j))^{N_ij − W_ij}.

Equivalently, in terms of latent performance scores S_i,

    Pr(S_i > S_j) = Pr(S_i − S_j > 0) = 1 − 1 / (1 + e^{log π_i − log π_j}) = π_i / (π_i + π_j).

Extension with Home Advantage and Tie. A natural extension of the Bradley-Terry model with "home-field advantage", according to [1], say, is to calculate the probabilities as

    Pr(i beats j) = θπ_i / (θπ_i + π_j)   if i is at home,
    Pr(i beats j) = π_i / (π_i + θπ_j)    if j is at home,

where θ > 0 measures the strength of the home-field advantage (or disadvantage).
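A minimal sketch of the Bradley-Terry model above: the win counts are illustrative toy data, and the strengths π_i are fitted here by generic numerical optimization of the log-likelihood rather than by the expectation-maximization implementation the authors used.

    import numpy as np
    from scipy.optimize import minimize

    # wins[i, j] = number of games in which team i beat team j (illustrative data)
    wins = np.array([[0, 3, 4],
                     [1, 0, 2],
                     [0, 2, 0]])
    n = wins.shape[0]

    def neg_log_likelihood(log_pi):
        pi = np.exp(log_pi)                  # keeps pi_i > 0 via log parameterisation
        ll = 0.0
        for i in range(n):
            for j in range(n):
                if i != j and wins[i, j] > 0:
                    ll += wins[i, j] * np.log(pi[i] / (pi[i] + pi[j]))
        return -ll

    res = minimize(neg_log_likelihood, x0=np.zeros(n))
    pi = np.exp(res.x)
    print("relative strengths:", np.round(pi / pi.sum(), 3))
    print("Pr(team 0 beats team 1):", round(pi[0] / (pi[0] + pi[1]), 3))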
Considering also a tie as a possible 50 final result of a game, the following calculations, proposed node home-i with weight x and an edge from node home- in [22], can be used : i to node away-j with weight y are added to the graph, π respectively. Our assumption is that if an outcome x : y i Pr(i beats j) = , has a high probability and it occurs, then it causes a small πi + απj change in the PageRank vector; hence δxy will be small. To simplify the notations let {δ1, . . . , δm} be the distance val- (α2 − 1)πiπj Pr(i ties j) = ues obtained by considering different results {E1, . . . , Em} (πi + απj )(απi + πj ) of the upcoming game between i and j. The goal now is where α > 1. Combining them is straightforward. In our to calculate the probability that a certain result occurs if experiments, we used the Matlab implementations found at {δ1, . . . , δm} is given. To do this, we use the following sim- ple statistics-based machine learning method. Let f +() be http://www.stats.ox.ac.uk/~caron/code/bayesbt/ using the expectation maximization algorithm, described in detail the probability density function of δi random variable where in [7]. the event (game result) Ei occurred. In our implementa- tion Ei ∈ {0 : 0, 1 : 0, 1 : 1, . . . , 5 : 5}, assuming that the probability of other results equals 0. Similarly, let f −() be 2.3 Rating-based Model with Learning the probability density function of δi random variable in Our new model is designed as follows. We will use the term which case the event (game result) Ei did not occur. To “game day” in each case when at least one match is played approximate the f +() and f −() functions, for each game on the given day. For any game day in which we make we use the training data set contains all results and related a forecast, we consider the results matrix that contains all δi (i = 1, . . . , m) values of the preceding T = 40 game days the results of the previous T = 40 game days. For the 40 of the considered game. In our experiments, the gamma dis- game days time window, the entries of the results matrix S tribution (and its density function) turned out to be a fairly are defined as Sij = #{scores team home-i achieved against good approximate for f +(δ) and f −(δ). team away-j}. To take into account the home-field effect, for each team i we distinguish team home-i and team away-i. Assuming that δ1, . . . , δm are independent, using the Bayes Thus, we define a 2n × 2n results matrix, which, in fact, theorem and the law of total probability, we can calculate describes a bipartite graph where each team appears both that in the home team side and the away team side of the graph. For rating the teams, a time-dependent PageRank method f +(δi) Q f −(δ k6=i k ) Pr(Ei|{δ1, . . . , δm}) = . is used. The PageRank scores are calculated according the P f +(δ f −(δ ` `) Qk6=l `) time-dependent PageRank equation We should note here that in this way we assign probabilities λ φ = Π = [I − (1 − λ)St to concrete game final results, which is another novelty of N mod(l1t)−1]−11, (1) our model. Then, for the upcoming game between i and j, defined in [19]. The damping factor is λ = 0.1, while we may the outcome probability of the event “i beats j” is calculated multiply each entry of S with the exponential function 0.98α as to consider time-dependency and obtaining S X mod, where α Pr(i beats j) = Pr(Ek|{δ1, . . . , δm}), denotes the number of game days elapsed since a given result k: Ek encodes a result occurred (and stored in S). 
Note, that a home team and an of team-i win away team PageRank values are calculated for each team. where we sum over those Ek results for which i beats j (i.e. We would like to establish a connection between team home- 1:0, 2:0, 2:1, 3:0, 3:1, etc.). The probabilities Pr(i ties j) i and team away-i using the assumption that home-i is not and Pr(j beats i) can be calculated in a similar way. weaker than away-i. In our implementation we assumed that home-i had a win 2 : 1 against away-i to give a positive bias for home-i at the beginning. In our experiments this setup 3. EXPERIMENTAL RESULTS performed well, but it was not optimized precisely. To measure the accuracy of the forecasting we calculate the mean squared error, which is often called Brier scoring rule Using the above-defined results matrix S and the PageR- in the forecasting literature [4]. The Brier score measures the ank rating vector φ, we assign probabilities to the outcomes mean squared difference between the predicted probability {home team win, tie, away team win} of an upcoming game assigned to the possible outcomes for event E and the actual in game day r between home-i and away-j as follows. Be- outcome oE. Suppose that for a single game g, between i and fore the game day in which we make the forecast, let the j, the forecast is pg = (pgw, pg, pg) contains the probabilities t l calculated PageRank rating vector be φr−1(V ). We use δr of i wins, the game is a tie and i loses, respectively. Let 40 xy to measure how the rating vector of the teams changes if the actual outcome of the game be og = (ogw, og, og), where t l the result of an upcoming game between teams i and j exactly one element is 1, the other two are 0. Noting that is x : y, where x, y = 0, 1, . . . are the scores achieved by the number of games played (and predicted) is N , BS is team i and team j, respectively1. We define δrxy as the Eu- defined as clidean distance between φr−1(V ) and φr 40 40(V ) that is the N rating vector for the new results matrix obtained by adding 1 X BS = ||pg − og||22 x to S N ij and y to Sn+j,i. In the results graph interpreta- g=1 tion this simply means that an edge from node away-j to N 1 X 1We should note here that if the result is 0 : 0, then x = = [(pg − og)2 + (pg − og)2]. N w − og w )2 + (pg t t l l y = 1/2 is used. g=1 51 The best score achievable is 0. In the case of three pos- 6. REFERENCES sible outcomes (win, lost, tie) we can easily see that the [1] A. Agresti. Categorical data analysis. John Wiley & forecast pg = (1/3, 1/3, 1/3) (for each game g and any N ) Sons, New York, 1996. gives accuracy BS = 2/3 = 0.666. We consider this value [2] D. Barrow, I. Drayer, P. Elliott, G. Gaut, and B. Osting. Ranking rankings: an empirical comparison as a worst-case benchmark. One question of our investiga- of the predictive power of sports ranking methods. tion is that how better BS values can be achieved using our Journal of Quantitative Analysis in Sports, method, and how close we can get to the betting agencies’ 9(2):187–202, 2013. fairly good predictions. [3] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired The data set we used contained all final results of given comparisons. Biometrika, 39(3-4):324–345, 1952. seasons of some football leagues, listed in the first two col- [4] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, umn of Table 1. We tested our method as it was described 78(1):1–3, 1950. in Sec. 2.3. 
We start predicting games starting from the [5] K. Butler and J. T. Whelan. The existence of 41th game day; for each single game predictions are made maximum likelihood estimates in the Bradley-Terry using the results of the previous 40 game day before that model and its extensions. arXiv preprint game. The Brier scores were calculated using all predic- math/0412232, 2004. tions we made. Our initial results are summarized in Ta- [6] T. Callaghan, P. J. Mucha, and M. A. Porter. ble 1. To calculate the betting odds probabilities we used Random walker ranking for NCAA division IA football. American Mathematical Monthly, the betting odds provided by bet365 bookmaker available 114(9):761–777, 2007. at http://www.football-data.co.uk/. We could see that [7] F. Caron and A. Doucet. Efficient bayesian inference these predictions gave the best accuracy score (BS) in each for generalized Bradley–Terry models. Journal of case. We highlighted the values where the difference between Computational and Graphical Statistics, the Bradley-Terry method and the PageRank method was 21(1):174–196, 2012. higher than 0.01. Although we can see that slightly more [8] A. C. Constantinou, N. E. Fenton, and M. Neil. than half of the cases the Bradley-Terry model gives a better Pi-football: A bayesian network model for forecasting accuracy, the results are still promising considering the fact association football match outcomes. Knowledge-Based Systems, 36:322–339, 2012. that the parameters of our method and the implementation [9] D. Delen, D. Cogdell, and N. Kasap. A comparative are far from being optimized. analysis of data mining methods in predicting NCAA bowl outcomes. International Journal of Forecasting, 28(2):543–552, 2012. 4. CONCLUSIONS [10] M. J. Dixon and P. F. Pope. The value of statistical We presented a new model for probabilistic forecasting in forecasts in the UK association football betting sports, based on rating methods, that simply use the histor- market. International Journal of Forecasting, 20(4):697–711, 2004. ical game results data of the given sport competition. We [11] D. Forrest, J. Goddard, and R. Simmons. Odds-setters provided a forward-looking type graph based approach. The as forecasters: The case of English football. assumption of our model is that the rating of the teams after International Journal of Forecasting, 21(3):551–564, a game day is correctly reflects their current relative perfor- 2005. mance. We consider that the smaller the changing in the [12] R. Gill and J. Keating. Assessing methods for college rating vector after a certain result occurs in an upcoming football rankings. Journal of Quantitative Analysis in single game, the higher the probability that this event will Sports, 5(2), 2009. occur. Performing experiments on results data sets of Eu- [13] J. Goddard and I. Asimakopoulos. Forecasting football results and the efficiency of fixed-odds betting. ropean football championships, we observed that this model Journal of Forecasting, 23(1):51–66, 2004. performed well in general in terms of predictive accuracy. [14] A. Joseph, N. E. Fenton, and M. Neil. Predicting However, we should note here, that parameter fine tuning football results using bayesian nets and other machine and optimizing certain parts of our implementation are tasks learning techniques. Knowledge-Based Systems, of future work. 19(7):544–553, 2006. [15] D. Karlis and I. Ntzoufras. Analysis of sports data by We emphasize, that our methodology can be also useful to using bivariate Poisson models. 
Journal of the Royal compare different rating methods by measuring that which Statistical Society: Series D (The Statistician), 52(3):381–393, 2003. one reflects better the actual strength (rating) of the teams [16] A. N. Langville and C. D. Meyer. Google’s PageRank according to our interpretation. Finally we should add that and beyond: The science of search engine rankings. the model is general and may be used to investigate such Princeton University Press, 2011. graph processes where the number of nodes is fixed and edges [17] J. Lasek, Z. Szlávik, and S. Bhulai. The predictive are changing over time; moreover it also has a potential to power of ranking systems in association football. link prediction. International Journal of Applied Pattern Recognition, 1(1):27–46, 2013. [18] C. K. Leung and K. W. Joseph. Sports data mining: 5. ACKNOWLEDGMENTS Predicting results for the college football games. Procedia Computer Science, 35:710–719, 2014. This work was partially supported by the National Research, [19] A. London, J. Németh, and T. Németh. Development and Innovation Office - NKFIH, SNN117879. Time-dependent network algorithm for ranking in sports. Acta Cybernetica, 21(3):495–506, 2014. Miklós Krész acknowledges the European Commission for [20] M. J. Maher. Modelling association football scores. funding the InnoRenew CoE project (Grant Agreement #739574) Statistica Neerlandica, 36(3):109–118, 1982. under the Horizon2020 Widespread-Teaming program. 52 Table 1: Accuracy results on football data sets. The values where the difference between the Bradley-Terry method and the PageRank method was higher than 0.01 are shown in bold. League Season Betting odds error Bradley-Terry error PageRank method error 2011/12 0.58934 0.60864 0.59653 Premier League 2012/13 0.56461 0.59744 0.58166 2013/14 0.54191 0.55572 0.59406 2014/15 0.55740 0.60126 0.60966 2011/12 0.58945 0.59994 0.59097 Bundesliga 2012/13 0.57448 0.59794 0.58622 2013/14 0.55724 0.57803 0.60125 2014/15 0.57268 0.60349 0.60604 2011/12 0.54598 0.57837 0.58736 La Liga 2012/13 0.56417 0.58916 0.60205 2013/14 0.57908 0.58016 0.60473 2014/15 0.52317 0.55888 0.56172 [21] P. F. Pope and D. A. Peel. Information, prices and efficiency in a fixed-odds betting market. Economica, pages 323–341, 1989. [22] P. Rao and L. L. Kupper. Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association, 62(317):194–204, 1967. [23] J. A. Trono. Rating/ranking systems, post-season bowl games, and ’the spread’. Journal of Quantitative Analysis in Sports, 6(3), 2010. [24] C. Wang and M. L. Vandebroek. A model based ranking system for soccer teams. Research report, available at SSRN 2273471, 2013. 53 54 Indeks avtorjev / Author index Black Michaela ............................................................................................................................................................................. 33 Carlin Paul .................................................................................................................................................................................... 33 Čerin Matej ................................................................................................................................................................................... 37 Dujič Darko .................................................................................................................................................................................. 
29 Džeroski Sašo ......................................................................................................................................................................... 41, 45 Fuart Flavio .................................................................................................................................................................................. 33 Gojo David ................................................................................................................................................................................... 29 Grobelnik Marko ................................................................................................................................................................ 9, 13, 33 Jenko Miha ..................................................................................................................................................................................... 5 Jovanoski Viktor .......................................................................................................................................................................... 25 Kenda Klemen .............................................................................................................................................................................. 37 Koprivec Filip .............................................................................................................................................................................. 37 Kostovska Ana ............................................................................................................................................................................. 41 Krész Miklós ................................................................................................................................................................................ 49 London András ............................................................................................................................................................................. 49 Massri M. Besher ......................................................................................................................................................................... 13 Mladenić Dunja ............................................................................................................................................................................ 21 Németh József .............................................................................................................................................................................. 49 Novak Blaž ................................................................................................................................................................................... 17 Novak Erik ..................................................................................................................................................................................... 5 Novalija Inna ............................................................................................................................................................................ 9, 13 Panov Panče ........................................................................................................................................................................... 
41, 45 Pejović Veljko .............................................................................................................................................................................. 21 Pita Costa Joao ............................................................................................................................................................................. 33 Rupnik Jan .................................................................................................................................................................................... 25 Santanam Raghu ........................................................................................................................................................................... 33 Stopar Luka .................................................................................................................................................................................. 33 Sun Chenlu ................................................................................................................................................................................... 33 Tolovski Ilin ................................................................................................................................................................................. 45 Urbančič Jasna ......................................................................................................................................................................... 5, 21 Wallace Jonathan.......................................................................................................................................................................... 33 55 56 Konferenca / Conference Uredila / Edited by Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD Dunja Mladenić, Marko Grobelnik Document Outline 01 - Naslovnica-sprednja-C 02 - Naslovnica - notranja - C 03 - Kolofon - C 04 - 05 - IS2018 - Skupni del 07 - Kazalo - C 08 - Naslovnica podkonference - C 09 - Predgovor podkonference - C 10 - Programski odbor podkonference - C 11 - Clanki - C 01 - NovakErik Abstract 1 Introduction 2 Related Work 3 Data Preprocessing 4 Recommender Engine 4.1 Recommendation Results 5 Future Work and Conclusion Acknowledgments References 02 - Novalija 1. INTRODUCTION 2. BACKGROUND The development of smart labour market statistics touches a number of issues from labour market policies area and would provide contributions to questions related to: - job creation, - education and training systems, - labour market segmentation, - improving skill supply and productivity. For instance, the analysis of the available job vacancies could offer an insight into what skills are required in the particular area. Effective trainings based on skills demand could be organized and that would lead into better labour market integrat... A number of stakeholder types will benefit from the development of smart labour market statistics. In particular, the targeted stakeholders are: 3. RELATED WORK The European Data Science Academy (EDSA) [1] was an H2020 EU project that ran between February 2015 and January 2018. The objective of the EDSA project was to deliver the learning tools that are crucially needed to close the skill gap in Data Science ... 
- Analyzed the sector specific skillsets for data analysts across Europe with results reflected at EDSA demand and supply dashboard; - Developed modular and adaptable curricula to meet these data science needs; and - Delivered training supported by multiplatform resources, introducing Learning pathway mechanism that enables effective online training. 4. PROBLEM DEFINITION 4.1 DATA SOURCES 4.2 CONCEPTUAL ARCHITECTURE 4.3 SCENARIOS 4.3.1 DEMAND ANALYSIS 4.3.2 SKILLS ONTOLOGY DEVELOPMENT 4.3.3 SKILLS ONTOLOGY EVOLUTION 5. STATISTICAL INDICATORS 6. CONCLUSION AND FUTURE WORK 7. ACKNOWLEDGMENTS 8. REFERENCES 03 - Massri 1. INTRODUCTION 2. RELATED WORK 3. DESCRIPTION OF DATA 4. METHODOLOGY 4.1 Clustering and Formatting Data 4.2 Choosing the Main Entities 4.3 Detecting the Characteristics of Relationship 5. VISUALIZING THE RESULTS 5.1 Characteristics of the Main Graph 5.2 Main Functionality 5.3 Displaying Relation Information 6. CONCLUSION AND FUTURE WORK 7. ACKNOWLEDGMENTS This work was supported by the euBusinessGraph (ICT-732003-IA) project [6]. 8. REFERENCES 04 - NovakBlaz 1. INTRODUCTION 2. EXPERIMENTAL SETUP 3. RESULTS 4. CONCLUSIONS AND FUTURE WORK 5. ACKNOWLEDGEMENTS 6. REFERENCES 05 - Urbancic Introduction Related work Proposed approach Results Conclusions Acknowledgments References 06 - Jovanoski 07 - Gojo 08 - PitaCosta 09 - Koprivec Introduction PerceptiveSentinel Platform Data Data Acquisition Data Preprocessing Methodology Sample Data Feature Vectors Experiment Results Conclusions Acknowledgments References 10 - Kostovska 11 - Tolovski 12 - London 12 - Index - C 13 - Naslovnica-zadnja-C Blank Page Blank Page Blank Page Blank Page Blank Page Blank Page 04 - 05 - IS2018 - Predgovor in odbori.pdf 04 - IS2018 - Predgovor 05 - IS2018 - Konferencni odbori