Zbornik 20. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2017, Zvezek C
Proceedings of the 20th International Multiconference INFORMATION SOCIETY – IS 2017, Volume C

Odkrivanje znanja in podatkovna skladišča – SiKDD
Data Mining and Data Warehouses – SiKDD

Uredila / Edited by: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si
9.–13. oktober 2017 / 9–13 October 2017, Ljubljana, Slovenia

Urednika:
Dunja Mladenić, Laboratorij za umetno inteligenco, Institut »Jožef Stefan«, Ljubljana
Marko Grobelnik, Laboratorij za umetno inteligenco, Institut »Jožef Stefan«, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič

Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2017

Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID=292473088
ISBN 978-961-264-114-6 (pdf)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2017

Multikonferenca Informacijska družba (http://is.ijs.si) je z dvajseto zaporedno prireditvijo osrednji srednjeevropski dogodek na področju informacijske družbe, računalništva in informatike. Letošnja prireditev je ponovno na več lokacijah, osrednji dogodki pa so na Institutu »Jožef Stefan«.

Informacijska družba, znanje in umetna inteligenca so spet na razpotju tako same zase kot glede vpliva na človeški razvoj. Se bo eksponentna rast elektronike po Moorovem zakonu nadaljevala ali stagnirala? Bo umetna inteligenca nadaljevala svoj neverjetni razvoj in premagovala ljudi na čedalje več področjih in s tem omogočila razcvet civilizacije, ali pa bo eksponentna rast prebivalstva zlasti v Afriki povzročila zadušitev rasti? Čedalje več pokazateljev kaže v oba ekstrema – da prehajamo v naslednje civilizacijsko obdobje, hkrati pa so planetarni konflikti sodobne družbe čedalje težje obvladljivi.

Letos smo v multikonferenco povezali dvanajst odličnih neodvisnih konferenc. Predstavljenih bo okoli 200 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica, ki se ponaša s 40-letno tradicijo odlične znanstvene revije. Odlične obletnice!

Multikonferenco Informacijska družba 2017 sestavljajo naslednje samostojne konference:
- Slovenska konferenca o umetni inteligenci
- Soočanje z demografskimi izzivi
- Kognitivna znanost
- Sodelovanje, programska oprema in storitve v informacijski družbi
- Izkopavanje znanja in podatkovna skladišča
- Vzgoja in izobraževanje v informacijski družbi
- Četrta študentska računalniška konferenca
- Delavnica »EM-zdravje«
- Peta mednarodna konferenca kognitonike
- Mednarodna konferenca za prenos tehnologij – ITTC
- Delavnica »AS-IT-IC«
- Robotika

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS).
V imenu organizatorjev konference se zahvaljujemo združenjem in inštitucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

V 2017 bomo petič podelili nagrado za življenjske dosežke v čast Donalda Michija in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe bo prejel prof. dr. Marjan Krisper. Priznanje za dosežek leta bo pripadlo prof. dr. Andreju Brodniku. Že šestič podeljujemo nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono je dobilo padanje slovenskih sredstev za akademsko znanost, tako da smo sedaj tretji najslabši po tem kriteriju v Evropi, jagodo pa »e-recept«. Čestitke nagrajencem!

Bojan Orel, predsednik programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD – INFORMATION SOCIETY 2017

In its 20th year, the Information Society Multiconference (http://is.ijs.si) remains one of the leading conferences in Central Europe devoted to the information society, computer science and informatics. In 2017 it is organized at various locations, with the main events at the Jožef Stefan Institute.

The pace of progress of the information society, knowledge and artificial intelligence is speeding up, and it seems we are again at a turning point. Will the progress of electronics continue according to Moore's law or will it start stagnating? Will AI continue to outperform humans at more and more activities and in this way enable the predicted unseen human progress, or will the growth of the human population, in particular in Africa, cause global decline? Both extremes seem more and more likely – fantastic human progress, and planetary decline caused by humans destroying our environment and each other.

The Multiconference runs in parallel sessions with 200 presentations of scientific papers at twelve conferences, round tables, workshops and award ceremonies. Selected papers will be published in the Informatica journal, which has a 40-year tradition as an excellent research publication. These are remarkable achievements.

The Information Society 2017 Multiconference consists of the following conferences:
- Slovenian Conference on Artificial Intelligence
- Facing Demographic Challenges
- Cognitive Science
- Collaboration, Software and Services in Information Society
- Data Mining and Data Warehouses
- Education in Information Society
- 4th Student Computer Science Research Conference
- Workshop Electronic and Mobile Health
- 5th International Conference on Cognitonics
- International Conference of Transfer of Technologies – ITTC
- Workshop »AS-IT-IC«
- Robotics

The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e. the Slovenian chapter of the ACM, SLAIS, DKZ and the second national engineering academy, the Slovenian Engineering Academy.

In the name of the conference organizers we thank all the societies and institutions, and particularly all the participants, for their valuable contributions and their interest in this event, and the reviewers for their thorough reviews. For the fifth year, the award for life-long outstanding contributions will be delivered in memory of Donald Michie and Alan Turing.
The Michie-Turing award will be given to Prof. Marjan Krisper for his life-long outstanding contribution to the development and promotion of the information society in our country. In addition, an award for current achievements will be given to Prof. Andrej Brodnik. The information lemon goes to the national funding of academic science, which has dropped Slovenia to the third worst position in Europe. The information strawberry is awarded to the medical e-prescription (»e-recept«) project. Congratulations!

Bojan Orel, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa
Heiner Benking, Germany
Se Woo Cheon, South Korea
Howie Firth, UK
Olga Fomichova, Russia
Vladimir Fomichov, Russia
Vesna Hljuz Dobric, Croatia
Alfred Inselberg, Israel
Jay Liebowitz, USA
Huan Liu, Singapore
Henz Martin, Germany
Marcin Paprzycki, USA
Karl Pribram, USA
Claude Sammut, Australia
Jiri Wiedermann, Czech Republic
Xindong Wu, USA
Yiming Ye, USA
Ning Zhong, USA
Wray Buntine, Australia
Bezalel Gavish, USA
Gal A. Kaminka, Israel
Mike Bain, Australia
Michela Milano, Italy
Derong Liu, Chicago, USA
Toby Walsh, Australia

Organizing Committee:
Matjaž Gams, chair
Mitja Luštrek
Lana Zemljak
Vesna Koricki
Mitja Lasič
Robert Blatnik
Aleš Tavčar
Blaž Mahnič
Jure Šorn
Mario Konecki

Programme Committee:
Bojan Orel, chair
Franc Solina, co-chair
Viljan Mahnič, co-chair
Cene Bavec, co-chair
Tomaž Kalin, co-chair
Jozsef Györkös, co-chair
Tadej Bajd
Jaroslav Berce
Mojca Bernik
Marko Bohanec
Ivan Bratko
Andrej Brodnik
Dušan Caf
Saša Divjak
Tomaž Erjavec
Bogdan Filipič
Andrej Gams
Matjaž Gams
Mitja Luštrek
Marko Grobelnik
Nikola Guid
Marjan Heričko
Borka Jerman Blažič Džonova
Gorazd Kandus
Urban Kordeš
Marjan Krisper
Andrej Kuščer
Jadran Lenarčič
Borut Likar
Janez Malačič
Olga Markič
Dunja Mladenič
Franc Novak
Vladislav Rajkovič
Grega Repovš
Ivan Rozman
Niko Schlamberger
Stanko Strmčnik
Jurij Šilc
Jurij Tasič
Denis Trček
Andrej Ule
Tanja Urbančič
Boštjan Vilfan
Baldomir Zajc
Blaž Zupan
Boris Žemva
Leon Žlajpah

Invited lecture

AN UPDATE FROM THE AI & MUSIC FRONT

Gerhard Widmer
Institute for Computational Perception, Johannes Kepler University Linz (JKU), and Austrian Research Institute for Artificial Intelligence (OFAI), Vienna

Abstract

Much of current research in Artificial Intelligence and Music, and particularly in the field of Music Information Retrieval (MIR), focuses on algorithms that interpret musical signals and recognize musically relevant objects and patterns at various levels (from notes to beats and rhythm, to melodic and harmonic patterns and higher-level segment structure), with the goal of supporting novel applications in the digital music world. This presentation will give the audience a glimpse of what musically "intelligent" systems can currently do with music, and what this is good for. However, we will also find that while some of these capabilities are quite impressive, they are still far from (and do not require) a deeper "understanding" of music. An ongoing project will be presented that aims to take AI & music research a bit closer to the "essence" of music, going beyond surface features and focusing on the expressive aspects of music, and how these are communicated in music. This raises a number of new research challenges for the field of AI and Music (discussed in much more detail in [Widmer, 2016]).
As a first step, we will look at recent work on computational models of expressive music performance, and will show some examples of the state of the art (including the result of a recent musical 'Turing test').

References

Widmer, G. (2016). Getting Closer to the Essence of Music: The Con Espressione Manifesto. ACM Transactions on Intelligent Systems and Technology 8(2), Article 19.

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča – SiKDD / Data Mining and Data Warehouses – SiKDD ... 1
PREDGOVOR / FOREWORD ... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ... 5
Annotating Documents with Relevant Wikipedia Concepts / Brank Janez, Leban Gregor, Grobelnik Marko ... 7
Impact of News Events on the Financial Markets / Torkar Miha, Mladenic Dunja ... 11
Challenges in media monitoring of worldwide news sources to support public health / Pita Costa Joao, Fuart Flavio, Grobelnik Marko, Leban Gregor, Belyaeva Evgenia ... 15
Ontology-based translation memory maintenance / Repar Andraz, Pollak Senja ... 19
Audience Segmentation Based on Topic Profiles / Kladnik Matic, Mladenic Dunja ... 23
Building Client's Risk Profile Based on Call Detail Records / Herga Zala, Doyle Casey, Moore Pat ... 27
Connecting Professional Skill Demand with Supply / Novak Erik, Novalija Inna ... 31
Analyzing raw log files to find execution anomalies / Jovanoski Viktor, Rupnik Jan, Karlovcec Mario, Fortuna Blaz ... 35
Usage of SVM for a Triggering Mechanism for Higgs Boson Detection / Kenda Klemen, Mladenic Dunja ... 39
A methodology to evaluate the evolution of networks using topological data analysis / Pita Costa Joao, Galinac Grbac Tihana ... 43
Improving mortality prediction for intensive care unit patients using text mining techniques / Kocbek Primoz, Fijacko Nino ... 47
Indeks avtorjev / Author index ... 51

Zbornik 20. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2017, Zvezek C
Proceedings of the 20th International Multiconference INFORMATION SOCIETY – IS 2017, Volume C

Odkrivanje znanja in podatkovna skladišča – SiKDD
Data Mining and Data Warehouses – SiKDD

Uredila / Edited by: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si
9. oktober 2017 / 9 October 2017, Ljubljana, Slovenia

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale.
Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke – pojavilo se je t.i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line Analytical Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore ter na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz. z drugimi besedami po tem, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

Dunja Mladenić, Marko Grobelnik

FOREWORD

Data-driven technologies have progressed significantly since the mid-1990s. The first phase was mainly focused on storing the data and accessing it efficiently; it resulted in the development of an industry of tools for managing large databases, related standards, supporting query languages, etc. After the initial period, when data storage was no longer a primary problem, development progressed towards analytical functionality for extracting added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line Analytical Processing entered as a usual part of a company's information system portfolio, requiring the user to pose well-defined questions about the aggregated views of the data. Data Mining is a technology developed after the year 2000, offering automatic data analysis that tries to obtain new discoveries from the existing data and to give the user new insights into the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis; Data, Text and Multimedia Mining; Semantic Technologies; Link Detection and Link Analysis; Social Network Analysis; and Data Warehouses.

Dunja Mladenić, Marko Grobelnik

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Dunja Mladenić
Marko Grobelnik

ANNOTATING DOCUMENTS WITH RELEVANT WIKIPEDIA CONCEPTS

Janez Brank, Gregor Leban, Marko Grobelnik
Artificial Intelligence Laboratory, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Tel: +386 1 4773778; fax: +386 1 4251038
e-mail: {janez.brank,gregor.leban,marko.grobelnik}@ijs.si

ABSTRACT

We describe an efficient approach for annotating a document with relevant concepts from the Wikipedia. A pagerank-based method is used to identify a coherent set of relevant concepts considering the input document as a whole. The proposed approach is suitable for parallel processing and can support any language for which a sufficiently large Wikipedia is available.

1 INTRODUCTION

Recent years have seen a growth in the use of semantic technologies. However, in many contexts we still deal with largely unstructured textual documents that lack explicit semantic information such as might be required for further processing with semantic technologies. This leads to the problem of semantic annotation or semantic enrichment as an important preparatory step before further processing of a document. Given a document and an ontology covering the domain of interest, the challenge is to identify concepts from that ontology that are relevant to the document or that are referred to by it, as well as to identify specific passages in the document where the concepts in question are mentioned.

A specific type of semantic annotation, known as wikification, involves using the Wikipedia as a source of possible semantic annotations [1][2]. In this setting, the Wikipedia is treated as a large and fairly general-purpose ontology: each page is thought of as representing a concept, while the relations between concepts are represented by internal hyperlinks between different Wikipedia pages, as well as by Wikipedia's category memberships and cross-language links.

The advantage of this approach is that the Wikipedia is a freely available source of information, it covers a wide range of topics, it has a rich internal structure, and each concept is associated with a semi-structured textual document (i.e. the contents of the corresponding Wikipedia article) which can be used to aid in the process of semantic annotation. Furthermore, the Wikipedia is available in a number of languages, with cross-language links being available to identify pages that refer to the same concept in different languages, thus making it easier to support multilingual and cross-lingual annotation.

The remainder of this paper is structured as follows. In Section 2, we present the pagerank-based approach to wikification used in our wikifier. In Section 3, we describe our implementation and present some experimental evaluation. Section 4 contains conclusions and a discussion of possible future work.
2 PAGERANK-BASED WIKIFICATION

The task of wikifying an input document can be broken down into several closely interrelated subtasks: (1) identify phrases (or words) in the input document that refer to a Wikipedia concept; (2) determine which concept exactly a phrase refers to; (3) determine which concepts are relevant enough to the document as a whole that they should be included in the output of the system (i.e. presented to the user).

We follow the approach described by Zhang and Rettinger [1]. This approach makes use of the rich internal structure of hyperlinks between Wikipedia pages. A hyperlink can be thought of as consisting of a source page, a target page, and the link text (also known as the anchor text). If a source page contains a link with the anchor text a and the target page t, this is an indication that the phrase a might be a reference to (or a representation of) the concept that corresponds to page t. Thus, if the input document that we're trying to wikify contains the phrase a, it might be the case that this occurrence of a in the input document also constitutes a mention of the concept t, and the concept t is a candidate annotation for this particular phrase.

2.1 Disambiguation

In the Wikipedia, there may be many different links with the same anchor text a, and they might not all be pointing to the same target page. For example, in the English-language Wikipedia, there are links with a = "Tesla" that variously point to pages about the inventor, the car manufacturer, the unit in physics, a band, a film, and several other concepts. Thus, when such a phrase a occurs in an input document, there are several concepts that can be regarded as candidate annotations for that particular mention, and we have to determine which of them is actually relevant. This is the problem of disambiguation, similar to that of word sense disambiguation in natural language processing.

There are broadly two approaches to disambiguation, local and global. In the local approach, each mention is disambiguated independently of the others, while the global approach aims to treat the document as a whole and disambiguate all the mentions in it as a group. The intuition behind the global approach is that the document that we're annotating is about some topic, and the concepts that we use as annotations should be about that topic as well. If the document contains many mentions that include, as some of their candidate annotations, some car-related concepts, this makes it more likely that we should treat the mention of "Tesla" as a reference to Tesla the car manufacturer as opposed to e.g. a reference to Nikola Tesla or to Tesla the rock band.
2.2 The mention-concept graph

To implement the global disambiguation approach, our Wikifier begins by constructing a mention-concept graph for the input document. (Some authors, e.g. [2], refer to this as a mention-entity graph, but we prefer the term "mention-concept graph", as some of the Wikipedia pages do not necessarily correspond to concepts that we usually think of as entities, and our wikifier does not by default try to exclude them.) This can be thought of as a bipartite graph in which the left set of vertices corresponds to mentions and the right set of vertices corresponds to concepts. A directed edge a → c exists if and only if the concept c is one of the candidate annotations for the mention a (i.e. if there exists in the Wikipedia a hyperlink with the anchor text a and the target c). A transition probability is also assigned to each such edge, defined as the ratio

P(a → c) = [number of hyperlinks in the Wikipedia having the anchor text a and the target c] / [number of hyperlinks in the Wikipedia having the anchor text a].
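As an illustration of this construction, here is a minimal Python sketch (our own, not part of the paper); the `anchor_stats` structure, which maps each anchor text to its per-target hyperlink counts, and the toy numbers in the example are hypothetical stand-ins for statistics precomputed from a Wikipedia dump:

```python
from collections import defaultdict

def build_mention_concept_graph(mentions, anchor_stats):
    """Build the mention -> concept edges with transition probabilities.

    anchor_stats: dict mapping an anchor text a to a dict
    {target page c: number of Wikipedia hyperlinks with anchor a and target c}.
    """
    edges = defaultdict(dict)  # edges[a][c] = P(a -> c)
    for a in mentions:
        targets = anchor_stats.get(a, {})
        total = sum(targets.values())  # all hyperlinks with anchor text a
        for c, n_ac in targets.items():
            edges[a][c] = n_ac / total  # ratio of link counts, as defined above
    return edges

# Toy example: "Tesla" used 70 times as a link to Nikola Tesla,
# 25 times to the car manufacturer, 5 times to the band.
stats = {"Tesla": {"Nikola_Tesla": 70, "Tesla,_Inc.": 25, "Tesla_(band)": 5}}
print(build_mention_concept_graph(["Tesla"], stats))
```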
This graph is then augmented by edges between concepts, the idea being that an edge c → c' should indicate that the concepts c and c' are "semantically related", in the sense that if one of them is relevant to a given input document, the other one is also more likely to be relevant to that document. Following [1], the internal link structure of the Wikipedia is used to calculate a measure of semantic relatedness. Informally, the idea is that if c and c' are closely related, then other Wikipedia pages that point to c are likely to also point to c', and vice versa. Let L_c be the set of Wikipedia pages that contain a hyperlink to c, and let N be the total number of concepts in the Wikipedia; then the semantic relatedness of c and c' can be defined as

SR(c, c') = 1 − [log(max{|L_c|, |L_c'|}) − log|L_c ∩ L_c'|] / [log N − log(min{|L_c|, |L_c'|})].

In the graph, we add an edge of the form c → c' wherever the semantic relatedness SR(c, c') is > 0. The transition probability of this edge is defined as proportional to the semantic relatedness: P(c → c') = SR(c, c') / Σ_c'' SR(c, c'').

This graph is then used as the basis for calculating a vector of pagerank scores, one for each vertex. This is done using the usual iterative approach where in each iteration, each vertex distributes its pagerank score to its immediate successors in the graph, in proportion to the transition probabilities on its outgoing edges:

PR_new(u) = β PR_0(u) + (1 − β) Σ_v PR_old(v) P(v → u).

The baseline distribution of pagerank, PR_0, is used both to help the process converge and also to counterbalance the fact that in our graph there are no edges pointing into the mention vertices. In our case, PR_0(u) is defined as 0 if u is a concept vertex; if u is a mention vertex, we use PR_0(u) = z · [number of Wikipedia pages containing the phrase u as the anchor text of a hyperlink] / [number of Wikipedia pages containing the phrase u], where z is a normalization constant ensuring that Σ_u PR_0(u) = 1. We used β = 0.1 as the stabilization parameter.

The intuition behind this approach is that in each iteration of the pagerank calculation process, the pagerank flows into a concept vertex c from mentions that are closely associated with the concept c and from other concepts that are semantically related to c. Thus, after a few iterations, pagerank should tend to accumulate in a set of concepts that are closely semantically related to each other and that are strongly associated with words and phrases that appear in the input document, which is exactly what we want in the context of global disambiguation.
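A minimal sketch of this iteration (ours; the dictionary-based graph encoding and the fixed iteration count are simplifications rather than the authors' implementation):

```python
def pagerank(vertices, out_edges, pr0, beta=0.1, iters=50):
    """Iterate PR_new(u) = beta*PR_0(u) + (1-beta) * sum_v PR_old(v) * P(v -> u).

    out_edges[v] is a dict {u: P(v -> u)}; pr0 is the baseline distribution,
    nonzero only on mention vertices and summing to 1.
    """
    pr = dict(pr0)  # start from the baseline distribution
    for _ in range(iters):
        new = {u: beta * pr0.get(u, 0.0) for u in vertices}
        for v, successors in out_edges.items():
            for u, p in successors.items():
                new[u] += (1 - beta) * pr.get(v, 0.0) * p
        pr = new
    return pr
```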
2.3 Using pagerank for disambiguation

Once the pagerank values of all the vertices in the graph have been calculated, we use the pagerank values of concepts to disambiguate the mentions. If there are edges from a mention a to several concepts c, we choose the concept with the highest pagerank as the one that is relevant to this particular mention a. We say that this concept is supported by the mention a. At the end of this process, concepts that are not supported by any mention are discarded as not being relevant to the input document.

The remaining concepts are then sorted in decreasing order of their pagerank. Let the i'th concept in this order be c_i and let its pagerank be PR_i, for i = 1, …, n. Concepts with a very low pagerank value are less likely to be relevant, so it makes sense to apply a further filtering step at this point and discard concepts whose pagerank is below a user-specified threshold. However, where exactly this threshold should lie depends on whether the user wants to prioritize precision or recall. Furthermore, the absolute values of pagerank can vary a lot from one document to another, e.g. depending on the length of the document, the number of mentions and candidate concepts, etc. Thus we apply the user-specified threshold in the following manner: given the user-specified threshold value δ ∈ [0, 1], we output the concepts c_1, …, c_m, where m is the least integer such that Σ_{i=1..m} PR_i² ≥ δ Σ_{i=1..n} PR_i². In other words, we report as many top-ranking concepts as are needed to cover δ of the total sum of squared pageranks of all the concepts. We use δ = 0.8 as a broadly reasonable default value, though the user can require a different threshold depending on their requirements. For each reported concept, we also output a list of the mentions that support it.
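The threshold rule translates directly into code; a small sketch under the paper's definitions (function and variable names are ours):

```python
def select_concepts(pageranks, delta=0.8):
    """Keep the top concepts c_1..c_m, with m the least integer such that
    sum_{i<=m} PR_i^2 >= delta * sum_{i<=n} PR_i^2."""
    ranked = sorted(pageranks.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(pr * pr for _, pr in ranked)
    covered, selected = 0.0, []
    for concept, pr in ranked:
        selected.append(concept)
        covered += pr * pr
        if covered >= delta * total:
            break
    return selected
```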
2.4 Treatment of highly ambiguous mentions

Our wikifier supports various minor heuristics and refinements in an effort to improve the performance of the baseline approach described in the preceding sections.

As described above, the anchor text of hyperlinks in the Wikipedia is used to identify mentions in an input document (i.e. words or phrases that may support an annotation). One downside of this approach is that some words or phrases occur as the anchor text of a very large number of hyperlinks in the Wikipedia, and these links point to a large number of different Wikipedia pages. In other words, such a phrase is highly ambiguous; it is not only unlikely to be disambiguated correctly, but also introduces noise into the mention-concept graph by introducing a large number of concept vertices, the vast majority of which will be completely irrelevant to the input document. This also slows down the annotation process by increasing the time needed to calculate the semantic relatedness between all pairs of candidate concepts.

We use several heuristics to deal with this problem. Suppose that a given mention a occurs, in the Wikipedia, as the anchor text of n hyperlinks pointing to k different target pages, and suppose that n_i of these links point to page c_i (for i = 1, …, k). We can now define the entropy of the mention a as the amount of uncertainty regarding the link target given the fact that its anchor text is a:

H(a) = − Σ_{i=1..k} (n_i/n) log(n_i/n).

If this entropy is above a user-specified threshold (e.g. 3 bits), we completely ignore the mention as being too ambiguous to be of any use. For mentions that pass this heuristic, we sort the target pages in decreasing order of n_i and use only the top few of them (e.g. the top 20) as candidates in our mention-concept graph. A third heuristic is to ignore candidates for which n_i itself is below a certain threshold (e.g. n_i < 2), the idea being that if such a phrase occurs only once as the anchor text of a link pointing to that candidate, this may well turn out to be noise and is best disregarded.

Optionally, the Wikifier can also be configured to ignore certain types of concepts based on their Wikidata class membership. This can be useful to exclude from consideration Wikipedia pages that do not really correspond to what is usually thought of as entities (e.g. "List of…" pages).

Another heuristic that we have found useful in reducing the noise in the output annotations is to ignore any mention that consists entirely of stopwords and/or very common words (the top 200 most frequent words in the Wikipedia for that particular language). For this as well as for other purposes, the text processing is done in a case-sensitive fashion, which e.g. allows us to ignore spurious links with the link text "the" while processing those that refer to the band "The The".

2.5 Miscellaneous heuristics

Semantic relatedness. As mentioned above, the definition of the semantic relatedness of two concepts, SR(c, c'), is based on the overlap between the sets L_c, L_c' of immediate predecessors of these two concepts in the Wikipedia link graph. Optionally, our Wikifier can compute semantic relatedness using immediate successors or immediate neighbours (i.e. both predecessors and successors) instead of immediate predecessors. However, our preliminary experiments indicated that these changes do not lead to improvements in performance, so they are disabled by default.

Extensions to disambiguation. Our Wikifier also supports some optional extensions of the disambiguation process. As described above, the default behavior when disambiguating a mention is to simply choose the candidate annotation with the highest pagerank value. Alternatively, after any heuristics from Section 2.4 have been applied, the remaining candidate concepts can be re-ranked using a different scoring function that takes other criteria besides pagerank into account. This is an opportunity to combine the global disambiguation approach with some local techniques. In general, a scoring function of the following type is supported:

score(c|a) = w_1 f(P(c|a)) PR(c) + w_2 S(c, d) + w_3 LS(c, a).

Here, a is the mention that we're trying to disambiguate, and c is the candidate concept that we're evaluating. P(c|a) is the probability that a hyperlink in the Wikipedia has c as its target, conditioned on the fact that it has a as its anchor text. f(x) can be either 1 (the default), x, or log(x). PR(c) is the pagerank of c's vertex in the mention-concept graph. S(c, d) is the cosine similarity between the text of the input document d and that of the Wikipedia page for the concept c. LS(c, a) is the cosine similarity between the context (e.g. the previous and next 3 words) in which a appears in the input document d, and the contexts in which hyperlinks with the target c appear in the Wikipedia. Finally, w_1, w_2, w_3 are weight constants. However, our preliminary experiments haven't shown any improvements from the addition of these heuristics, so they are disabled by default (f(x) = 1, w_2 = w_3 = 0) to save computational time and memory (storing the link contexts needed for the efficient computation of LS has turned out to be particularly memory intensive).
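To make the entropy heuristic of Section 2.4 concrete, a short sketch (ours); log base 2 is assumed because the threshold is expressed in bits:

```python
import math

def mention_entropy(target_counts):
    """H(a) = -sum_i (n_i/n) * log2(n_i/n) over the link targets of mention a."""
    n = sum(target_counts.values())
    return -sum((ni / n) * math.log2(ni / n) for ni in target_counts.values())

def candidate_targets(target_counts, max_entropy=3.0, top_k=20, min_links=2):
    """Apply the three pruning heuristics described above."""
    if mention_entropy(target_counts) > max_entropy:
        return []  # mention is too ambiguous; ignore it entirely
    kept = [(c, n) for c, n in target_counts.items() if n >= min_links]
    kept.sort(key=lambda cn: cn[1], reverse=True)
    return [c for c, _ in kept[:top_k]]  # keep only the most frequent targets
```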
3 IMPLEMENTATION AND EVALUATION

3.1 Implementation

Our implementation of the approach described in the preceding section is running as a web service and can be accessed at http://wikifier.org. The approach is suitable for parallel processing, as annotating one document is independent of annotating other documents, and any shared data used by the annotation process (e.g. the Wikipedia link graph, and a trie-based data structure that indexes the anchor text of all the hyperlinks) needs to be accessed only for reading and can thus easily be shared by an arbitrary number of worker threads. This allows for highly efficient processing of a large number of documents.

Our implementation currently processes on average more than 500,000 requests per day (the total length of the input documents averages about 1.2 GB per day), including all the documents from the JSI Newsfeed service [3]. The output is used, among other things, as a preprocessing step by the Event Registry system [4]. The wikifier currently supports all languages in which a Wikipedia with at least 1000 pages is available, amounting to a total of 134 languages. Admittedly, 1000 pages is much too small to achieve adequate coverage; however, about 60 languages have a Wikipedia with at least 100,000 pages, which is already enough for many practical applications.

Annotations are returned in JSON format and can optionally include detailed information about support (which mentions support each annotation), alternative candidate annotations (concepts that were considered as candidates during the disambiguation process but were rejected in favour of some other more highly scored concept), and the WikiData/DbPedia class membership of the proposed annotations. Thus, the caller can easily implement any desired class-based postprocessing.

3.2 Evaluation

One way to evaluate wikification is to compare the set of annotations with a manually annotated gold standard for the same document(s). Performance can then be measured using metrics from information retrieval, such as precision, recall, and the F1-measure, which is defined as the harmonic mean of precision and recall. We used a manually annotated set of 1393 news articles that was made available by the authors of the AIDA system and was originally used in their experiments [2]. This manually annotated dataset excludes, by design, any annotations that do not correspond to named entities. Since our wikifier does not by default distinguish between named entities and other Wikipedia concepts, we have explicitly excluded non-entity concepts (based on their class membership in the WikiData ontology) from the output of our Wikifier for the purposes of this experiment. In addition to our wikifier, we obtained annotations from the following systems: AIDA [2], Waikato Wikipedia Miner [6], Babelfy [7], Illinois [8], and DbPedia Spotlight [9].

           Gold   JSI    AIDA   Waikato  Babelfy  Illinois  Spotlight
Gold       1.000  0.593  0.723  0.372    0.323    0.476     0.279
JSI               1.000  0.625  0.527    0.431    0.489     0.363
AIDA                     1.000  0.372    0.352    0.434     0.356
Waikato                         1.000    0.481    0.564     0.474
Babelfy                                  1.000    0.434     0.356
Illinois                                          1.000     0.376
Spotlight                                                   1.000

Table 1: F1 measure of agreement between the various wikifiers and the gold standard.

Table 1 shows the agreement not only between each of the wikifiers and the gold standard, but also between each pair of wikifiers (the lower left triangle of the matrix is left empty as it would be just a copy of the upper right triangle, since the F1-measure is symmetric). As this experiment indicates, our wikifier ("JSI" in the table) performs slightly worse than AIDA but significantly better than the other wikifiers. Furthermore, it turns out that there is relatively little agreement between the different wikifiers, which indicates that wikification itself is in some sense a vaguely defined task where different people can have very different ideas about whether a particular Wikipedia concept is relevant to a particular input document (and should therefore be included as an annotation) or not, which types of Wikipedia concepts can be considered as annotations (e.g. only named entities or all concepts), etc. Possibly the level of agreement could be improved by fine-tuning the settings of the various wikifiers; in the experiment described above, default settings were used.
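Reading each wikifier's output as a set of (document, concept) annotations, the pairwise agreement in Table 1 can be computed as below; this is a small sketch of the standard metric, not the authors' evaluation harness:

```python
def f1_agreement(annotations_a, annotations_b):
    """Symmetric F1 between two sets of (document, concept) annotations."""
    a, b = set(annotations_a), set(annotations_b)
    if not a or not b:
        return 0.0
    precision = len(a & b) / len(a)  # fraction of A's annotations found in B
    recall = len(a & b) / len(b)     # fraction of B's annotations found in A
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```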
4 CONCLUSIONS AND FUTURE WORK

We have presented a practical and efficient approach to wikification that requires no external data except the Wikipedia itself, that can deal with documents in any language for which the Wikipedia is available, and that is suitable for a high-performance, parallelized implementation.

The approach presented here could be improved along several directions. One significant weakness of the current approach concerns the treatment of minority languages. When dealing with a document in a certain language, we need hyperlinks whose anchor text is in the same language if we are to identify mentions in this input document. Thus, if the document is in a language for which the Wikipedia is not available at all, it cannot be wikified using this approach; and similarly, if the Wikipedia is available in this language but is small, with a small amount of text, a low number of pages, and generally poor coverage, the performance of wikification based on it will be low. One idea to alleviate this problem would be to optionally allow a second stage of processing, in which Wikipedias in languages other than the language of the input document would also be used to identify mentions and provide candidate annotations. This might improve coverage especially of concepts that are referred to by the same words or phrases across multiple languages, as is the case with some types of named entities. For the purposes of pagerank-based disambiguation in this second stage, a large common link-graph would have to be constructed by merging the link-graphs of the Wikipedias for different languages. This can be done by using the cross-language links which are available in the WikiData ontology, providing information about when different pages in different languages refer to the same concept.

Another interesting direction for further work would be to try incorporating local disambiguation techniques as a way to augment the current global disambiguation approach. When evaluating whether a mention a in the input document refers to a particular concept c, the local approach would focus on comparing the context of a to either the text of the Wikipedia page for c, or to the contexts in which hyperlinks to c occur within the Wikipedia. Preliminary steps taken in this direction in Sec. 2.5 did not lead to improvements in performance, but this subject is worth exploring further. Instead of the bag-of-words representation of contexts, other vector representations of words could be used, e.g. word2vec [5].

Acknowledgments

This work was supported by the Slovenian Research Agency as well as the euBusinessGraph (ICT-732003-IA) and EW-Shopp (ICT-732590-IA) projects.

References

[1] L. Zhang, A. Rettinger. Final ontological word-sense-disambiguation prototype. Deliverable D3.2.3, xLike Project, October 2014.
[2] J. Hoffart, M. A. Yosef, I. Bordino, et al. Robust disambiguation of named entities in text. Proc. of the 2011 Conf. on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, 2011, pp. 782–792.
[3] M. Trampuš, B. Novak. Internals of an aggregated web news feed. Proc. SiKDD 2012.
[4] G. Leban, B. Fortuna, J. Brank, M. Grobelnik. Event registry: Learning about world events from news. Proc. of the 23rd Int. Conf. on the World Wide Web (WWW 2014), pp. 107–110.
[5] T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient estimation of word representations in vector space. Arxiv.org, 2013.
[6] D. Milne, I. H. Witten. An open-source toolkit for mining Wikipedia. Artificial Intelligence, 194:222–239 (January 2013).
[7] A. Moro, A. Raganato, R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Trans. of the Assoc. for Comp. Linguistics, 2:231–234 (2014).
[8] L. Ratinov, D. Roth, D. Downey, M. Anderson. Local and global algorithms for disambiguation to Wikipedia. Proc. of the 49th Annual Meeting of the Assoc. for Comp. Linguistics: Human Language Technologies (2011), pp. 1375–1384.
[9] J. Daiber, M. Jakob, C. Hokamp, P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. Proc. of the 9th Int. Conf. on Semantic Systems, 2013.
Impact of News Events on the Financial Markets

Miha Torkar
Jožef Stefan International Postgraduate School and Artificial Intelligence Laboratory, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
miha.torkar@ijs.si

Dunja Mladenič
Jožef Stefan International Postgraduate School and Artificial Intelligence Laboratory, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
dunja.mladenic@ijs.si

ABSTRACT

In this work we investigate how news events can be used to predict the financial markets. Namely, we built a time series model that includes features obtained from the news and investigated whether the changes in the volume of traded shares can be predicted more accurately with this information. The time series model that was built is of an ARMA-GARCH type, because we wanted to account for any clustering of the volatility that is normal for the financial markets. The models were evaluated with the Akaike and Bayesian Information Criterion, while also being compared to the baseline model that did not include any features from the news. Overall, our results show that there is an improvement in the model when the information from the news is used, and hence they show a promising avenue for future research work.

1. INTRODUCTION

The predictability of future movements of financial markets is a well researched area in the literature and offers many interesting theories. Financial institutions and investors have been using various sources of information in order to increase the accuracy of their predictions and consequently outperform others. There are several approaches to building the best predictive model, but in general we can divide them into two categories: technical analysis and fundamental analysis. On one side we have models that are based on the historical market data and assume that the past movements will repeat themselves. This is the so-called technical analysis approach to modelling markets; it assumes that an experienced observer can detect the repetition of a pattern from a graph of market data.
The effectiveness of this approach was tested by [9]. This approach however does not offer any reasons why market movements would repeat, and in order to incorporate more fundamental beliefs, a second approach named fundamental analysis was developed. These models use data that is available from multiple sources, ranging from a company's balance sheet data and financial market data like the company's index, through financial data about government activities, to data about political or geographical circumstances presented in the news.

From the variety of sources of fundamental data, we will focus on how news can be used in modelling financial markets. There have been various studies on this topic, which differ by the extent of the analysis of the textual data describing a news story. Initially, research focused on the impact of the frequency of news stories on the market movements (see [6] or [7]). In these papers the authors found some correlation between an increased number of published news stories and larger market movements. To extend this approach, researchers also analysed the content of the news stories, which led to determining the sentiment of the news articles and consequently determining the impact on the market on the basis of whether it is positive (upward trend predicted) or vice versa (see [3] or [4]). Our approach is similar to the second one, but instead of determining sentiment for each news event, we use the effect that past similar events had on the market as a proxy for the impact of the current event.

We define an event as a collection of news articles from different sources about the same story in the news. From this collection of articles we are able to extract the topic, date, location, social score (how trending this event was on social media) and all of the entities involved in the event, which adds to the complexity of our dataset. One possible source of such a dataset is a system called Event Registry (see [5]), which automatically extracts events from news articles. Using this type of data source is novel for this research area and hence serves as an additional contribution of this work.

In order to see the impact that these events had on the financial market, we looked at how the volume of traded shares changes on the days of the event. We obtained several features from the news events and checked whether they would allow us to improve the time series model for the volume change.

2. DATA

For the historical market data of the company, we collected the following values (prices): Open, High, Low, Close, Volume. In addition to the market values of a company we also used the value of the market volatility index VIX (closing price). Secondly, with the use of the Event Registry system we obtained all of the news events relating to the company. The general description above was intentional, as it can be applied to any publicly listed company, but for our specific example we will use data about the investment bank Goldman Sachs (GS). In Figure 1 we present the dataset we will be modelling and predicting, where it can be seen that there are large spikes at certain time periods. Moreover, the volume change graph demonstrates that we have clusters of periods with high volatility.

[Figure 1: Volume Dataset. Two panels showing the Volume series and the Volume Change series for Goldman Sachs from December 2013 to November 2016.]

2.1 Data Description

Our dataset spanned from 2.12.2013 to 30.12.2016, which offered us 777 trading days on which we were able to collect historical market data. On the other hand, the number of news events that occurred in that time period was significantly higher. When we singled out events using 50 as the relevance threshold we obtained 4336 events (details of how Event Registry calculates the weights of concepts in the event are presented in [5]).

In order to reduce noise in the dataset of events we selected only the most important ones. Namely, we set a limit which determined the lower boundary for the relevance of the events to the given company; if an event was not relevant enough we discarded it. This naturally raises the issue of selecting the appropriate boundary for the relevance of the events, and after testing several values we selected 89 out of 100. After this we were left with 424 events.

What should be noted here is that many events occurred on days when the markets were closed (weekends, holidays), so they had to be linked with the next possible trading day that followed. Another issue was that in many cases multiple events occurred on the same day, and hence we were not able to isolate the effect of a single event, but rather looked at the average effect of all the events on a specific day. Therefore we are also relying on the fact that there are no duplicates in our dataset.

For testing purposes we split our dataset into two parts, where we allocated 757 observations for training and testing, while we used the remaining 21 observations for the out-of-sample prediction.

For each event we also obtained the events related to it. These related events are obtained by computing the TF-IDF weights on the concepts present in the event; then, using the cosine similarity measure, other events with similar concept weights are found (see [5]). These past similar events formed a crucial dataset, because we could link them to the market movements and deduce their impact. On average each event had between 5 and 17 similar events in our dataset.
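The similar-events lookup described above can be sketched with standard tools; this is our own illustration (Event Registry computes this internally), with hypothetical concept strings standing in for real event profiles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each event is reduced to a space-separated string of its concept labels;
# these toy events are hypothetical stand-ins for Event Registry output.
events = [
    "Goldman_Sachs earnings bond_trading estimates",
    "Goldman_Sachs profit quarterly_results trading",
    "Federal_Reserve interest_rates monetary_policy",
]
tfidf = TfidfVectorizer().fit_transform(events)
sims = cosine_similarity(tfidf)  # sims[i, j] = similarity of events i and j

# Events most similar to event 0, excluding the event itself:
ranking = [j for j in sims[0].argsort()[::-1] if j != 0]
print(ranking)
```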
2.2 Data preprocessing

In order to measure the impact of an event we will be considering the change in volume that occurs between the closing value of today and the day after the event. Specifically, we will be analysing and predicting the value of

(Volume Change)  VC_t = (V_{t+1} − V_t) / V_t ,  (1)

where V_t is the value of the volume at time t. In this formulation we used future values, so that we can observe the impact an event has on future volume. It should be noted that this value is calculated in the same manner as the returns of shares, which we will also use in our analysis. The formula is identical except that we replace the values of volume with those of the closing price P_t. Hence we write

(Returns)  r_t = (P_{t+1} − P_t) / P_t .  (2)

Changes in the volatility index VIX were calculated with the same formula. Additionally, we also added rolling 5- and 10-day moving averages of the volume change to the feature set, so that we would have another measure of the impact an event has on value. Hence the complete list of the features that were obtained from the stock market is:

• Returns
• Volume Change
• Open
• High
• Low
• Close
• Volume
• VIX Close
• VIX Change
• Rolling mean 5 days
• Rolling mean 10 days
• Rolling EMA 5 days
• Rolling EMA 10 days
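A small pandas sketch of this preprocessing (our own; the column names are assumptions, and equations (1)-(2) are implemented with a forward shift):

```python
import pandas as pd

def market_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: daily rows with columns Open, High, Low, Close, Volume, VIX_Close."""
    out = df.copy()
    # (1) VC_t = (V_{t+1} - V_t) / V_t and (2) r_t = (P_{t+1} - P_t) / P_t
    out["VolumeChange"] = (out["Volume"].shift(-1) - out["Volume"]) / out["Volume"]
    out["Returns"] = (out["Close"].shift(-1) - out["Close"]) / out["Close"]
    out["VIXChange"] = (out["VIX_Close"].shift(-1) - out["VIX_Close"]) / out["VIX_Close"]
    # rolling means and exponential moving averages of the volume change
    out["RollMean5"] = out["VolumeChange"].rolling(5).mean()
    out["RollMean10"] = out["VolumeChange"].rolling(10).mean()
    out["RollEMA5"] = out["VolumeChange"].ewm(span=5).mean()
    out["RollEMA10"] = out["VolumeChange"].ewm(span=10).mean()
    return out
```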
3. METHODS

3.1 Predictions from Similar Events

In order to build our time series model, which uses data from news events, we examined the effects of past similar events on the market for each new event separately. This was done by first extracting the market information on the days of the similar events and combining it with the additional information about the event (social score, relevance to the company, correlation to the current event, number of articles).

An illustration of this process for an event E, "Goldman Profit Rises 74% as Bond Trading Beats Estimates", with two similar events (SE) is presented in Figure 2. On the two days of SE1 and SE2 we collect all of the available stock market values for the company and the index VIX. It should also be noted that the time difference between an event and its past similar events is not limited or predetermined. In this case we have events that are years apart.

[Figure 2: Similar Events]

Once the dataset of market impacts from past similar events was obtained, an Ordinary Least Squares (OLS) model was fit to it. In Figure 2 these would be all of the market variables available at times t_{i−k} and t_{i−j}. This dataset then allowed us to use data from the market and the current event to form a prediction of what the future market movement could be. This procedure was done for both volume change and returns. Several choices of the external regressors used in the OLS were tested, where the main issue arose when there were fewer similar events than features to be fit. Namely, we fitted two models, one with all of the market information available and one where only Returns, VIX Change and concept weight were used to predict Volume Change. Similarly, two models were fit to predict Returns, where Volume Change, VIX Change and concept weights were used as features in the second one. In addition, we also calculated the average of the volume changes and returns from the dates of similar events and used it as a feature. In order to take into account the correlation to the similar events, we created an extra feature that represented the average of the volume changes and returns multiplied by the normalized correlation (correlation/100). All together, the following features were added to the model (all of which are predictions from similar events):

• Number of Events that day
• Returns (all regressors)
• Returns (selected regressors)
• Volume Change (all regressors)
• Volume Change (selected regressors)
• Average Returns
• Average Returns × Correlation
• Average Volume Change
• Average Volume Change × Correlation

With the OLS model trained on the dataset of similar events, we were then able to predict the impact of the original event on the volume change. It should also be noted that the OLS model was selected in order to avoid overfitting an ARMA-type model. The predictions obtained in this way were then used in the next step, when we were building our regression model.

3.2 ARMA-GARCH MODEL

The main model in our analysis will be of the ARMA-GARCH type, because this formulation allows us to capture the effects of values from previous periods and account for clustering in volatility.

The ARMA model is in its general form specified as:

X_t = μ_t + Z_t,
μ_t = Σ_{i=0..p} α_i X_{t−i} + Σ_{j=1..q} ω_j Z_{t−j},  (3)
Z_t = σ ε_t, ε ∼ N(0, 1) ⇒ Z_t ∼ N(0, σ²),

where X_t is the target variable at time t, μ_t is the equation for the mean at time t, ε ∼ N(0, 1) is an iid normally distributed noise term and σ² is the variance. The first step in modelling is to determine the best values for (p, q) according to the evaluation criteria.

The above model however assumes that the variance σ² is constant. This assumption is dropped due to the clustering in the volume change (periods of high changes are followed by lower ones). This type of model is called a Generalised Autoregressive Conditional Heteroskedasticity (GARCH) model ([2]); such models have been shown across the literature to improve the fit (see [8]). They allow us to model σ = σ_t by an additional non-linear model. The general formulation of the problem is the following:

σ²_t = β_0 + Σ_{i=1..r} β_i Z²_{t−i} + Σ_{j=1..s} γ_j σ²_{t−j} .  (4)

So the current value of the variance also depends on the previous values of the variance. From this general formulation one has to determine the values of (r, s) that are most suitable for the model. Finally, we can add external regressors (features) to equation 3 as an additional sum term in the first line of the equation.
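The paper does not say which software was used to estimate the model; one common way to approximate an ARMA(2,2)-GARCH(5,1) fit with external regressors in Python is a two-step procedure with statsmodels and the arch package, sketched below (a simplification under that assumption, not the authors' estimation code):

```python
from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model

def fit_arma_garch(y, exog=None, arma=(2, 2), garch=(5, 1)):
    """Two-step approximation of ARMA(p,q)-GARCH(r,s) with external regressors:
    fit the mean equation first, then a GARCH model on its residuals."""
    p, q = arma
    mean_fit = ARIMA(y, exog=exog, order=(p, 0, q)).fit()
    r, s = garch
    # sigma^2_t = beta_0 + sum_i beta_i Z^2_{t-i} + sum_j gamma_j sigma^2_{t-j}
    vol_fit = arch_model(mean_fit.resid, mean="Zero", vol="GARCH",
                         p=r, q=s, rescale=False).fit(disp="off")
    return mean_fit, vol_fit

# y: the volume-change series; exog: a matrix of news-based features.
# mean_fit.aic and vol_fit.aic can then be inspected for model selection.
```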
3.3 Evaluation Criteria

In order to assess which model was most suitable for the given dataset, we used a variety of different tests. To be able to assume that the time series is stationary, we first ran the Augmented Dickey-Fuller test and the KPSS test; the p-values were below and above the 5% threshold, respectively. To differentiate between the various models, we analysed the performance of each model using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are the classical information criteria used across the literature and which also penalize a model for high complexity ([1]). Finally, in order to determine the significance of the features, we also built a model without the predictions from the similar news events. This serves as our baseline model, and the difference in performance is then the main measure of how significant these features are.

4. RESULTS

Our first step in determining which features are significant for our analysis was running a t-test selection procedure for all regressors. From this analysis it was determined that only the following features were relevant for our model:

• VIX Close price
• Rolling mean 5 days
• Rolling mean 10 days
• Rolling EMA 5 days
• Rolling EMA 10 days
• Prediction of Returns

Hence our first finding was that the predictions we obtained from the similar events were significant; however, instead of the predicted volume changes, the predicted returns turned out to be more significant for our model.

With this set of regressors the best ARMA model was of order (2,2), so two lagged terms in both variables were included. This model performed according to the evaluation criteria shown in Table 1. As mentioned before, in order to capture clustering in the dataset we modelled the variance term further with a GARCH-type model. Again the same evaluation criteria were used, and a grid search was performed to find the best coefficients (r, s) for the order of the model. This resulted in a GARCH(5,1) model, with the evaluation criteria presented in the same Table 1.

Comparison of the results shows what improvements have been made by including the feature; the table also serves as a comparison to the baseline model. Since the AIC is merely a heuristic, the differences between its values are what matter. The predicted feature from similar events improves the AIC value by 10.22, while the improvement of our final model is 32.69. One way of interpreting this difference is in terms of relative likelihood, which is defined as exp((AIC_min − AIC_i)/2) for model i, where AIC_min represents the lowest AIC value among all models. Hence the baseline model is 7.97 × 10⁻⁸ times as probable as the best model to minimize the information loss, while the ARMA(2,2) model with the feature is 1.06 × 10⁻⁵ times as probable. However, we can see that due to the additional complexity of our final model, the BIC criterion is actually the highest for our chosen model. Hence an optimisation by the BIC criterion would result in a different model.

Model                     AIC      BIC
Baseline                 -925.57  -874.66
ARMA(2,2)                -935.35  -879.81
ARMA(2,2)-GARCH(5,1)     -958.26  -869.96

Table 1: Evaluation criteria, train set
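As a quick check of the relative-likelihood arithmetic quoted above (our own verification of the paper's numbers):

```python
import math

aic = {"Baseline": -925.57, "ARMA(2,2)": -935.35,
       "ARMA(2,2)-GARCH(5,1)": -958.26}
aic_min = min(aic.values())

for name, value in aic.items():
    # relative likelihood of model i: exp((AIC_min - AIC_i) / 2)
    print(name, math.exp((aic_min - value) / 2))
# Baseline -> 7.97e-08, ARMA(2,2) -> 1.06e-05, best model -> 1.0
```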
It should be noted that we also tested the assumption about the distribution of the standardized residuals. Instead of the normal distribution, we fitted the model with a Student's t distribution, with and without skew. Some minor improvements in the AIC value were obtained, but they were not significant, hence we kept the normal distribution.

The final test included out-of-sample predictions, where our best chosen model was the ARMA(2,2)-GARCH(5,1). The time period for prediction was 1 month (December 2016) with 21 trading days. We also calculated the errors when making these predictions: our best model scored 0.1344944 in Mean Absolute Error and 0.1565652 in Root Mean Squared Error. Figure 3 shows how the model predicted future values, where the middle line represents the prediction of the mean value. Additionally, the plot includes the upper and lower 95% and 80% quantile intervals.

[Figure 3: Out of Sample Forecast. Forecast of 20 days of Volume Change, showing the mean prediction with 50%, 80% and 95% quantile bands.]

5. CONCLUSION
In this paper we investigated one possible approach to determining the impact that news has on financial markets. This is a growing research area, since nowadays any model that wishes to capture market dynamics has to account for the effect of world events. In order to extend previous work in this area, we demonstrated how more complex text data can be used to obtain relevant features for modelling changes in financial markets. Our data consisted of news events that were automatically extracted from news articles. For each event we then collected past similar events and observed how the market reacted to those events. On the basis of these reactions we built various feature vectors that helped us improve our model. The results show that when predicting the change in volume, the predicted returns from similar events served as a useful feature. Additionally, we tested the performance of our improved time series model on an out-of-sample dataset for a period of 1 month. Future work will continue in this direction: we will look for further similarities between news events and possibly obtain new features that could be used for modelling financial markets.

6. ACKNOWLEDGEMENTS
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 675044.

7. REFERENCES
[1] H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle, pages 199–213. Springer New York, New York, NY, 1998.
[2] T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986.
[3] X. Ding, Y. Zhang, T. Liu, and J. Duan. Deep learning for event-driven stock prediction. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 2327–2333. AAAI Press, 2015.
[4] R. Fehrer and S. Feuerriegel. Improving Decision Analytics with Deep Learning: The Case of Financial Disclosures. ArXiv e-prints, August 2015.
[5] G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Event registry: Learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14 Companion, pages 107–110, New York, NY, USA, 2014. ACM.
[6] M. L. Mitchell and J. H. Mulherin. The impact of public information on the stock market. The Journal of Finance, 49(3):923–950, 1994.
[7] D. Shen, W. Zhang, X. Xiong, X. Li, and Y. Zhang. Trading and non-trading period internet information flow and intraday return volatility. Physica A: Statistical Mechanics and its Applications, 451:519–524, 2016.
[8] R. Tsay. An Introduction to Analysis of Financial Data with R. John Wiley & Sons, New Jersey, 2013.
[9] H. Yu, G. V. Nartea, C. Gan, and L. J. Yao. Predictive ability and profitability of simple technical trading rules: Recent evidence from southeast asian stock markets. International Review of Economics & Finance, 25:356–371, 2013.
Challenges in media monitoring of worldwide news sources to support public health

Joao Pita Costa * **, Flavio Fuart *, Marko Grobelnik *, Gregor Leban * and Evgenia Belyaeva ***

Abstract—Real-time global media monitoring is nowadays an essential resource for public health. Multilingual capabilities can enrich this potential, allowing a worldwide overview based on online news sources, blog posts or social media. In this paper we propose research topics related to the exploration of text mining tools used to provide real-time global media monitoring in the context of health. We aim to understand how media can contribute to a better overview of health-related and well-being matters. With it we shall also identify open research questions that motivate further technological development to better fit the needs, interests and workflow of public health professionals.

I. INTRODUCTION
Dealing with media (written online media in particular) has several issues, one of which is the lack of common publishing standards. Another issue is related to the global nature of the service: the mere fact of a system being universal requires a system that can manage a variety of languages, possibly hundreds of them. This creates issues, as today's language technologies can deal only with words and sentences, highlighting the need to bridge the gap from simple textual representation towards semantic representation, where we would want to understand the semantic and conceptual aspects of the textual content and not just the lexical (words and phrases) and syntactical (sentences) ones.

[Fig. 1. The contribution of text mining efforts based on online multi-lingual news to public health raises several research questions. The image shows an ER visualisation module representing a real-time stream of news that permits us to explore the range of some of those questions.]

Considerable research and commercial activity in the past period has led to the development of high-performance online news monitoring systems and accompanying tools, methods and models. For example, a complete cross-lingual news processing pipeline consisting of the following components has achieved good results in both research and commercial usage scenarios: the NewsFeed system [16] to monitor, gather and produce clear-texts of online (HTML) news articles; the Sentiment Detection module [17]; the Enrycher module for text annotation [7]; Wikifier for advanced text categorization [2]; cross-lingual document linking [11]; document clustering [1]; and Event Registry [9], an advanced news visualization and analysis tool.

According to the ECDC (European Centre for Disease Prevention and Control), the objective of epidemic intelligence is to produce timely, validated and actionable intelligence on events related to communicable diseases or of unknown origin that are of interest for public health and health authorities [4]. A similar definition is provided also by the WHO [18]. Part of this effort is to gather unofficial, unstructured and unverified information about those events, which is then later verified and analyzed by public health experts. Several ICT solutions to support these efforts have emerged. The most notable solutions used by authorities are GPHIN [3] (Global Public Health Intelligence Network), developed and operated by the Canadian Government; MedISys [15], developed and operated by the Joint Research Centre of the European Commission; and Healthmap.org, a system developed by Boston Children's Hospital receiving external funding.

All those systems are multi-lingual to some extent, i.e. they monitor news in more than one language.
However, it seems they do not leverage the usefulness of cross-lingual approaches to increase the quality of detected health events. Also, they seem not to use Wikipedia, which is nowadays the biggest freely available knowledge base, to extract meaningful information from news articles. There is a relatively large offer of similar online news monitoring ("clipping") systems available, each of them with a distinctive set of features. The authors, however, are not aware of a monitoring system that would implement as rich a set of cross-lingual features as the online news processing pipeline mentioned above.

In this paper we aim to identify research questions, related to the features highlighted above (cross-linguality, wikification, among others), that can contribute to appropriate software development serving the needs of Public Health professionals. Part of the work presented here was developed in the context of the European Union research project MIDAS, under the programme Horizon 2020.

* Quintelligence, Ljubljana, Slovenia
** University of Rijeka, Croatia
*** Jozef Stefan Institute, Ljubljana, Slovenia
Category: H.4.0 Information Systems Applications: General — reporting, statistical analysis, visualisation, networks
Keywords: Text mining, public health, news, media.

II. COLLECTING THE DATA
Online news media sources represent a reliable and structured near-real-time stream of heterogeneous, multilingual text documents that describe real-world events. A range of services offers the aggregation of both social media and online news, following the uptake of news publishing through the latter channel. Several media news aggregators provide web crawlers for information extraction and media monitoring, aiming for newsworthy stories. In order to extract meaningful information for a particular application domain, automatic event detection mechanisms exist, allowing us to measure the media impact of public health awareness-raising campaigns, health-related news coverage and bias across different outlets. These also include sentiment detection and cross-lingual linking of documents applied to news sources. In the public health domain, such approaches have been used to detect disease outbreaks and other public health threats (e.g. monitoring of international social and sports events, anti-vaccination campaigns, among others).

[Fig. 2. The automatic annotation tool Wikifier used to enrich a WHO article on vaccination, taken as an example. The article is copied to the Text field and has several underlined words corresponding to the identified Wikipedia concepts. When hovering over the D icon on the disease name, we notice that this article is mostly about Yellow Fever, we see that yellow fever is a disease and we see that Kinshasa is a settlement. If we further explore the DBpedia entry, the area, number of inhabitants, country and other information is available.]

Among the available news aggregator and analysis services that provide some level of access to online news streams, we highlight the NewsFeed system (available at newsfeed.ijs.si). It is a real-time aggregated stream of semantically enriched news articles tracking RSS-enabled global media sources worldwide. In particular, NewsFeed currently monitors around 900,000 RSS news feeds (~800,000 web sites) and collects between 350,000 and 600,000 articles per day, with an article archive available since May 2008. Using the current API it is possible to get annotated articles from June 2013 onwards. Currently, about 50% of all articles are in English. All languages present in Wikipedia are included, but they are covered to differing degrees in terms of quality, volume and extent of the analysis performed [8]. For research purposes, the authors were granted access to the NewsFeed system. The data is accessed through an HTTP API and the result is provided in XML format. Additional metadata can be obtained through the API: named entities, concepts, categories, mentioned places and similar [14]. Furthermore, this technology could be used for additional data annotation and analysis tailored to public health applications (e.g. the annotation with Wikifier discussed later in Section III).
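As an illustration of how such an HTTP/XML feed might be consumed, consider the sketch below. The endpoint path, query parameters and record layout are hypothetical (the exact NewsFeed API is not specified in the text and access requires permission); only the general pattern of fetching and parsing XML article records is shown.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameters -- the real NewsFeed API
# (newsfeed.ijs.si) requires granted access and may differ.
URL = "http://newsfeed.ijs.si/stream?after=2013-06-01&format=xml"

with urllib.request.urlopen(URL) as response:
    tree = ET.parse(response)

# Assumed record layout: <article> elements carrying language and title.
for article in tree.getroot().iter("article"):
    lang = article.get("lang", "unknown")
    title = article.findtext("title", default="")
    print(lang, title[:80])
```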
The news monitoring software Event Registry (ER) (available at www.eventregistry.org) feeds on NewsFeed, tracking over 100,000 global media sources in near-real-time, operating across 100 languages, and aggregating global media content in a semantically meaningful way. It collects over 300,000 news articles on average per day and arranges them into events; events are further connected into story-lines, which enable the tracking of evolving topics [9]. Since ER covers most of the global media reporting on any topic, it can also be used to track topics like health and well-being on different levels of resolution, from small local issues up to higher-level country issues and global trends.

III. AUTOMATIC ANNOTATION
The complexity of entity identification makes automatic multilingual text analysis a difficult task. Wikifier (available at wikifier.ijs.si) takes advantage of Wikipedia, the biggest open, online and up-to-date knowledge repository on the internet, to annotate text and link it to relevant knowledge resources. Wikifier allows for annotating large quantities of free text in a very short time. The type of analysis provided permits us to identify trends (e.g. health-related lifestyles) or ask questions like: provide me all texts (articles) that are about vaccination campaigns in Africa in cities with a population of more than 2 million people. Figure 2 shows the output of the automatic annotation of the abstract of a recent article on cholera. The top ten annotated concepts include "Vibrio cholerae", "Fresh water" and "Bacteria".

Since Wikifier is constructed over the knowledge base of Wikipedia, it is essential that the Wikipedia coverage of health topics is of high quality, which is itself an open question. Another open question is whether an extension of Wikipedia can capture all of the concepts covered by MeSH.
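A sketch of calling such an annotation service follows. The endpoint and parameter names (`text`, `lang`, `userKey`) are assumptions based on the public Wikifier web service and may differ from its current interface; a registered user key is required.

```python
import json
import urllib.parse
import urllib.request

def wikify(text, lang="en", user_key="YOUR_KEY"):
    """Annotate free text with Wikipedia concepts via the Wikifier service.

    Endpoint and parameter names are assumptions about the public service;
    consult the documentation at wikifier.ijs.si before use.
    """
    data = urllib.parse.urlencode(
        {"text": text, "lang": lang, "userKey": user_key}
    ).encode("utf-8")
    req = urllib.request.Request("http://www.wikifier.org/annotate-article", data=data)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example: annotate a sentence from a health article.
result = wikify("Yellow fever cases were confirmed in Kinshasa.")
for annotation in result.get("annotations", []):
    print(annotation.get("title"), annotation.get("url"))
```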
IV. FROM GLOBAL TO LOCAL MEDIA MONITORING
In the context of Public Health, a graphical dashboard can contribute to the real-time monitoring of global health by allowing continuous observation of medical and well-being issues, limiting the presentation to articles/items related to those topics (a possible health dashboard is available at [6]). Such a dashboard presents the incoming multi-lingual health-related media content published anywhere in the world. This allows the health professional to have an overview of what is happening globally at any given moment in time (cf. [5]). It is a convenient way to observe global health by using the ER system introduced above in Section II. Its key feature is the ability to observe health issues across many languages and in temporal detail, over a variety of scales, which is where most other systems have difficulties [10]. The image in Figure 1 shows a snapshot of the dashboard with health-related events on July 20th, 2017.

The event of the Zika outbreak, for instance, is identified immediately after the collection of the news articles that report about it. With it, the health professional can explore the evolution of the news publishers' awareness of the epidemic over time by looking at the related news articles represented on a world map, as they were identified or updated during a selected period of time. ER can find articles and events related to a particular entity, topic, date, location or category, as well as measure their impact on social media (in particular, Twitter). Moreover, its cross-lingual capabilities allow it to consider events where the news appears only in, e.g., Chinese or some other language, where the news never reached the English-speaking space. This is of particular interest when considering the monitoring of rare diseases worldwide.

Another perspective is the micro view of a particular health-related event happening somewhere in the world. The event of "lead poisoning" in Nigeria on May 12th, 2015, which went mostly unnoticed at a global level, is an example that can be explored through ER. With a simple query to the system, the health professional can extract all the reports related to that event, and check the event in the context of other related events. To illustrate this, the screenshots in Figures 3 and 4 show the event itself and the story-line developed after the event, respectively.

[Fig. 3. ER screenshot showing the event of "lead poisoning" in Nigeria on May 12, 2015. It includes date, location, categories, number of languages covered and social media (Twitter) count.]
[Fig. 4. ER screenshot showing the story-line developed after the event of "lead poisoning" in Nigeria on May 12, 2015. Below the event it shows a timeline of related news.]

V. HISTORICAL PERSPECTIVE
Another view relevant to any health-related issue is a historical perspective based on the aggregation of a particular topic. The evolution of Ebola virus reporting from 2014 until 2017 is an example that can be explored through ER. Querying ER for Ebola virus disease returns over 20,000 events related to Ebola appearing after 2014. The content (over 200,000 news articles) can be analysed through several visual modules: temporal intensity, geographical spread and topical spread, among others. Figures 5 and 6 illustrate the temporal intensity (with the peak in October 2014) and the geographical spread (West Africa and the US) of Ebola-related events. We highlight that ER was able to notice the outbreak of the epidemic before it was made public. However, the question was not asked at the time, i.e. the query was not issued, because Ebola was not yet an identified issue and there was a lack of continuous attention; on the other hand, complete attention to all such epidemic-related topics could be a heavy burden on the system. This raises the research problem of predicting and alerting on high-density events like this one.

[Fig. 5. A temporal intensity visualisation in ER for the query 'Ebola virus disease', showing that the system could notice the outbreak of the epidemic before it was made public.]
[Fig. 6. ER screenshot showing a geographical spread visualisation module for the query 'Ebola virus disease'. Each mark shows the number of related news per location on the map.]
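For reproducibility, a query like the Ebola one above can be issued through Event Registry's Python SDK. The sketch below assumes the `eventregistry` package and an API key; class and method names follow that SDK's public interface but should be checked against its current documentation.

```python
from eventregistry import EventRegistry, QueryEvents, RequestEventsInfo

# An API key from eventregistry.org is assumed.
er = EventRegistry(apiKey="YOUR_API_KEY")

# Ask for events mentioning the Ebola virus disease concept.
query = QueryEvents(conceptUri=er.getConceptUri("Ebola virus disease"))
query.setRequestedResult(RequestEventsInfo(sortBy="date", count=20))

response = er.execQuery(query)
for event in response.get("events", {}).get("results", []):
    print(event.get("eventDate"), event.get("title", {}).get("eng"))
```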
ER can find articles and events related to a particular entity, topic, date, location or category, as well as measure their impact on social media (specifically, Twitter). This social media monitoring is complemented by TwitterObservatory (cf. [12]), leveraging in-house technology that uses data observation, enrichment and storage techniques for social media data presentation, search and analytics. Moreover, several successful tests have been done to extract sentiment from news based on the sentiment of the tweets associated with the news [13]. Extracting sentiment directly from news is still an open problem that remains to be tackled.

VI. CONCLUSIONS AND FURTHER WORK
In this paper we discussed the potential of several text mining tools dedicated to exploring worldwide multilingual news, focusing on matters of interest to Public Health. Further research (exploring the potential of NewsFeed, Wikifier and ER) includes: (i) the correlation of high-level concepts with low-level features; (ii) the showcase of hierarchies (e.g. in some disease) and how they can be drilled down to a variety of subtopics (e.g. different aspects of such a disease); (iii) the analysis of the impact of health-related issues on society (e.g. the impact of Ebola news on the adherence to insurance); (iv) the presence of PubMed/Medline in global news; and (v) the prediction of the consequences of a health event. Other research directions consider the dynamics of Public Health/Healthcare institutions where activities happen, affecting: (1) the decision makers, who make choices based on the legislation and on the available in-house monitoring systems operated by their own data scientists, (2) those data scientists, who explore the data and extract relevant information that contributes to the evaluation of Public Health scenarios, and (3) the technical team that deploys and maintains the data infrastructure where the data scientists are active.

[Fig. 7. Kibana screenshot showing different visualisation modules, based on queries to Elasticsearch, composing an interactive dashboard. This is an illustrative example from venzia.es to show the practical potential of this tool.]
It could be very useful to have easy-to-handle data visualisation modules (like the ones offered by Kibana and shown in Figure 7), allowing decision-makers to choose with a few clicks the representation of the data that makes sense for the problems they focus on. The data scientists could then manipulate the current workflows, maximising critical outputs and presenting data in a meaningful way, whilst minimising the resources required to drive data interrogation and presentation.

ACKNOWLEDGMENT
The authors would like to thank the EC Horizon 2020 MIDAS Project, funded by the European Union under grant agreement no. 727721, under the call SC1-PM-18-2016 - Big Data supporting Public Health policies.

REFERENCES
[1] J. Brank, G. Leban, and M. Grobelnik. A high-performance multi-threaded approach for clustering a stream of documents. In Proceedings of the 17th International Multiconference Information Society, 2014.
[2] J. Brank, G. Leban, and M. Grobelnik. Annotating documents with relevant Wikipedia concepts. Proceedings of SiKDD 2017, forthcoming, 2017.
[3] M. Dion, P. AbdelMalik, et al. Big data and the global public health intelligence network (GPHIN). CCDR, Volume 41-9, September 3, 2015: Big Data, 2015.
[4] ECDC. Epidemic intelligence. ecdc.europa.eu/en/threats-and-outbreaks/epidemic-intelligence. Accessed: 2017-09-05.
[5] M. Grobelnik. Observing global health and well-being. http://www.midasproject.eu/2017/07/24/observing-global-health-and-well-being/. MIDAS Project Blog, 2017. Accessed: 6/8/2017.
[6] M. Grobelnik and G. Leban. Event Registry's health panel. https://tinyurl.com/wits2017qnt, 2017. Accessed: 2017-07-25.
[7] M. Grobelnik and D. Mladenić. Simple classification into large topic ontology of web documents. CIT, 13(4):279–285, 2005.
[8] G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Cross-lingual detection of world events from news articles. Proceedings of the 2014 International Conference on Posters and Demonstrations Track, CEUR-WS.org, 1272:21–24, 2014.
[9] G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Event registry: learning about world events from news. Proceedings of the 23rd International Conference on World Wide Web, ACM, 2014.
[10] J. P. Linge, J. Belyaeva, R. Steinberger, M. Gemo, F. Fuart, D. Al-Khudhairy, S. Bucci, R. Yangarber, and E. van der Goot. MediSys: medical information system. In Advanced ICTs for disaster management and threat detection: collaborative and distributed frameworks, pages 131–142, 2010.
[11] A. Muhič, J. Rupnik, and P. Škraba. Cross-lingual document similarity. Proceedings of the ITI 2012 34th International Conference on Information Technology Interfaces (ITI), IEEE, pages 387–392, 2012.
[12] I. Novalija, M. Papler, and D. Mladenić. Towards social media mining: TwitterObservatory. Proceedings of the Slovenian Data Mining and Data Warehouses 2014, 2014.
[13] L. Rei, M. Grobelnik, and D. Mladenić. Event detection in twitter with an event knowledge base. Proceedings of SiKDD 2015, 2015.
[14] J. Rupnik, A. Muhič, G. Leban, P. Škraba, B. Fortuna, and M. Grobelnik. News across languages - cross-lingual document similarity and event tracking. Journal of Artificial Intelligence Research, 55:283–316, 2016.
[15] R. Steinberger, F. Fuart, B. Pouliquen, and E. van der Goot. MedISys: A multilingual media monitoring tool for medical intelligence and early warning. Proceedings of the International Disaster and Risk Conference IDRC Davos 2008 - Short and Extended Abstracts, pages 612–614, 2008.
[16] M. Trampuš and B. Novak. The internals of an aggregated web news feed. Proceedings of the 15th Multiconference on Information Society IS-2012, 2012.
[17] T. Štajner, I. Novalija, and D. Mladenić. Informal multilingual multi-domain sentiment analysis. Informatica, 37(4):373–380, 2013.
[18] WHO. Epidemic intelligence. www.who.int/csr/alertresponse/epidemicintelligence/en/. Accessed: 2017-09-05.
Ontology-based translation memory maintenance

Andraž Repar
Iolar d.o.o., Parmova 51, 1000 Ljubljana, Slovenia, and Jozef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia
repar.andraz@gmail.com

Senja Pollak
Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
senja.pollak@ijs.si

ABSTRACT
In this paper, we explore the use of text mining techniques for translation memory maintenance. Language service providers often have large databases of translations, called translation memories, which have been in use for a long time - leading to a gradual population of the translation memory with content from other domains (e.g. financial content being added to a medical translation memory). To the best of our knowledge, no tools exist that would effectively separate the content of a translation memory according to different domains. Having the ability to extract individual domains from low-quality translation memories could mean a significant benefit to language service providers looking to utilize modern translation methods, such as machine translation and automated terminology management. In the first stage, we used OntoGen, a semi-automatic ontology building tool which uses text mining techniques, to separate the segments in the translation memory according to domains. In the second stage, we wanted to test whether we could use the domains defined in the previous stage to build classification models - effectively using them as class labels in place of the costly and time-consuming manual annotation of segments.

Keywords
translation memory, language service provider, ontology, OntoGen, text classification

1. INTRODUCTION
In the translation industry, language service providers (LSPs) often offer a guarantee to their customers that they will never have to pay twice for the translation of the same text. In order to do so, they have come up with a way of saving and re-using past translations to reduce costs and offer discounts. Starting in the 1970s and 1980s, translation companies began using translation memories, which are essentially databases of bilingual segment pairs (source text – target text) along with some metadata. Whenever a new document is received for translation, it is leveraged against the translation memory for "exact" and "fuzzy" matches, and the results are used to calculate the final price of the translation. This technology really took off in the 1990s, and today virtually every language service provider on the market uses some kind of translation memory to store translations.

In theory, the translation memory concept involves the use of metadata to clearly mark the segments belonging to different domains and/or customers. However, metadata are often not added due to time pressure or other issues, and the information about the domain or customer is lost. Without this information it is difficult to reuse the translation memories for machine translation and/or terminology management. Finally, the quality of a translation memory can also degrade over the years, because segments may get accidentally stored in the wrong translation memory (the domain of the segments is not the same as the domain of the translation memory).

In this paper, we analyze one such translation memory used by the translation company Iolar to see whether we can use text mining techniques to extract domains and clean low-quality translation memories. We used the OntoGen [4] topic ontology editor to separate the dataset into distinct domains, and then used these domains for text classification in Weka [5].
Ontology learning is a well-researched area, with researchers using various techniques such as natural language processing ([10]), machine learning ([13]) and information retrieval ([3]). The same can be said of using machine learning for text classification (for example, [11], [9] and [7]). On the other hand, research into using data mining techniques for translation memory maintenance is scarce, with most authors focusing on spotting low-quality individual segments. Barbu [1] uses several machine learning algorithms to spot false segment pairs in translation memories, Sabet et al. [6] describe a system for unsupervised cleaning of translation memories without labeled training data, based on a configurable and extensible set of filters, and Nahata et al. [8] define a set of rules for a rule-based classifier which is in turn used to find low-quality segment pairs. A more recent topic that serves a similar purpose is quality estimation of machine-translated segments. For example, Specia et al. [12] describe a system that tries to predict the quality of machine-translated segments using machine learning.

2. DATA DESCRIPTION
The translation memory analyzed for this article has been in use for almost 15 years and contains parallel translation segments in English and Slovene. Initially, it was meant to store Marketing, Legal and Financial translations, but over the years various other domains have been stored in it. In addition to the three domains mentioned above, this translation memory also contains a large chunk of IT-related segments, such as user interface strings, user assistance texts and technical documentation of various IT devices (printers, scanners, monitors etc.). Given the content of these documents, we expect to see some overlap between domains – for example, a printer user manual will typically contain some legal information as well as some marketing-like language.

3. EXPERIMENTS
The most obvious way to go about this task would be to manually annotate a dataset from this translation memory and then use it to train a classifier. However, manual annotation is time-consuming and costly, so we first utilize OntoGen [4], a semi-automatic and data-driven ontology editor focused on editing topic ontologies, and then use the resulting ontology topics for building a text classifier that could be used for other translation memories and documents.

[Figure 1: Ontology visualization: 4 main domains are extracted (Financial, Marketing, IT, Legal), with two of them having additional subdomains.]

3.1 Preprocessing
The first step involved extracting the segment pairs and filtering them. The Slovene segment parts were discarded, because only one language is needed for this task; English was chosen because it is the source language in this translation memory. The TMX file contained 247,103 English-Slovene segment pairs. To cut down on the noise and remove the segments most difficult to classify, we decided to remove all segments with fewer than 8 words, leaving us with 121,593 segments.
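A preprocessing step like this can be sketched in a few lines. TMX is an XML dialect, so the sketch below parses the `<tu>` translation units, keeps the English `<seg>` text and drops short segments; the file name is a placeholder, and real TMX exports may differ in namespace or attribute details.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def english_segments(tmx_path, min_words=8):
    """Yield English source segments with at least `min_words` words."""
    tree = ET.parse(tmx_path)
    for tu in tree.getroot().iter("tu"):       # one translation unit per segment pair
        for tuv in tu.iter("tuv"):
            if tuv.get(XML_LANG, "").lower().startswith("en"):
                seg = tuv.findtext("seg", default="")
                if len(seg.split()) >= min_words:
                    yield seg

segments = list(english_segments("memory.tmx"))  # placeholder file name
print(len(segments), "segments kept")
```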
3.2 Ontology creation
The selected segments were saved in the Named Line-Documents format suitable for OntoGen. Given the size of the file, the processing in OntoGen was slow-going. We tried various approaches in OntoGen and finally settled on using the k-means clustering functionality (with k=10) to generate various sets of segments corresponding to different keywords, and then manually grouped them into meaningful domains based on our translation experience with this translation memory.

After experimenting with various ontology building techniques in OntoGen, the following topic ontology was constructed (with the number of documents in parentheses): IT (51,247), subdivided into ITGeneral and User Interface; Marketing (11,567); Financial (12,987); and Legal (42,163), subdivided into Contracts, Tenders and IT Legal¹. A graphical representation is shown in Figure 1.

¹ This group contains segments from privacy policies and license agreements of various software applications.

3.3 Classification
In the final step we exported the domains from OntoGen, attached them to their corresponding segments and loaded the data into the Weka machine learning toolkit. We tested various machine learning classification algorithms (Naïve Bayes Multinomial, SVM, J48) to find which one gives the best results. 10-fold cross-validation was used for all experiments. We applied Weka's StringToWordVector filter and used a stoplist (the 300 most frequent words from the BNC corpus [2]) to filter out the most common words.

In the first phase, we had both topics and subtopics – where a subtopic existed, we glued the topic and subtopic together to get a distinct class. This means we had 7 distinct classes: ITUserInterface, ITITGeneral, Financial, Marketing, LegalContracts, LegalTenders, LegalITLegal. In the second phase, we used only the main topics, meaning that 4 classes were used: IT, Financial, Marketing, Legal.

Because the original dataset was fairly large (more than 100,000 segments), we had to significantly reduce it in order to be able to complete the calculations in Weka in reasonable time. However, we couldn't just take the first n segments, because the different topics were not uniformly distributed across the dataset. Therefore, we took every 10th segment, leaving us with a dataset of about 10,000 segments. Tables 1 and 2 contain information about the performance of the three classifiers; for a detailed analysis see Section 4.2.
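The paper runs this step in Weka; purely as an illustration, an equivalent experiment can be sketched with scikit-learn. The stoplist, the every-10th-segment subsampling and the 10-fold cross-validation mirror the description above, while variable names and the stoplist file are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# `segments` and `labels` are assumed: one OntoGen domain label per segment.
sampled_X = segments[::10]               # every 10th segment, as in the paper
sampled_y = labels[::10]

stoplist = open("bnc_top300.txt").read().split()  # placeholder for the BNC stoplist

model = make_pipeline(
    TfidfVectorizer(stop_words=stoplist),  # roughly analogous to StringToWordVector
    MultinomialNB(),                       # Naive Bayes Multinomial
)
scores = cross_val_score(model, sampled_X, sampled_y, cv=10, scoring="accuracy")
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```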
Table 1: Classifier performance with 7 labels (accuracy of the ZeroR classifier for the majority class = 0.349)
              J48     SMO     NB Multinomial
Accuracy      0.511   0.495   0.583
Precision     0.507   0.483   0.581
Recall        0.511   0.495   0.583
F-measure     0.472   0.483   0.580

Table 2: Classifier performance with 4 labels (accuracy of the ZeroR classifier for the majority class = 0.406)
              J48     SMO     NB Multinomial
Accuracy      0.597   0.619   0.671
Precision     0.615   0.608   0.678
Recall        0.597   0.619   0.671
F-measure     0.576   0.610   0.673

4. EVALUATION AND INTERPRETATION OF RESULTS
When evaluating the results of the hierarchical clustering by OntoGen and of the classification, one should bear in mind that in many cases no clear boundaries between the domains exist. This was to be expected, on the one hand due to the short length of the documents, and on the other due to segments that are very difficult to assign to a single domain, for example:

• The system must support operation of the HSM system and the archiving of files even if the file system operates in the Windows cluster.
• The latest Windows operating systems have a firewall built in.

The first sentence comes from a tender document, while the second one comes from an IT user manual. Even for a human annotator this would be a difficult task, and we would most likely see low levels of inter-annotator agreement.

4.1 Ontology creation
To evaluate the results of the ontology creation in OntoGen, we extracted 50 random segments for each of the 7 topics/subtopics, manually annotated them and compared the results (see Table 3). Overall, a precision of 0.81 is quite good, considering that we are working with sentences, which are difficult to classify. It is also important not to lose sight of the fact that there can be some overlap between the topics and that certain sentences cannot be adequately classified into any of the available topics. The overlap between the various topics causes a certain degree of ambiguity, but we believe that the precision is high enough to use the topics extracted in OntoGen as class labels for building a classifier.

Table 3: Manual evaluation of the ontology results on 50 segments per domain
Topic             Precision
Financial         0.76
ITGeneral         0.80
ITUserInterface   0.86
LegalContracts    0.80
LegalITLegal      0.86
LegalTenders      0.78
Marketing         0.80
Average           0.81

4.2 Classification
The results of the classification with 7 labels are not promising. The performance of all classifiers does exceed the majority-class classifier significantly, but the accuracy is not high enough for production use (close to 60% for the best performing classifier). Looking at the confusion matrix in Figure 2, we can observe that the ITGeneral topic overlaps with quite a few other topics and is the largest culprit for the low performance. A significant part of the false positives originate in the ITGeneral topic for all topics apart from Financial and LegalContracts (classes c and e in Figure 2). These two topics also have the highest precision (0.688 and 0.668, respectively).

[Figure 2: Confusion matrix of the Naïve Bayes Multinomial classifier - 7 labels.]

When we focus only on the 4 main labels, the results are better. Naïve Bayes Multinomial is again the best performing classifier, with its accuracy reaching a little over 67%. Looking at the confusion matrix in Figure 3, it is evident that the first 3 labels perform significantly better than the last one. Indeed, the precision of Legal, IT and Financial is around 0.7, while that of Marketing is just a little over 0.4 (for detailed results see Table 4).

[Figure 3: Confusion matrix of the Naïve Bayes Multinomial classifier - 4 labels.]

Table 4: Detailed performance of the Naïve Bayes Multinomial classifier (4 labels)
            Precision  Recall  F-measure
Legal       0.713      0.644   0.677
IT          0.711      0.746   0.728
Financial   0.685      0.646   0.665
Marketing   0.418      0.508   0.459

The largest issue that we have not been able to overcome in this analysis is that a huge chunk of the segments in this dataset are IT-related – this is especially true of the Marketing and certain Legal segments (e.g. terms of use, privacy statements, press releases or advertising material for IT devices) – which means that it is often difficult to differentiate between a Legal/Marketing segment and a regular IT one. This issue is very clearly seen in the confusion matrix in Figure 3. In contrast, the Financial segments have no immediate relation to any IT content, making them a much more distinct category.
5. CONCLUSION AND FUTURE WORK
This paper tries to determine whether text mining techniques can be used to facilitate translation memory maintenance in a language service provider environment. Given the fast-paced nature of work in the translation industry, it is only natural that the quality of translation memories degrades over time. Even if they are perfectly designed, noise will inevitably be introduced, leading to reduced usefulness for other language applications.

At the outset, we had two questions: 1) whether OntoGen can be used to divide the content of a particular low-quality translation memory, and 2) whether the resulting topics can be used as labels to build a classifier for other translation memories and documents. The main reason was to find a shortcut around manual annotation, which is costly and time-consuming.

We successfully managed to build an ontology, but the boundaries between some topics were relatively vague. One reason for this is that we had to deal with sentences – as opposed to larger chunks of text – which are difficult to classify. The second issue was the fact that many of these topics were in fact inter-related, and some of the segments could easily have been classified into more than one domain. In particular, the Legal, IT and Marketing domains are closely related, because a lot of Legal and Marketing segments originated in IT-related translation jobs. One could argue that the IT and Marketing domains could be combined into one category, since there is so much overlap; however, from a strictly translator's point of view it makes sense to have separate categories, because different translation strategies are normally used for marketing (i.e. press releases) and general IT (i.e. user manuals, help articles) translation jobs.

The results of the ontology creation were promising, with manual evaluation (see Table 3) showing that around 4 in 5 strings were assigned a correct label. However, the picture was much less clear when it came to building a classifier. It turned out that the full ontology was too complex for the classification algorithms used in this paper (see Section 4.2). When we used only the four main topics as labels, the results started approaching acceptable levels, with an accuracy of 67% (compared to 0.406 for the majority class). We would still ideally like to see the accuracy break the 75% or 80% barrier.

In its current state, the classifier is not accurate enough to be used in production. However, when there are reasonably clear boundaries between topics in OntoGen, the resulting labels can be used successfully – as is evident from the performance of the Financial label. This is in itself a useful achievement, because there is currently no way to export just the finance-related segments from the translation memory. An obvious route to better classification performance would be to use just those topics that are clearly separated from the other parts of the dataset.

In terms of future work, we will explore text classification on manually annotated high-quality translation memories. Finally, an interesting route would be to utilize domain terminology to enhance highly domain-specific terms, assigning a higher weight to terminological features.
6. REFERENCES
[1] E. Barbu. Spotting false translation segments in translation memories, 2015.
[2] J. H. Clear. The British National Corpus. In The Digital Word, pages 163–187. MIT Press, Cambridge, MA, USA, 1993.
[3] H. Cunningham. Information extraction, automatic. Encyclopedia of Language and Linguistics, 2nd Edition, 5:665–677, 2006.
[4] B. Fortuna, M. Grobelnik, and D. Mladenic. OntoGen: Semi-automatic Ontology Editor, pages 309–318. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.
[6] M. Jalili Sabet, M. Negri, M. Turchi, J. G. C. de Souza, and M. Federico. TMop: a tool for unsupervised translation memory cleaning. In Proceedings of ACL-2016 System Demonstrations, pages 49–54. Association for Computational Linguistics, 2016.
[7] T. Joachims. Text categorization with Support Vector Machines: Learning with many relevant features, pages 137–142. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
[8] N. Nahata, T. Nayak, S. Pal, and S. Naskar. Rule based classifier for translation memory cleaning, May 2016.
[9] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103–134, May 2000.
[10] S. P. Ponzetto and M. Strube. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 2, AAAI'07, pages 1440–1445. AAAI Press, 2007.
[11] F. Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1–47, Mar. 2002.
[12] L. Specia, G. Paetzold, and C. Scarton. Multi-level translation quality prediction with QuEst++. In ACL-IJCNLP 2015 System Demonstrations, pages 115–120, Beijing, China, 2015.
[13] F. M. Suchanek, G. Ifrim, and G. Weikum. Combining linguistic and statistical analysis to extract relations from web documents. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 712–717, New York, NY, USA, 2006. ACM.

Audience Segmentation Based on Topic Profiles

Matic Kladnik, Luka Stopar, Blaz Fortuna, Dunja Mladenić
Jožef Stefan Institute and Jožef Stefan International Postgraduate School
Jamova 39, 1000 Ljubljana, Slovenia
e-mail: matic.kladnik@ijs.si

ABSTRACT
Audience segmentation is often applied on Web portals to gain insights into the audience, support targeted marketing and, in general, provide on-line recommendations to the users. We propose an approach to audience segmentation that is based on using machine learning on topic profiles of the visited content. Our preliminary experiments on a small sample of log data show that the proposed approach is promising, and that the proposed combination of features capturing short-term and long-term user interest gives better results than using only the short-term interests of the user.

1. INTRODUCTION
A large number of returning users regularly visiting the same Web portals offers an opportunity to apply audience modeling, considering descriptions of the visited content, different characteristics of the users and the user behavior. Instead of the usual audience segmentation based on user profiles, which is also commonly adopted in recommendation systems [1], we propose an audience segmentation approach that is based on the topic profiles of the visited content.
As the users potentially have interests in different topics, we allow the same user to occur in several segments.

In this paper, we describe the proposed approach and, on a small sample of real-world log data, test the hypothesis that the proposed combination of features, capturing the content of the recently visited pages and the properties of all the pages visited by the user, improves the quality of the segmentation.

The rest of this paper is structured as follows. Section 2 describes the problem setting and the dataset used in the experiments. The proposed approach is described in Section 3, together with the description of the background knowledge that we have used for mapping from the space of users into the space of topic profiles. Section 4 gives the results of the preliminary experiments, while the conclusions are presented in Section 5.

2. PROBLEM SETTING AND DATA
High-quality Web portals that offer regularly updated content, such as market data, news articles or financial data, attract many loyal users [2]. Today, vendors offer user data obtained by third-party cookies that covers a whole range of user properties, including demographics, interests and geography. The problem that we are addressing is automatic audience segmentation, where vendor data on the users is potentially available in addition to the usual Web log files and the content of the visited pages. In addition, we propose to use background knowledge in the form of a pre-trained machine learning model that classifies Web pages into a predefined custom taxonomy. In this way, each Web page is, based on its textual content, assigned a ranked list of content topics of different granularity. For instance, the assigned topics may be Business, Business/Financial_Services/Medical_Billing, Business/News_and_Media, Society/Issues/Gun_Control.

The dataset that we have used to test the proposed approach was obtained from an international media company. Almost 3,000 Web pages were crawled from the company Web site. The anonymized user data was obtained for more than half a million users that visited the Web site within one selected day. All the considered text is in the English language.

We have pre-processed the data to remove references to Web pages that have limited textual content or non-standard formatting. The Web pages were processed in a standard way to extract the textual content, remove the standard English stop-words and represent each Web page as a bag-of-words (BoW) with TFIDF weights. In addition to the content, each page has metadata, including a set of content labels manually assigned by the editorial team; for instance: brexit, Europe, money, davos, jobs, London, markets. These content labels were historically used to annotate the users visiting the pages. Each user is thus described by a set of properties, including demographics and the content labels of the pages visited over a longer period of time.

3. APPROACH DESCRIPTION
Audience segmentation is commonly based on grouping the users by their common interests and some other properties, e.g., demographic properties and behavioral similarity. However, the same user may have several interests and exhibit different behavior depending on the current focus. This may result in grouping together users that do not have much in common, except that they share some (but not the same) interests with a third user.

Thus we propose an approach to audience segmentation based on the similarity of the topics that the users are interested in. The idea is to view the problem through the topics of the visited Web pages and, based on that, obtain segments of the users.

The architecture of the proposed approach is shown in Figure 1. The whole pipeline consists of several steps (a sketch follows the list):
1. From the log file of the user visits we obtain a list of visited pages (URLs).
2. Using background knowledge in the form of a machine learning model for classifying documents into a predefined custom taxonomy, we assign a ranked list of topics to each URL.
3. For each URL we select one or a few topics with the highest rank and form a collection of URL–Topic pairs.
4. The representation of a topic is based on the list of URLs that were assigned the topic and the list of UserIds of the users that visited those URLs.
5. The topic profiles are processed by a clustering algorithm to obtain segments of the users.

[Figure 1. Architecture of the proposed approach to audience segmentation. The user visits of the Web site are combined with the visited content and the properties of the users to obtain segments of the users based on the topic clusters.]
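The following is a minimal sketch of steps 1-5 under simplifying assumptions: the page classifier is abstracted as a function, topic profiles are represented only by the words of the pages assigned to a topic, and all names are illustrative.

```python
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed inputs: visits as (user_id, url) pairs, page_text maps url -> text,
# and classify(text) returns a ranked list of taxonomy topics.
def build_segments(visits, page_text, classify, n_clusters=50):
    topic_urls, topic_users = defaultdict(set), defaultdict(set)
    for user_id, url in visits:                     # steps 1-3
        for topic in classify(page_text[url])[:1]:  # keep the top-ranked topic
            topic_urls[topic].add(url)
            topic_users[topic].add(user_id)

    # Step 4: represent each topic by the concatenated text of its URLs.
    topics = sorted(topic_urls)
    docs = [" ".join(page_text[u] for u in topic_urls[t]) for t in topics]
    profiles = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # Step 5: cluster topic profiles, then map clusters back to user sets.
    labels = KMeans(n_clusters=n_clusters, n_init=5).fit_predict(profiles)
    segments = defaultdict(set)
    for topic, cluster in zip(topics, labels):
        segments[cluster] |= topic_users[topic]     # segments may overlap
    return segments
```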
As the topic profiles are built around the URLs that were assigned the topics using background knowledge, we separately keep a list of UserIds for each URL (obtained from the log file). When constructing the features for each topic, we combine different data sources:
• the content of the Web pages visited by the users,
• the properties of the visited pages,
• the properties of the users, and
• the information about the visits.

The background knowledge that we have used in the experiments is based on a subset of a large custom taxonomy [2]. The subset was defined by the domain expert from the company running the Web portal and consists of several hundred topics.

We use the DMOZ classifier with the custom taxonomy to classify each Web page into a hierarchical content topic. Pages can be classified into topics on different levels of the hierarchy, where lower levels give a more specific classification. Upper levels also give context to the lower levels of the classification. If we compare the following two topics that mention aerospace in their hierarchy:
• Science/Technology/Aerospace,
• Business/Aerospace_and_Defense/Aeronautical,
we can see that the first content topic is put into the context of Science and Technology, whereas the second is put into the context of Business and Aerospace and Defense. This approach gives us more information about the content topic.

In our experiments, the content of the Web pages is obtained by crawling the Web portal. The user data, including the properties of the visited pages, is obtained from the vendor data.

4. EVALUATION
In the experimental evaluation we combine two sources of data: the URLs from the log file and the user data from the users' history. The two feature sets used for data representation correspond to these two data sources: the bag-of-words from the Web page corresponding to the URL, and the user interest in the form of a collection of content labels of the Web pages visited by the user over a longer period of time (see Table 1).

Table 1: Features used for audience segmentation
Source          Description                               No. of values
Web pages       BoW - Words from the Web page             59,929
User interest   Content labels of the visited Web pages    1,268

To compare the influence of the different feature sources on the obtained segmentation, we use a cluster dispersity measure. Specifically, we measure the weighted average distance between the examples and their centroid, normalized by the average distance to the global mean (i.e. the center of the data). The formula is given below:

    D = \frac{\sum_{i=1}^{k} \frac{n_i}{n} \cdot \frac{1}{n_i} \sum_{j=1}^{n_i} d(\mu_i, x_j)}{\frac{1}{n} \sum_{j=1}^{n} d(\mu, x_j)} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} d(\mu_i, x_j)}{\sum_{j=1}^{n} d(\mu, x_j)}

where D represents the dispersity, d is a distance measure (in our case the cosine distance), n_i and \mu_i are the size and centroid of the i-th cluster, x_j is the j-th example and \mu is the global mean. Intuitively, examples in more compact clusters will lie closer to their centroid and so will contribute less to the dispersity than those in more disperse clusters.
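A direct implementation of the dispersity measure, assuming a dense feature matrix and scipy's cosine distance, could look as follows:

```python
import numpy as np
from scipy.spatial.distance import cosine

def dispersity(X, labels):
    """Dispersity D: sum of cosine distances of examples to their cluster
    centroid, divided by the sum of distances to the global centroid."""
    mu = X.mean(axis=0)
    within = 0.0
    for i in np.unique(labels):
        cluster = X[labels == i]
        centroid = cluster.mean(axis=0)
        within += sum(cosine(x, centroid) for x in cluster)
    total = sum(cosine(x, mu) for x in X)
    return within / total
```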
[Figure 2: Experimental results comparing the relative mean centroid distances for the three data representations (user interests, page content, and user interest + page content) over different numbers of clusters (k ranging from 20 to 200).]

The experiments on clustering topic profiles were performed on three different data representations: one that considers only the content of the visited pages, one that considers only the user interest, and a combination of the two feature sets. We applied k-means clustering, varying the value of the parameter k (the number of clusters) from 20 to 200. As the results of the applied clustering method depend on the random seed used for choosing the initial clusters, we repeated the process five times for each value of k. Figure 2 shows the results of the experiments averaged over the five runs.

We can see that the smallest distance of the topic clustering is obtained when the data is represented only by user interests, which represent a long-term interest of the user on an aggregated level. This can be particularly attributed to the fact that the number of different content labels is much smaller than the number of different words from the page content. Combining user interest (capturing the history of the user) and page content (of the pages visited in the considered log file) gives better results than using only the page content.
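The experimental protocol (k from 20 to 200, five random restarts per k, averaged dispersity) can be sketched as follows, reusing the `dispersity` helper above; `X` stands for the topic-profile feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def averaged_dispersity(X, ks=(20, 50, 100, 200), runs=5):
    """Average the dispersity over several random k-means initializations."""
    results = {}
    for k in ks:
        scores = []
        for seed in range(runs):
            labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
            scores.append(dispersity(X, labels))
        results[k] = float(np.mean(scores))
    return results
```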
Looking at the topic clusters that we have obtained, we can see that similar topics are clustered together (see Table 2). For instance, topics such as Health/Addictions, Business/Chemicals/Wholesale_and_Distribution and Recreation/Drugs are in the same cluster.

Table 2: Illustrative example of some clusters obtained when generating 50 clusters using both feature sets. For a few selected clusters we show the topics that belong to the cluster.
Cluster 1: Business/Biotechnology_and_Pharmaceuticals, Science/Astronomy, Health/Child_Health, Health/Conditions_and_Diseases/Cancer, Health/Conditions_and_Diseases/Immune_Disorders, Health/Conditions_and_Diseases/Infectious_Diseases, Health/Pharmacy
Cluster 2: Recreation/Models
Cluster 3: Home/Personal_Finance
Cluster 4: Health/Addictions, Recreation/Drugs, Business/Chemicals/Wholesale_and_Distribution, Business/Food_and_Related_Products/Beverages, Science/Biology/Bioinformatics, Society/Issues/Gun_Control
Cluster 5: Business/Financial_Services/Venture_Capital/Regional, Business/Transportation_and_Logistics/Bus, Business/Transportation_and_Logistics/Rail, Government/Agencies, Recreation/Autos/Makes_and_Models/Honda, Science/Environment, Science/Environment/Carbon_Cycle, …

To obtain audience segments from the clustering of the topic profiles, we map the topic clusters onto sets of UserIds, based on the user visits of the URLs that are classified into each of the topics in the cluster. In this way we obtain non-mutually-exclusive audience segments. The average number of users per segment is given in Table 3; the average segment size is considerably larger than the median, indicating that the segment sizes are unevenly distributed.

Table 3: Average size of the audience segments in relation to the granularity of the segmentation
No. of segments   Average size   Median size
20                43,592.35      589.5
50                17,436.94      301
100                8,718.47      229.5
200                4,359.235     193

5. CONCLUSION
We have proposed an approach to audience segmentation based on topic profiles of the visited Web pages instead of the commonly used user profiles. The topics are obtained by classifying the visited Web pages into a custom taxonomy. The classification is performed automatically using a pre-trained machine learning model. The topic profiles are formed from the properties of the users visiting the Web pages that are classified into the topic, and from the content of the Web pages.

Preliminary experiments on a small sample of log data show that the proposed approach is promising, grouping together similar topics and, based on that, segmenting the audience into reasonably populated segments. Namely, one of the issues with audience segmentation when users have multiple interests is that many users that are not similar to each other are assigned to the same segment, due to similarity with other users from the same segment. Larger-scale experiments are needed in future work to confirm the results and provide additional insights into the other properties of the users from the same segment, such as demographics, geography and job.

6. ACKNOWLEDGMENTS
This work was partially supported by the Slovenian Research Agency.

References
[1] Adomavicius, G. and Tuzhilin, A. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, June 2005.
[2] Fortuna, B., Fortuna, C., Mladenic, D. Real-time news recommender system. In Proceedings of Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010. Lecture Notes in Computer Science, vol. 6323, pp. 583-586. Springer, Berlin/Heidelberg/New York, 2010.
[3] Grobelnik, M., Brank, J., Mladenic, D., Novak, B., Fortuna, B. Using DMoz for constructing ontology from data stream. In Proceedings of the 28th International Conference on Information Technology Interfaces, June 19-22, 2006, Cavtat/Dubrovnik, Croatia, pp. 439-444. University of Zagreb, SRCE University Computing Centre, 2006.

Building Client's Risk Profile Based on Call Detail Records

Zala Herga¹³, Casey Doyle², Stephen Dipple², Caleb Nasman², Gyorgy Korniss², Boleslaw Szymanski², Janez Brank¹, Jan Rupnik¹, and Dunja Mladenić¹³
¹Jožef Stefan Institute, Jamova 39, Ljubljana, Slovenia
²Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA
³Jožef Stefan Postgraduate School, Jamova 39, Ljubljana, Slovenia
{zala.herga, janez.brank, jan.rupnik, dunja.mladenic}@ijs.si
{doylec3, korniss, nasmancc1, dippls, szymab}@rpi.edu

ABSTRACT
Data collected from mobile phones can be used to uncover underlying social network dynamics and individuals' behavioral patterns. Based on a Call Detail Records dataset, we build a weighted, directed network and analyze its properties. In addition to node-level network measures, we extract an extensive consumption- and mobility-based feature set. We show that the extracted network and consumption features can be used to model an individual's risk profile.

Keywords
Mobile Phone Network, CDR, Supervised learning

1. INTRODUCTION
The Call Detail Records (CDR) dataset is a relatively standard dataset obtained by mobile phone operators. One record in the CDR dataset corresponds to a communication event between two mobile phone users and includes a time stamp, the type of event (call, text), the direction (incoming or outgoing), etc. This data type reveals behavioral patterns that can be used to identify a user's personality [2], spending habits [11] or socioeconomic level [5]. Here, we are interested in using the data to build each client's risk profile; in particular, we attempt to use this data to predict user defaults. To this end, we focus our analysis around whether the client's phone number was blocked at the end of the month (indicating issues potentially related to defaulting behavior), using this information to label clients as good or defaulted. The dataset used is completely anonymised.

The structure of the rest of the paper is as follows: Section 2 presents the characteristics of the network built from the CDR, Section 3 describes the feature extraction, Section 4 presents the probability-of-default models and their evaluation, and Section 5 concludes the paper.
Wherever it is not feasible to use a weighted The Call Detail Records (CDR) dataset is a relatively stan- edge scheme, we create an unweighted graph using a cutoff dard dataset obtained by mobile phone operators. One to define how many outgoing communications from one node record in the CDR dataset corresponds to a communication to another constitutes a connection (ie we use the frequency event between two mobile phone users and includes time to define whether a connection exists at all, and all connec- stamp, type of event (call, text), direction (in- or outgoing) tions are still directed but have equal weight) [5]. Using a etc. This data type reveals behavioral patterns that can be low cutoff introduces a lot of noise into the system and is used to identify user’s personality [2], spending habits [11] less representative of a true social network as many of the or socioeconomic level [5]. Here, we are interested in using edges are too weak to accurately indicate a social connection the data to build each client’s risk profile; in particular, we between two individuals, but choosing too high of a cutoff attempt to use this data to predict user defaults. To this restricts the network and discards potentially valuable data end, we focus our analysis around whether the clients phone connecting nodes and communities together. number was blocked at the end of month (indicating issues potentially related to the defaulting behavior), using this Using these methods, our data set translates to a network data to label clients as good or defaulted. The dataset used with a giant component comprising 99.14% of the it. Of is completely anonymised. course, the number of edges and size of the giant compo- nent decreases quickly when the unweighted cutoff scheme Structure of the rest of the paper is as follows: Section 2 is used (Fig. 1). The size of the giant component decreases presents characteristics of the network built from the CDR, linearly with increases in the cutoff, while the decrease in Section 3 describes feature extraction, Section 4 presents the number of edges levels off as a power law with γ ≈ 0.75. probability of default models and their evaluation and Sec- The degree distribution also changes slightly with the cutoff tion 5 concludes the paper. than without; in both cases the distribution has a fat tail that is well approximated by a power law, but the expo- nent increases with a larger cutoff. For the general case of the weighted edges with no cutoff, the power law tail has an exponent of γ ≈ −4.3 while with a high cutoff such as thirty the power law tail exhibits an exponent of γ ≈ −6 (Fig. 2), both in general agreement with prior work on mo- bile network data [5, 7]. Similarly, the distribution of node strengths (defined as the sum of the weights of its adjacent edges) also exhibits a heavy tailed decay as expected [7]. 27 Figure 1: (a) The size of the giant component G as it decreases linearly with higher cutoff criteria to form an edge between two nodes. (b) The total number of edges in the system decreases as a power law with γ ≈ 0.75. Figure 2: (a) The degree distribution for various cutoff criteria to create an edge between two nodes. As the cutoff increases, the peak of the distribution shifts left until it peaks at zero. (b) The same distributions on a log-log scale to highlight the power law tail of the distributions. Shown here are the two extreme cutoffs tested in order to highlight the increase in the gamma value for higher cutoffs. 
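The construction just described can be sketched in Python (a minimal sketch using networkx; the record format, variable names and cutoff value are illustrative, not taken from the paper):

    # Sketch: build the weighted, directed CDR network and an unweighted variant.
    import networkx as nx

    def build_weighted_graph(cdr_records):
        # cdr_records: iterable of (source_id, target_id) outgoing communication events
        g = nx.DiGraph()
        for src, dst in cdr_records:
            if g.has_edge(src, dst):
                g[src][dst]["weight"] += 1          # frequency of outgoing communication
            else:
                g.add_edge(src, dst, weight=1)
        # Distance between nodes, as defined above: w_avg / w_ij.
        w_avg = sum(w for _, _, w in g.edges(data="weight")) / g.number_of_edges()
        for src, dst, w in g.edges(data="weight"):
            g[src][dst]["distance"] = w_avg / w
        return g

    def apply_cutoff(g, cutoff=5):
        # Unweighted scheme: keep a directed edge only if it carries at least
        # `cutoff` outgoing communications; all kept edges have equal weight.
        h = nx.DiGraph()
        h.add_nodes_from(g.nodes())
        h.add_edges_from((u, v) for u, v, w in g.edges(data="weight") if w >= cutoff)
        return h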
Additionally, we study the distribution of some higher-level node-based measures such as reciprocity [14], which is surprisingly low. Without a cutoff, only 38.65% of all links are reciprocated, while higher cutoffs increase the fraction of reciprocated links up to a maximum of only 41.57% when the cutoff is fifteen. We further measure the reciprocity under the weighted network scheme by defining the weighted reciprocity [12, 13] as $R_{ij} = |w_{ij} - w_{ji}|/(w_{ij} + w_{ji})$. Using this metric, the network shows an average weighted reciprocity of only 0.3235, further indicating the low reciprocity of the network.

Finally, we use node centrality to measure how the nodes position themselves within the communication paths across the network (nodes with high centrality are the most likely to connect communities and are therefore very important to the study of how risk patterns propagate across the network). For this purpose, we utilize the closeness centrality [1, 10, 3], a node-level measurement that uses the shortest paths across the network to identify where nodes lie in the network structure. Specifically, it is a ranking of the distance from the node in question to every other node, defined as $C_C(i) = (N-1)\left(\sum_{j \neq i} d_{ij}\right)^{-1}$, where $C_C$ is the closeness centrality and $d_{ij}$ is the length of the shortest path between nodes $i$ and $j$ (assuming a path exists). Unfortunately, this measure only works on connected graphs, so to analyze the full unconnected graph we also study the harmonic closeness centrality [9, 8], defined instead as $C_H(i) = \frac{1}{n-1}\sum_{j \neq i}\frac{1}{d_{ij}}$. As seen in Fig. 3, the closeness centrality has a tightly grouped, high density of relatively high values. This implies a very well connected graph, such that the shortest path between any two nodes is short. This can further be seen via the harmonic centrality, which also has a relatively low density of low-scoring individuals even with the inclusion of nodes not within the giant component. This implies that even the nodes that are not connected to the giant component tend to form small, tightly connected communities of their own.

Figure 3: Distribution of closeness and harmonic centrality scores across the network, where higher scores indicate a more central position in the network. The closeness centrality only considers nodes in the giant component (and the distribution is therefore only calculated for those nodes). The harmonic centrality includes all nodes in the network.
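Both measures can be computed directly from the graph built above (a minimal sketch; taking unreachable nodes to contribute zero to the harmonic sum is the convention that keeps the measure well defined on disconnected graphs):

    # Sketch: weighted reciprocity and harmonic closeness centrality.
    import networkx as nx

    def weighted_reciprocity(g, i, j):
        # R_ij = |w_ij - w_ji| / (w_ij + w_ji);
        # 0 for perfectly balanced links, 1 for one-directional links.
        w_ij = g[i][j]["weight"] if g.has_edge(i, j) else 0
        w_ji = g[j][i]["weight"] if g.has_edge(j, i) else 0
        return abs(w_ij - w_ji) / (w_ij + w_ji)

    def harmonic_centrality(g, i):
        # C_H(i) = 1/(n-1) * sum over j != i of 1/d_ij; unreachable j contribute 0.
        n = g.number_of_nodes()
        lengths = nx.shortest_path_length(g, source=i)   # hop-count distances
        return sum(1.0 / d for j, d in lengths.items() if j != i) / (n - 1)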
3. FEATURE EXTRACTION
After understanding the network dynamics, our aim was to model individuals' behavioral patterns. For that reason we extracted three types of behavior-related features from the CDR: the individual's consumption, social network and mobility. Some of the features were extracted for different time windows (e.g. per day, per week, per hour of day), separately for incoming and outgoing events and/or separately for each event type (call, text). We also added another, more technical category, which relates to the individual's position in the underlying network. Together, more than 6000 features were extracted. Each category is described in more detail below.

1. Consumption features: These features are related to the individual's usage of the mobile phone. For each individual we extracted the number of all calls, the number of all texts, the total duration of calls, the average duration of calls, and the average time between consecutive events.

2. Social network features: This type of feature focuses on the number of contacts and reciprocated events: the number of unique contacts, the number of contacts with which the individual exchanges on average at least 5 texts per week / 2 calls per week, the number of reciprocated call events, the median time between reciprocated call events, and the median time to answer a text.

3. Mobility features: These features are based on the used BTS tower locations and include the average daily radius of gyration (see the sketch after this list), the average distance traveled per day of the week, the popular cell towers that sum up to 90% of the records, and the average number of unique cell towers used per week.

4. Node-level network measures: These features all rely on the individual's location within the social network built from their usage statistics. The details of these metrics are discussed in Section 2; they represent each node's level of importance to the overall social network as well as how deeply embedded the individual is.
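As an illustration of the mobility category, the radius of gyration can be computed from tower coordinates as follows (a minimal sketch; it uses the common definition as the root-mean-square distance from the centroid of the visited locations, and the planar-coordinate assumption is our simplification):

    # Sketch: daily radius of gyration from the coordinates of visited towers.
    import numpy as np

    def radius_of_gyration(positions):
        # positions: array of shape (n, 2) with (x, y) coordinates of the towers
        # used during one day; RMS distance from the centroid of the visits.
        pts = np.asarray(positions, dtype=float)
        centroid = pts.mean(axis=0)
        return np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean())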
3.1 Geographic analysis
Geographic analysis was performed to help us with the specific goal of building an individual's risk profile. Analyzing geographic features requires a definition of an individual's location that takes into account that most people connect to many different cell towers over the course of a three-month period. For our purposes, we use each user's top two most used cell towers. We assign an individual to both their most used and second most used towers to account for the likelihood that a user will spend large amounts of time both at their residence and at their workplace. From there we analyze the number of people that exhibit default behavior for each tower or district and identify high-risk geographic regions. Based on that analysis we calculated the empirical probability of default for each cell tower. These probabilities were used as two additional features, one for each of the two most commonly used towers of each user.

4. MODELING
Our aim was to model the probability of default for each client based on the extracted phone usage patterns and node-level network measures. We present the results of fitting several linear regression models with varying parameters. The features described in the previous sections were used for modeling, and all features were normalized to the standard score (z-score). We divided our dataset into a train and a test set in a 70:30 ratio.

We started with a linear model (labeled as glm-6 in Figure 4) that was based on only six predictor variables: frequency (corresponding to the node's strength), duration (the sum over the user's call events), degree, harmonic centrality and the two geographic (cell-tower probability-of-default) variables. We chose these features because we believed that they carry much stronger signals than the other 6048 features. The p-value is < 0.01 for all of them, except for duration, which has a p-value of 0.97. Overall, this implies that network measures are good predictors of default behavior.

4.1 Principal Component Analysis
Further, when dealing with the larger set of features (6048), principal component analysis (PCA) was performed on the train set for feature space reduction. PCA is a method that decomposes the feature space into principal components (eigenvectors) and also provides information about how much variance in the data each component explains. Selecting a subset of PCA components reflects a trade-off between 1) model simplicity (we want to include a moderate number of features in our models) and 2) the total variance explained by the component subset. All features were subjected to PCA, except for the 6 features that were used in the glm-6 model described above; those were added to the models in their original (but standardized) form.

We ordered the obtained PCA components by decreasing explained variance of the data. The first component explains 20% of the variance, the second 7%; the first ten components together explain 37%, the first thirty together 42%, and the first five hundred sum up to 66%. We then created two linear models based on reduced feature subsets: the first using 30 PCA components and the second using 500 components (pca-30 and pca-500 in Figure 4). Because many variables in pca-500 have large p-values, we fitted another model that excluded the variables with p-value ≥ 0.5 (pval-05).

4.2 Oversampling
Only about 0.25% of the users in the underlying dataset exhibited default behavior, which makes the dataset very unbalanced. For that reason, we implemented a simple oversampling method on the train set: we multiplied the defaulted users (and their features) by 20. The model using this method, oversampled-20, is also presented in Figure 4. Surprisingly, the oversampled dataset not only does not improve performance, but can be seen to give slightly worse results than the originally unbalanced dataset.

4.3 Evaluation
The model comparison is presented in Table 1. We can see that at the 95% level, recall is high, up to 0.91 for both models based on 500 PCA components. Precision is low for all models due to the unbalanced dataset, but even with that drawback, our models still perform far better than a random model.

Table 1: Recall and precision at the 95% level for each of the models presented in Figure 4.

Model | random | glm-6 | pca-30 | pca-500 | pval-05 | oversampled-20
Recall | 0.05 | 0.13 | 0.79 | 0.90 | 0.91 | 0.86
Precision | 0.003 | 0.007 | 0.042 | 0.049 | 0.049 | 0.046

Figure 4: Prediction results on the test set for the fitted models (false positives vs. false negatives). The y-axis corresponds to the false negatives (clients that were labeled as good but actually defaulted), while the x-axis corresponds to the false positives (good clients labeled as defaulted). Results are shown for probability thresholds 0-1 with step 0.01. pca-500 (purple) is to a great extent covered by pval-05 (green), since both models give very similar results.
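The modeling pipeline of this section can be sketched as follows (a minimal scikit-learn sketch; logistic regression stands in for the paper's linear models, and the variable names are illustrative):

    # Sketch: z-score normalization, PCA on the bulk features, and a linear model.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    def fit_default_model(X_bulk, X_six, y, n_components=30):
        # X_bulk: the ~6048 extracted features; X_six: the six glm-6 predictors,
        # kept in their original (standardized) form; y: 0/1 default label.
        Xb_tr, Xb_te, Xs_tr, Xs_te, y_tr, y_te = train_test_split(
            X_bulk, X_six, y, test_size=0.30, random_state=0)

        scaler_b, scaler_s = StandardScaler().fit(Xb_tr), StandardScaler().fit(Xs_tr)
        pca = PCA(n_components=n_components).fit(scaler_b.transform(Xb_tr))

        def features(Xb, Xs):
            # PCA components plus the standardized glm-6 predictors.
            return np.hstack([pca.transform(scaler_b.transform(Xb)),
                              scaler_s.transform(Xs)])

        model = LogisticRegression(max_iter=1000)
        model.fit(features(Xb_tr, Xs_tr), y_tr)
        return model, model.score(features(Xb_te, Xs_te), y_te)  # test accuracy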
5. CONCLUSION
This paper presents an analysis of mobile phone data using a social network representation and various prediction models to understand default patterns. The analysis of the underlying network reveals a large giant component, such that most nodes have at least some path to any other node in the network. Further, both the nodes within and outside the giant component exhibit relatively high centrality scores, meaning that nodes form tightly connected communities such that the path between nodes is generally quite short. Further, many nodes have a high degree, and the degree distribution exhibits a heavy, power-law-like tail. Using many of these properties as features, we were able to make even more accurate predictive models of default.

Our model evaluation shows that there are many variables that carry weak signals about user behavioral patterns but have strong predictive power when aggregated together. The unbalanced nature of the dataset makes the fitted models have high recall but low precision, yet they strongly outperform the random model in both measures.

There is still a lot of room for improvement in the modeling, including testing more complex oversampling methods, fitting additional models (SVM, LASSO, ANN), and including additional node-level network measures and community detection analysis.

6. ACKNOWLEDGEMENTS
This work was supported by the RENOIR EU H2020 project under Marie Skłodowska-Curie Grant Agreement No. 691152, as well as in part by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (the Network Science CTA), by the Office of Naval Research (ONR) Grant No. N00014-15-1-2640, and by NSF Grant No. DMR-1560266 under the REU Program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.

7. REFERENCES
[1] A. Bavelas. Communication patterns in task-oriented groups. J. of the Acoustical Society of America, 22(6):725–730, Nov 1950.
[2] Y.-A. de Montjoye, J. Quoidbach, F. Robic, and A. Pentland. Predicting personality using novel mobile phone-based metrics. In SBP, pages 48–55, 2013.
[3] L. C. Freeman. Centrality in social networks: conceptual clarification. Social Networks, 1(3):215–239, Jan 1978.
[4] M. S. Granovetter. The strength of weak ties. American J. of Sociology, 78(6):1360–1380, May 1973.
[5] S. Luo, F. Morone, C. Sarraute, M. Travizano, and H. A. Makse. Inferring personal economic status from social network location. Nature Communications, 8:15227, May 2017.
[6] M. E. J. Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, 64(1):016132, Jun 2001.
[7] J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási. Structure and tie strengths in mobile communication networks. PNAS, 104(18):7332–7336, May 2007.
[8] T. Opsahl, F. Agneessens, and J. Skvoretz. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3):245–251, Jul 2010.
[9] Y. Rochat. Closeness centrality extended to unconnected graphs: The harmonic centrality index. In ASNA, number EPFL-CONF-200525, 2009.
[10] G. Sabidussi. The centrality index of a graph. Psychometrika, 31(4):581–603, Dec 1966.
[11] V. K. Singh, L. Freeman, B. Lepri, and A. S. Pentland. Predicting spending behavior using socio-mobile features. In SocialCom, pages 174–179. IEEE, 2013.
[12] T. Squartini, F. Picciolo, F. Ruzzenenti, and D. Garlaschelli. Reciprocity of weighted networks. Scientific Reports, 3(1):2729, Dec 2013.
[13] C. Wang, A. Strathman, O. Lizardo, D. Hachen, Z. Toroczkai, and N. V. Chawla. Weighted reciprocity in human communication networks. Aug 2011.
[14] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

Connecting Professional Skill Demand with Supply

Erik Novak (Jožef Stefan Institute; Jožef Stefan International Postgraduate School; Jamova cesta 39, 1000 Ljubljana; erik.novak@ijs.si)
Inna Novalija (Jožef Stefan Institute; Jamova cesta 39, 1000 Ljubljana; inna.novalija@ijs.si)

ABSTRACT
Today's job market demands that job seekers continuously learn new skills. When applying for a job position one must have the required skill set.
If the applicant is missing a skill, it can be learned by attending a course. Finding the appropriate courses can be tedious but necessary work to keep up to date with the job market demand. In this paper, we present a dashboard which connects the job market skill demand with the courses that provide the required skill knowledge. We developed a pipeline for continuous crawling of job postings and courses which feeds the dashboard with the appropriate data. The dashboard allows searching by keywords and returns relevant job postings, courses and basic statistics relevant to the given search query.

General Terms
Job Market, Skill Set, Courses, Lectures, Design

Keywords
Information Retrieval, Data Mining, Analysis, Wikifier, VideoLectures.NET

1. INTRODUCTION
In today's job market the required skills are constantly evolving. This can be seen in the more technical fields such as web development and data science, where new tools and libraries are developed and made available to the public at an increasing rate. This is visible in both the research and industry sectors, where a job position might require a previously unseen skill and the applicant needs to learn it to be qualified. Finding the courses that would provide the skill knowledge can be tedious and does not guarantee its sufficiency.

To this end, we developed a dashboard which connects the job market skill demand with the courses that provide the required skill knowledge. We focused on job positions that require data science skills and on courses that are offered by acknowledged course providers.

Our contributions are a) creating a sizable data set of data-science-related job postings containing the job posting title, description, locations and other information, and b) developing a dashboard which for a given query shows relevant job postings as well as courses and lectures which provide the appropriate skills. The dashboard is updated daily with new job postings, showing the most recent changes. Basic statistics such as the most popular job locations and skills are also shown.

The remainder of the paper is structured as follows. In Section 2 we present related work. Next, data acquisition is explained in Section 3, followed by the presentation of the dashboard in Section 4. Finally, we discuss and conclude our work in Section 5.

2. RELATED WORK
There are multiple blogs that write about the top skills needed for getting a job in data science. One such blog is [14], which lists both the non-technical and technical skills a data scientist should have in the coming years. Another blog [15] lists the top data science skills and the courses where they can be learned. Many of these blogs are not up to date and do not reflect the current state.

A research report [11] writes about connecting supply and demand in Canada's youth labor market. The authors were interested in finding what skills young adults acquired during their education, how employers' demand is conveyed to students and those who support them, and how well the acquired skills are utilized on the job. They presented their results but did not develop an application that would help to narrow the gap between skill demand and supply.

Another report [13] discusses the mismatch between the skills young adults acquire during their education and the skills the companies demand. The authors found that skills are a critical asset for individuals, businesses and societies and that many employers report difficulties in finding suitably skilled workers. Additionally, they find that a sizable qualification mismatch is one of the biggest problems.

The company Year Up [9] helps young adults get the appropriate skills and the needed work experience.
They identify motivated individuals and companies that are prepared to help, send the individuals to learn new skills and afterwards apply the newfound skills at the companies, gaining the critical work experience for their career. The work is done manually, which can be expensive and time consuming.

3. DATA ACQUISITION
Open job positions can be found using job search services. These services aggregate job postings by location, sector, applicant qualifications and skill set or type. One such service is Adzuna [6], a search engine for job ads which mostly covers English-speaking countries. Another service is Trovit [7], a leading search engine for classified ads in Europe and Latin America. The service is available in 13 different languages and provides listings of jobs as well as cars, real estate and other products.

When applying for a job position, the applicant is required to have a certain skill set. If the requirements are not fulfilled, he can enroll in courses to acquire the missing skills. Additionally, watching certain lectures can give a deeper understanding of a particular problem, which can increase the probability of being accepted for a job position. VideoLectures.NET [8] is an award-winning free and open-access educational video lectures repository. It contains videos of individual lectures as well as lectures given at renowned conferences.

Crawling. Since we needed a continuous flow of data, we developed a pipeline for acquiring job postings, courses and lectures. This allows us to provide the dashboard, presented in Section 4, with the most recent data. For job postings we targeted portals like Adzuna, with an emphasis on positions in Data Science, and for courses we targeted different course providers, including Coursera [2], providing courses from top universities, and Hackr.io [4], a service which finds the best online programming courses and tutorials. We also targeted VideoLectures.NET to acquire video lectures containing the Data Science tag. The tags are assigned manually by the VideoLectures team.

For data acquisition and enrichment, we collected data either using dedicated APIs, including the Adzuna API [1], or using custom web crawlers. The data was formatted as JSON to aid further processing and enrichment.
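A single acquisition step against the Adzuna search endpoint can be sketched as follows (the endpoint shape and parameter names follow Adzuna's public API documentation; the credentials, query and pagination values are illustrative):

    # Sketch: fetch one page of job postings from the Adzuna search API as JSON.
    import requests

    def fetch_postings(app_id, app_key, country="gb", page=1, what="data science"):
        url = f"https://api.adzuna.com/v1/api/jobs/{country}/search/{page}"
        resp = requests.get(url, params={
            "app_id": app_id,                  # credentials issued by Adzuna
            "app_key": app_key,
            "what": what,                      # free-text query, e.g. a job title
            "results_per_page": 50,
        })
        resp.raise_for_status()
        return resp.json().get("results", [])  # one JSON document per job posting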
Enriching. The next step of data preprocessing is wikification - identifying and linking textual components to the corresponding Wikipedia pages [16]. This is done using Wikifier [10], which also supports cross- and multi-linguality, enabling the extraction and annotation of relevant information from job postings, courses and video lectures in different languages. Wikification allows us to search for job postings, courses and lectures in multiple languages.

Next, we use the Skill and Recruitment Ontology (SARO) [17] to extract Data Science skills from job postings. For each job posting we match the Wikipedia concepts with the skills found in the SARO ontology and declare the matched concepts as Data Science skills. These skills are then added to the job posting profile.

Finally, to allow searching by locations and countries, the job postings were further enriched using the GeoNames ontology [5] to include the latitude and longitude and the corresponding GeoNames ID and location name.
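The enrichment step can be sketched as follows (a minimal sketch; the Wikifier endpoint and parameters follow the service's public documentation, while the SARO lookup is reduced to a plain set of skill labels for illustration):

    # Sketch: annotate a job posting with Wikifier and keep the concepts that
    # match a set of skill labels taken from the SARO ontology.
    import requests

    WIKIFIER_URL = "http://www.wikifier.org/annotate-article"

    def extract_skills(text, user_key, saro_skills, lang="en"):
        resp = requests.post(WIKIFIER_URL, data={
            "userKey": user_key,   # personal key issued by wikifier.org
            "text": text,
            "lang": lang,
        })
        resp.raise_for_status()
        annotations = resp.json().get("annotations", [])
        concepts = {a["title"] for a in annotations}
        # Declare the Wikipedia concepts that appear in SARO as Data Science skills.
        return concepts & saro_skills

    # skills = extract_skills(posting["description"], "MY_KEY",
    #                         {"Machine learning", "SQL", "Java"})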
Data Set Statistics. The job postings data set contains almost 3.3M job postings acquired over a period of 18 months. Job postings were located in 144 different countries, the majority of them in Europe. Figure 1 shows the fifteen countries with the most found job postings. The UK dominates the other countries with 906k job postings, followed by France with almost 539k.

Figure 1: The fifteen countries with the most found job postings. The greatest number of job postings was found for the UK, followed by France and Germany.

There were 650 unique Data Science skills extracted from the data set. These include soft skills, such as leadership and management, knowledge of a particular domain, such as machine learning and artificial intelligence, and programming languages. Figure 2 shows the most demanded skills in the data set.

Figure 2: The fifteen most demanded skills. They are mostly comprised of high-level skills, such as "database" and "computer science", and programming languages.

The course data set contains over 63k course records, including their titles, descriptions and course providers. The data set is comprised of over 8k courses available online and 55k offline courses. Figure 3 shows the distribution of online courses by course provider. The most courses were acquired from Coursera with above 4k, followed by Hackr.io at 2k.

Figure 3: The distribution of online courses by course provider. The most courses were acquired from Coursera, followed by Hackr.io.

Finally, we acquired a data set of over 20k lectures published on VideoLectures.NET. It contains information about the lectures available on the video repository, including the title, the description and a link to the lecture.

4. DASHBOARD
Our objective is to automatically connect Data Science skill demand with the provided courses. To this end, we developed a dashboard [3] which enables its users to search for their desired job position, find out what the required skill set is, and find the appropriate learning materials and courses to acquire the missing skills. Additionally, the dashboard shows the most demanded skills and hiring locations for the given results. In this section, we present the content retrieval methodology and describe the different components of the dashboard.

Methodology. Here we present the methodology used for retrieving the demand and supply content. The content is retrieved by inserting a query text in the search input. The user may add additional query conditions by selecting Data Science skills, locations, countries and a time interval in which the job postings were published. Upon submission, the query is used to fetch the content that matches the conditions. While all query values are used for retrieving job postings, only the input text and skills are used for retrieving the course and video lecture content. Since courses and video lectures are available online, the location and time interval are irrelevant for retrieving the supply content. To retrieve the content, we first need to set an appropriate index. The job posting data set is indexed by Wikipedia concepts, Data Science skills, locations, countries and published date, while the course and lecture data sets are indexed only by Wikipedia concepts. The query text is sent through wikification to acquire the Wikipedia concepts, which are used for retrieving the relevant content. Next, the additional query conditions are used to filter the content. The remaining content is used to calculate the most demanded skills and hiring locations. Finally, the query results are returned and used to update the dashboard components. This process is implemented using QMiner [12], a data analytics platform for processing large-scale real-time streams containing structured and unstructured data.
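A minimal sketch of this retrieval flow over plain Python data structures (the actual implementation relies on QMiner's indices; the field names are illustrative):

    # Sketch: retrieve job postings matching a wikified query, then aggregate
    # the most demanded skills from the matches.
    from collections import Counter

    def search_postings(postings, query_concepts, skills=None, countries=None):
        # postings: list of dicts with "concepts", "skills" (sets of strings)
        # and "country" keys; query_concepts: set of wikified query concepts.
        hits = []
        for p in postings:
            if not query_concepts & p["concepts"]:
                continue                       # no overlap with the query concepts
            if skills and not skills <= p["skills"]:
                continue                       # all selected skills must be present
            if countries and p["country"] not in countries:
                continue
            hits.append(p)
        top_skills = Counter(s for p in hits for s in p["skills"]).most_common(10)
        return hits, top_skills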
Components. The dashboard is composed of different components. The largest component is a list of job postings. Each job posting is presented by its extracted information, including the Data Science skills extracted from the title and description. Figure 4 shows an example of a job posting in the list. Since Wikifier supports cross- and multi-linguality, the list consists of job postings written in different languages.

Figure 4: Example of a job posting returned by the query "machine learning". Even though the job posting is written in Spanish, the methodology finds it relevant.

If the user does not have the required skill set, it can be acquired by enrolling in the courses shown in the course list. The list shows courses offered by different online course providers that are relevant to the user's input query. Figure 5 shows the component containing the course list. Left and right arrows are used to navigate through the list, where each course is presented by its name and course provider.

Figure 5: A sample of recommended courses for the query "machine learning". Clicking on a course redirects the user to the course provider, where he can enroll.

Additionally, the user can watch lectures to get a deeper understanding of a problem. Similarly to the courses, the video lectures list shows relevant content found on VideoLectures.NET. Clicking a lecture redirects the user to the video lecture homepage.

The dashboard also shows the most demanded skills and a job posting timeline. The timeline shows how the ratio between the queried and all job postings has changed since the start of the year 2016. This also shows the trend of the skill demand in the queried job posting subset. Figure 6 shows the visualizations used to present the skill demand and the timeline.

Figure 6: On the left, the histogram of the ten most demanded skills; on the right, the timeline of the number of job positions, for the query "machine learning". Hovering over a histogram column shows the number of queried jobs demanding the skill.

Finally, a world map shows the most popular hiring locations extracted from the queried job postings. The locations are at first clustered; upon zooming, the clusters divide and the individual locations are shown. Figure 7 shows an example of clustered locations.

Figure 7: The top hundred hiring locations for the query "machine learning". The clusters show the number of locations they contain.

5. CONCLUSION AND FUTURE WORK
In this paper we presented a methodology for automatically connecting skill demand and supply. We acquired sizable job posting and course data sets, developed a methodology for retrieving job postings, courses and lectures relevant to the user query, and created a dashboard for showing the retrieved content.

In the future we wish to improve the data enrichment process by handling skills that are not in the SARO ontology and to add new features and improvements to the dashboard.

6. ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency, the EDSA EU H2020 project (Contract no: H2020-ICT-643937) and the RENOIR EU H2020 project under Marie Skłodowska-Curie Grant Agreement No. 691152.

7. REFERENCES
[1] Adzuna API. https://developer.adzuna.com/. Accessed: 2016-09-07.
[2] Coursera | Online courses from top universities. https://www.coursera.org/. Accessed: 2017-08-22.
[3] European Data Science Academy dashboard. http://jobs.videolectures.net/. Accessed: 2017-08-23.
[4] Find the best online programming courses & tutorials - Hackr.io. https://hackr.io/. Accessed: 2017-08-29.
[5] GeoNames. http://www.geonames.org/. Accessed: 2017-08-23.
[6] Job search - find every job, everywhere with Adzuna. https://www.adzuna.com/. Accessed: 2017-08-23.
[7] Trovit - a search engine for classified ads of real estate, jobs and cars. https://www.trovit.com/. Accessed: 2017-08-23.
[8] VideoLectures.NET. http://videolectures.net/. Accessed: 2017-08-22.
[9] Year Up - closing the opportunity divide. http://www.yearup.org/. Accessed: 2017-08-22.
[10] J. Brank. Wikifier. http://wikifier.org/. Accessed: 2017-08-23.
[11] R. Brisbois, L. Orton, and R. Saunders. Connecting Supply and Demand in Canada's Youth Labour Market. 2008.
[12] B. Fortuna, J. Rupnik, J. Brank, C. Fortuna, V. Jovanoski, M. Karlovcec, B. Kazic, K. Kenda, G. Leban, A. Muhic, et al. QMiner: Data analytics platform for processing streams of structured and unstructured data. In Software Engineering for Machine Learning Workshop, Neural Information Processing Systems, 2014.
[13] World Economic Forum. Matching skills and labour market needs: Building social partnerships for better skills and better jobs. http://www3.weforum.org/docs/GAC/2014/WEF_GAC_Employment_MatchingSkillsLabourMarket_Report_2014.pdf, 2014.
[14] M. Mayo. KDnuggets analytics big data data mining and data science. www.kdnuggets.com/2016/05/10-must-have-skills-data-scientist.html. Accessed: 2017-08-22.
[15] E. Mcnulty. Top 10 data science skills, and how to learn them. http://dataconomy.com/2014/12/top-10-data-science-skills-and-how-to-learn-them/. Accessed: 2017-08-22.
[16] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1375–1384. Association for Computational Linguistics, 2011.
[17] E. Sibarani, S. Scerri, N. Mousavi, and S. Auer. Ontology-based skills demand and trend analysis, July 2016.

Analyzing Raw Log Files to Find Execution Anomalies
A novel approach to analyzing text-based log files to find changes in the execution profile of complex IT systems

Viktor Jovanoski (Carvic d.o.o., Kotnikova 5, Ljubljana, Slovenia; viktor@carvic.si)
Jan Rupnik (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia; jan.rupnik@ijs.si)
Mario Karlovčec (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia; mario.karlovcec@ijs.si)
Blaž Fortuna (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia; blaz.fortuna@ijs.si)

ABSTRACT
Anomaly detection (a.k.a. outlier detection) is the identification of events that do not conform to an expected pattern in a dataset. When applied to the monitoring of modern, complex IT systems, it keeps track of a plethora of incoming data streams. This paper provides an approach that uses the lowest and most unstructured source of data related to an IT system - the raw system log files. Several versions and parametrizations of the basic building blocks are presented to show how different types of anomalies can be extracted from the data. Several experiments on synthetic as well as real-world data show the effectiveness of the algorithm. Special care is taken to keep the model and the resulting alerts interpretable, since detecting an error without a meaningful explanation of its details is of limited use to the end user (the results need to be actionable).

Keywords
Anomaly detection, Outlier detection, Infrastructure monitoring

1. INTRODUCTION
Modern IT systems are getting increasingly complex and distributed. On-line monitoring of their health is becoming critical for their normal operation. The data that is being collected about these systems comes in diverse time series (numerical, categorical, text), is potentially huge in volume and arrives with different latencies. For instance, complex systems often expose metrics about their performance, such as the number of served requests or the size of internal data structures. One can also monitor a system's performance indirectly, by inspecting the CPU load, the network load or database communication patterns.
Quality assurance of input and output data can also be performed, for instance on the number and the size of the data records. Spotting unusual behavior of such systems is crucial, as unhandled execution problems can lead to catastrophic results. The anomalies can be very different in nature, from abrupt changes that occur within a second to a gradual decay of performance that is only observable on a weekly or monthly scale.

Most of the research on anomaly detection has concentrated on outlier detection in numerical time series (e.g. financial time series as in [5], or time series arising from infrastructure monitoring) and discrete-event sequences (e.g. fraud detection as in [4], or cyber-security applications as in [3]). Data representation (the way of encoding the relevant information) is vital to the practical performance of an anomaly detection approach (does it capture relevant events or just insignificant variation?).

However, all of these numeric data series have to be "prepared" in advance. A developer had to think ahead about the potential problems that could arise and expose appropriate measurements. When such data is available, it can clearly signal the problem to the operators. But what happens when a new type of outage occurs and no measure to detect it was put in place? In such cases the infrastructure maintainers typically resort to inspection of raw log files. Writing a line of text to console or file output is often the only indication that something happened, or has not happened when it should have. In our experience, all complex IT systems produce such files, and they are still used to solve the hardest problems and errors.

The log files may be very unstructured in practice and may contain extremely diverse information: errors, warnings, database calls and initialization steps, as well as casual counters and observations. The log lines themselves may be unstructured: their content might be unordered and they may contain text written in natural language (e.g. error messages). Even then, these bits of information may or may not be written in an easy-to-parse form (e.g. JSON format). The only thing that can be expected of each line is a timestamp - a clear indication of the time on the server when this particular line was written to the log. Even if this data is missing from some lines, the sequential order of writing to the file helps us narrow down the possible timestamp for each line: we can simply choose to re-use the last timestamp before that line.
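This timestamp-recovery rule can be sketched as follows (a minimal sketch; the timestamp format is illustrative, as real log files vary):

    # Sketch: attach a timestamp to every log line, reusing the last seen
    # timestamp for lines that do not carry one.
    import re
    from datetime import datetime

    TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")  # illustrative

    def timestamped_lines(lines):
        last_ts = None
        for line in lines:
            m = TS_RE.match(line)
            if m:
                last_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if last_ts is not None:   # lines before the first timestamp are skipped
                yield last_ts, line.rstrip("\n")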
Many algorithms and approaches to anomaly detection have been proposed in the literature; [1] provides an excellent overview of the field. New approaches are getting increasingly more sophisticated at dealing with multidimensional numeric data and with discrete, sequential or even spatial data. The unstructured nature of raw log files makes the problem amenable to text-mining-based approaches. This article presents work in that direction.

2. ANOMALY-DETECTION
End users in charge of maintaining a large IT system will typically be concerned with two types of anomalies. Either they will want to avoid a sudden, critical degradation of performance, or they will want to know whether the performance has been slowly degrading and attempt to prevent that.

Abrupt change - In this scenario, the system "falls off the cliff": performance plummets and multiple parts of the system typically experience severe problems. Users will most often be notified of the problem via many channels. Hence, the raw log files are not the first place they will start looking at. However, if the cause of the problem cannot be determined or the exact timeline of the events of the disaster cannot be established, the information from the log files will also be used as a part of the forensic analysis.

Gradual deterioration - IT systems that have been in a production environment for a long time can experience gradual changes. These changes do not necessarily cause a catastrophic failure overnight, but degrade the performance over a longer period of time. Such changes are very difficult for the programmer to detect, as they are very subtle, e.g. they are only observable on a monthly scale. And often the only place where this can be detected is with a long-term analysis based on the raw log files.

2.1 General pipeline
Algorithm 1 shows the general structure of a typical anomaly-detection system that operates on a stream.

Algorithm 1: General anomaly pipeline
    while input data available do
        parse data and extract record
        calculate anomaly score
        if score above threshold then
            report anomaly
        end if
        add record to model
    end while

First, a data record is extracted from the incoming data. As stated before, the timestamp is always defined. The record can also incorporate recent past data (previous lines of the log file) and contextual data (e.g. overall server status, a holidays indicator).

In the second step a score is calculated that should reflect the record's anomalousness (non-normality, novelty). The simpler the score, the easier the subsequent steps are. Additionally, we prefer models and scoring functions that can easily be explained to the end user, since he is supposed to act on them. Lastly, the score should be constructed in such a manner that the anomalous examples fall on one edge of the spectrum (e.g. the higher the score, the bigger the anomaly).

The score value is then used to decide whether the incoming data record is an anomaly - either by comparing it against some manually predefined static threshold or against a dynamic one, which uses historical score values to determine the threshold autonomously. Finally, in the case of an anomaly, an alert is created that contains enough data to explain what was observed and why it was tagged as an anomaly.

Figure 1: Text-processing anomaly pipeline.

Only when all of the above steps are done do we add the new data point to the model. We can store it internally in the model for some time, but we have no guarantee of ever again being able to access all the historical data (e.g. for retraining the algorithm).
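Algorithm 1 translates directly into a streaming skeleton (a minimal sketch; the parser, the scoring function and the model are deliberately left abstract, and the static threshold stands in for the dynamic variant):

    # Sketch: the general anomaly pipeline of Algorithm 1 as a streaming loop.
    def run_pipeline(stream, parse, score, model, threshold, report):
        for raw in stream:
            record = parse(raw)          # parse data and extract record
            s = score(model, record)     # calculate anomaly score
            if s > threshold:            # static threshold for simplicity; a
                report(record, s)        # dynamic one would use past scores
            model.add(record)            # add the record to the model last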
2.2 Defining normality
It is crucial for any anomaly-detection system to be able to tell what normal is, so that based on this notion it can say that something is not normal, i.e. anomalous. When available, domain knowledge and experience from human experts can be used to capture the appropriate aspect of normality in the data, which greatly improves the quality of the reported anomalies.

In general, however, the system needs the ability to autonomously define normality. This ability greatly depends on the scoring function, and it is common practice to define it in such a way that only the values on the border of the observed values are indicators of an anomaly.

2.3 Detecting anomalies on raw log files
We will now present our approach to anomaly detection based on text-processing techniques. The steps of the approach are given in Algorithm 2 and graphically presented in Figure 1.

Algorithm 2: Log-file anomaly pipeline
    while input data available do
        read data from log file
        parse data and extract BOW
        insert into time windows
        calculate distance in kNN
        if score quantile above threshold then
            create explanation using distance
            report anomaly
        end if
        add record to model
    end while

The log parser processes a single line at a time and each time emits a record with several features. In the second step we extract standard BOW (bag-of-words) feature vectors. There are several possible ways to extract features and to weight each feature dimension. We chose to use a simple representation where we extract tags, such as server=x and process=mytask.tsk. We collect all these tags from the record and assign them weight 1. This means that individual lines can produce vectors of varying lengths, as they are not normalized. We could also normalize them or re-weight them using the TFIDF weighting scheme [6] to down-weight frequent tags.

To capture the wider context of each record, we aggregate the set of records within a time window of predefined length and emit a combined record (a simple normalized sum of all the vectors) that represents that window.
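The tag extraction and window aggregation can be sketched as follows (a minimal sketch; the tag pattern is illustrative):

    # Sketch: extract key=value tags as BOW features and aggregate them into
    # fixed-length time windows (normalized sum of the per-line vectors).
    import re
    from collections import Counter

    TAG_RE = re.compile(r"\b([A-Za-z_]+=[^\s]+)")  # e.g. server=x, process=mytask.tsk

    def line_vector(line):
        return Counter(TAG_RE.findall(line))        # each tag gets weight 1

    def window_vector(lines):
        total = Counter()
        for line in lines:
            total.update(line_vector(line))
        norm = sum(v * v for v in total.values()) ** 0.5
        return {k: v / norm for k, v in total.items()} if norm else {}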
In the fourth step, we use the k-nearest-neighbors algorithm (kNN) to find which k windows from the past are the closest to the current window. This unsupervised algorithm was chosen because it can handle skewed distributions very well.

The average distance to the k neighbors is used as the anomaly score: the further away an instance is from its nearest-neighbor set, the more anomalous it is. The value of the parameter k is usually small, between 1 and 5. The best value depends on the data domain and should be determined by experimenting.

The absolute scale of the score may vary from problem to problem and also depends on the feature representation. For that reason we used a quantile-based approach: we compare the anomaly score of a new instance with the scores of recently observed data points (the size of the set of recently observed data points is controlled with a parameter, the learning window). If the score is higher than a large fraction of the scores (for example 0.999, controlled by a parameter nn_rate), then we classify the instance as an anomaly. The quantile (1 - nn_rate) directly controls the detection rate (0.001 corresponds to classifying 0.1% of instances as anomalies) under the assumption that the data distribution is stationary.

2.3.1 Tokenization
When tokenizing the input texts we have several options. If we know that some common patterns exist for how special entities are marked inside the text, we can extract them and create useful features for the BOW step. For instance, we can use the message text directly to create the BOW. Alternatively, we can just find identifiers of the origin of the message (e.g. process name, method name, class name, page name etc.) and use these instead of the whole texts.

2.3.2 Anomaly explanation
We construct the explanation for an individual alert by finding its nearest neighbor, subtracting one vector from the other and squaring each element of the resulting vector. Each dimension is thus attributed an "anomalousness" score, and the highest-scoring dimensions contributed the most to the distance to the nearest neighbor. The explanation shown to the user contains a sorted list of the highest-scoring dimensions (clipped to avoid information overload).

3. EXPERIMENTAL RESULTS
3.1 Evaluation
In most real-world anomaly-detection cases we receive a dataset that is largely unlabelled. The labels we have usually denote some special catastrophic situations that users experienced and want to avoid in the future. The rest of the data can be assumed to be mostly "normal", but unknown anomalies may also remain in the dataset. The standard metrics for evaluating the performance of an anomaly detector are precision (#true anomalies / #predicted anomalies) and recall (#predicted true anomalies / #true anomalies). Low precision translates into a higher burden on the user who inspects the anomalies (each inspection has some cost), and low recall translates into missing many anomalies and increased risk. If we suspect that the dataset is not labelled completely (unlabelled anomalies are present), the precision might be estimated pessimistically and the recall might be measured optimistically. In such cases, manual inspection of the false positives might lead to discovering new relevant types of anomalies present in the dataset.

3.2 Synthetic data
We generated log files by simulating the parallel execution of several processes, each having a specific pattern of writing to the log. We then manually injected 8 instances of anomalous entries, 2 per week, each pair occurring within one minute. We used 10 days for the kNN learning window length, so there are no anomalies in the first 10 days. For the length of the input-grouping window, we experimented with two settings: 1 minute and 1 hour. In the former setting we have 6 original anomalies, but in the latter we only have 3, since each pair of adjacent anomalies (they are 1 minute apart) gets collapsed into the same hour. We set the parameter k to 1, thus making the explanation of the anomaly very simple.
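The scoring, thresholding and explanation steps can be sketched as follows (a minimal sketch over the sparse window vectors from the previous step, with k defaulting to 1):

    # Sketch: kNN anomaly score with a quantile threshold and a per-dimension
    # explanation against the nearest past window.
    def distance(a, b):
        keys = set(a) | set(b)
        return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys) ** 0.5

    def knn_score(window, history, k=1):
        dists = sorted(distance(window, h) for h in history)
        return sum(dists[:k]) / k          # average distance to the k nearest windows

    def is_anomaly(score, past_scores, nn_rate=0.001):
        # Anomalous if the score exceeds the (1 - nn_rate) quantile of past scores.
        past = sorted(past_scores)
        cut = past[int((1.0 - nn_rate) * (len(past) - 1))]
        return score > cut

    def explain(window, nearest, top=10):
        # Squared per-dimension differences, largest contributors first (clipped).
        keys = set(window) | set(nearest)
        contrib = {k: (window.get(k, 0.0) - nearest.get(k, 0.0)) ** 2 for k in keys}
        return sorted(contrib.items(), key=lambda kv: -kv[1])[:top]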
Table 1: Synthetic data - window = 1 min.

nn_rate | Precision | Recall
0.003 | 0.06 | 1.00
0.001 | 0.15 | 0.66
0.0005 | 0.23 | 0.66

Table 2: Synthetic data - window = 1 h.

nn_rate | Precision | Recall
0.05 | 0.07 | 0.66
0.03 | 0.14 | 0.66
0.01 | 0.33 | 0.33

Table 1 shows the results for windows of length 1 minute. When the parameter nn_rate, which controls the sensitivity to outliers, was set to 0.003, the method correctly detected all 6 manually inserted anomalies. Table 2 shows the results for windows of length 1 hour. This granularity of the data is too coarse and hides certain anomalies, which remain undetected.

3.3 Web-server logs
We analyzed browsing-pattern logs from a production web server, where the logs contained information on web-page and file requests. The specific feature of this web site is that it is almost completely static - there are almost no new pages being added to it, so the browsing patterns should be relatively constant. The dataset spanned a period of one month. We set the summarization window length to 5 minutes and the kNN comparison window length to 10 days. The parameter k was again set to 1 and the anomaly rate was set to 0.001. The parameters were hand-tuned.

When analyzing such data, the anomalies that we are interested in should capture both system failures (malfunctioning software, infrastructure failure) as well as malevolent behavior (a denial-of-service (DoS) attack, hacker-induced scanning for exploits). In these cases the system should ideally produce anomalies with strong dimensional outliers.

We manually inspected the anomalies where the strongest dimension contributed more than 30% of the total anomaly score. An example is shown in Table 3, where the column Val means the value in the current record, Near means the value in the nearest record, and Contr means the contribution to the total distance. It turns out that these anomalies corresponded to the rare events when new content was added to the web page. No DoS or hacker attacks were detected in the observed time period. For example, the vector dimension for the file mission-sts-68.html has Val = 0.707; this file was not present in the nearest record (Near = 0) and this dimension contributed 0.354 of the total distance between this record and the nearest one.

Table 3: Web logs - an anomaly explanation example.

File | Val | Near | Contr
/shuttle/missions/sts-68/mission-sts-68.html | 0.707 | 0 | 0.354
/images/NASA-logosmall.gif | 0 | 0.505 | 0.180
/htbin/cdt_main.pl | 0 | 0.416 | 0.122
4. CONCLUSIONS AND FUTURE WORK
We presented a novel combination of known approaches to anomaly detection using techniques developed in the field of text mining. Using these, we were able to extract valuable information from raw textual log files that are normally only used for manual inspection and forensic analysis.

Our algorithm is based on processing log files, but many other sources of information can be used to extract anomalies in modern IT systems. Our long-term goal is to design what we call a Full-spectrum anomaly detection system (FSADS) that will be able to import many different types of data streams, covering a wide range of aspects of an IT system (inputs, outputs, internal performance, database performance, network communications etc.). After each single source of data is analyzed and the anomalies are extracted, the next step in FSADS will be to correlate them, determine the critical signals (which anomalies have a high impact on the system?), indicate possible root causes (what might have caused a particular anomaly?) and give predictions (what may follow after detecting a particular type of anomaly?).

The popularity of deep learning techniques ([7]) is also felt in the anomaly-detection field, and we plan to study their application to multivariate analysis. Most often, autoencoders are used to create compact descriptions of the data and may also be used to highlight the dimensions with a high reconstruction error. Another interesting approach is to use generative adversarial networks ([2]), where two neural networks contest in a zero-sum game: one network generates "normal" candidates and the other discriminates between generated examples and true examples. We currently see two major challenges for the broader use of deep neural networks in anomaly detection systems. The first one is the ability to explain the results to the end user in an actionable way. The second one is the processing of stream data and updating the model on the fly. Currently, existing and simpler techniques (e.g. kNN, clustering, statistical tests) provide much better support for these requirements. However, the linearity in describing the feature space is an issue that we would like to address in the future, and deep neural networks present a promising approach.

5. REFERENCES
[1] C. C. Aggarwal. Outlier Analysis. Springer, New York, 2013.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[3] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '04, pages 219–230, New York, NY, USA, 2004. ACM.
[4] X. Liu, P. Zhang, and D. Zeng. Sequence matching for suspicious activity detection in anti-money laundering. In C. C. Yang, H. Chen, M. Chau, K. Chang, S.-D. Lang, P. S. Chen, R. Hsieh, D. Zeng, F.-Y. Wang, K. M. Carley, W. Mao, and J. Zhan, editors, ISI Workshops, volume 5075 of Lecture Notes in Computer Science, pages 50–61. Springer, 2008.
[5] C. Phua, V. C. S. Lee, K. Smith-Miles, and R. W. Gayler. A comprehensive survey of data mining-based fraud detection research. CoRR, abs/1009.6119, 2010.
[6] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, Nov. 1975.
[7] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang. Deep structured energy based models for anomaly detection. ArXiv e-prints, May 2016.

Usage of SVM for a Triggering Mechanism for Higgs Boson Detection

Klemen Kenda (Jožef Stefan Institute, Artificial Intelligence Laboratory; Jožef Stefan International Postgraduate School; Jamova 39, 1000 Ljubljana, Slovenia; klemen.kenda@ijs.si)
Dunja Mladenić (Jožef Stefan Institute, Artificial Intelligence Laboratory; Jožef Stefan International Postgraduate School; Jamova 39, 1000 Ljubljana, Slovenia; dunja.mladenic@ijs.si)

ABSTRACT
Real-time classification of events in high energy physics is essential to deal with the huge amounts of data produced by proton-proton collisions in the ATLAS detector at the Large Hadron Collider at CERN. With this work we have implemented a machine-learning-based triggering mechanism for saving relevant data.
In comparison with the state-of-the-art machine learning methods (gradient boosting and deep neural networks), the shortcomings of Support Vector Machines (SVM) have been compensated for with extensive feature engineering. The method has been evaluated with a special metric (approximate median significance) suggested by the domain experts. Our method achieves significantly higher precision and an 8% lower approximate median significance than the current state-of-the-art method used at the ATLAS detector (XGBoost).

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining, scientific databases

General Terms
Algorithms, Measurement, Experimentation.

Keywords
Support Vector Machine, Gradient Boosting, Classification, High Energy Physics, Higgs Boson

1. INTRODUCTION
The ATLAS and CMS experiments announced the discovery of the Higgs boson in 2012 [1]. The experiments were conducted at the Large Hadron Collider (LHC) at CERN in Geneva. The discovery was followed by a Nobel Prize in Physics, awarded to François Englert and Peter Higgs. The existence of the particle, which gives mass to other elementary particles, had been predicted around 50 years earlier [6][7][8].

The Higgs boson decays almost instantly and can be observed only through its decay products. Initially the particle was observed through the $H \to \gamma\gamma$, $H \to Z^0 Z^0$ and $H \to W^+ W^-$ decays. These decays leave a signature that is relatively easy to interpret. The next steps required the analysis of Higgs boson decays into fermion pairs: τ leptons or b quarks.

In this paper we focus on a special topology of the $H \to \tau^+ \tau^-$ decay [9]. Due to similarities with other decays, this particular decay is very difficult to classify. Distinguishing background (events that do not belong to the $H \to \tau^+ \tau^-$ decay) from signal (events that belong to the Higgs boson decay) requires the use of state-of-the-art machine learning methods.

In the past the task has often been solved with simple cut-off techniques based on statistical analysis, performed by expert users. Today, advanced classification methods based on machine learning are used regularly. State-of-the-art methods for this type of problem include deep neural networks and gradient boosting [10][11][12]. Experiments at CERN prefer the usage of gradient boosting classifiers, as they are able to evaluate large amounts of data (more than 20 × 10^6 events/s) [4].

The success of both methods is based on their intrinsic property of introducing non-linearity into the system. In our work we want to compare basic linear methods and Support Vector Machines (SVM) with different kernels to the state-of-the-art models. Additionally, we want to enrich the data by intensive feature engineering. The results of the feature engineering can be used for further physical interpretation of the relevant physical phenomena.

2. DATA
The dataset was made public by the ATLAS collaboration for the Higgs Boson Machine Learning Challenge on Kaggle in 2014 [3]. It contains data from the ATLAS detector simulator (real labelled data would be impossible to obtain). The winning method from the challenge is being used in the ATLAS experiment today [4].

Figure 1. Distribution of signal (yellow) and background (green) according to the most informative attribute, DER_mass_MMC (mass of the Higgs boson candidate) [5].

2.1 Data Description
The dataset consists of 250,000 instances: 85,667 represent signal and 164,333 represent background. Each instance consists of 32 attributes and 1 target variable. All the attributes are numerical (continuous); the target variable is nominal (binary). Two of the attributes should not be used for classification purposes, as they represent the id of the instance and the probability of such an event happening in the experiment [4]. There are missing values in the data: 11 attributes could not always be measured due to the characteristics of the detector. The distribution of the missing values is different for signal and for background.
The signal is limited to the events representing only one possibility for the $\tau^+ \tau^-$ pair decay [4].

2.2 Data Understanding
The main task of our method is to separate the signal from the background, based on the ATLAS detector measurements. As vast amounts of data (a few terabytes per day) are generated within the process, it is crucial that only the relevant events are detected and stored [4].

Exploratory analysis has shown (see Figure 1) that this task cannot be successfully accomplished with simple cut-off techniques based on a single attribute. Figure 2, depicting a plot of PCA components, is a bit more promising, as parts of the phase space can clearly be assigned to one of the classes.

Figure 2. Plot of the 1st PCA component against the 3rd. Red dots represent signal instances, green dots represent background instances [5].

The attributes are divided into 3 groups: the first group contains 18 primary attributes (measured in the detector), the second group contains 12 derived attributes (relevant physical quantities calculated from the primary attributes), and the last group contains 2 metadata values (weight and event id). A detailed exploratory data analysis can be found in [5].

2.3 Data Preprocessing
The ATLAS detector enables good precision of all measurements, therefore the expected noise in the data is very small and cannot be further filtered. Missing values have been dealt with in two different ways: firstly, we used a "replacement with average" strategy to fill in the missing data, and secondly, we generated additional binary features representing missing attribute values. SVM expects input data to be normalized, therefore the features have been standardized to zero mean and unit standard deviation.

2.4 Feature Engineering
The main task of our work consisted of extensive feature engineering, where non-linear combinations of features were introduced to overcome the shortcomings of linear SVM in comparison with gradient boosting or deep neural networks. We built new features from the original attributes by transforming them with some common functions like $e^x$, $x^2$, $x^3$, $\sqrt{x}$ and $\log(x)$. Additionally, we used k-means clustering to generate an additional attribute (cluster id). All the generated feature sets are shown in Table 1.

Table 1. Attribute sets used for SVM.
Set  Description
1    Original feature set.
2    Added missing-value features.
3    Filtered missing values and all $e^x$ derivatives.
4    Filtered missing values, $e^x$ and all $x^2$ derivatives.
5    Filtered missing values, $e^x$, $x^2$ and all $x^3$ derivatives.
6    Filtered missing values, $e^x$, $x^2$, $x^3$ and all $\sqrt{x}$ derivatives.
7    Filtered missing values, $e^x$, $x^2$, $x^3$, $\sqrt{x}$ and all $\log(x)$ derivatives.
8    Selection of the most relevant transformations by one attribute.
9    Unfiltered set of transformations by one attribute.
10   Unfiltered set of $x_i x_j$.
11   Set of attributes by one of the HiggsML winners (Tim Salimans, DNN).
12   Unfiltered set of $x_i^2 + x_j^2$.
13   Unfiltered set of $e^{x_i^2 + x_j^2}$.
14   Unfiltered set of $\sqrt{x_i^2 + x_j^2}$.
15   Unfiltered set of $(1 + x_i x_j)^2$.
16   Filtered set of transformations by 1 and 2 attributes.
17   (8) with k-means cluster id.

Filtering of the features has been done manually, with a simple cut-off technique based on feature importance as obtained from the linear SVM model. A sketch of the feature construction is given below.
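To make the construction concrete, the following is a minimal sketch, assuming hypothetical column names, of how the one-attribute derivatives, the missing-value indicator features and the k-means cluster id from Table 1 could be produced with Pandas, NumPy and scikit-learn (the libraries named in Sections 3 and 4). The guards for $\sqrt{x}$ and $\log(x)$ on non-positive values and the number of clusters are our assumptions, not the authors' exact choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def engineer_features(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """One-attribute transformations (e^x, x^2, x^3, sqrt(x), log(x))
    plus binary missing-value indicators for each attribute."""
    out = df.copy()
    for c in cols:
        x = df[c]
        out[c + "_missing"] = x.isna().astype(int)    # set (2): missing-value features
        out[c + "_exp"] = np.exp(x)
        out[c + "_sq"] = x ** 2
        out[c + "_cube"] = x ** 3
        out[c + "_sqrt"] = np.sqrt(x.clip(lower=0))   # guard against negative values
        out[c + "_log"] = np.log(x.where(x > 0))      # guard against non-positive values
    return out.fillna(out.mean(numeric_only=True))    # "replacement with average"

# Standardize (SVM expects normalized inputs) and add a cluster-id attribute.
features = engineer_features(df, numeric_cols)        # df, numeric_cols: assumed inputs
X = StandardScaler().fit_transform(features)
features["cluster_id"] = KMeans(n_clusters=10, random_state=0).fit_predict(X)
```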
3. MACHINE LEARNING METHODS USED
Baseline experiments have been carried out with simple cut-off techniques and linear methods like logistic regression and the Naïve Bayes classifier. As state-of-the-art methods we included gradient boosting and gradient boosting adjusted for the approximate median significance metric (see Section 3.2) [11].

We propose to use the SVM method [12]. Linear SVM can also be used for feature selection with a large number of attributes, as it can discover the most relevant features in a large feature set. Data transformation has been handled with the Pandas library in Python.

3.1 Brief Description of SVM
In our setting we are solving a binary classification problem. Let us assume that the classes are linearly separable in our space. In general, there are many different hyperplanes that can separate the two classes. The support vector machine (SVM) method is also called a maximum margin classifier: there exists exactly one hyperplane that maximizes the margin between the two classes [12].

Figure 3. Maximum margin of the dividing hyperplane in SVM [5].

The SVM classifier is derived from maximization of the margin, which can be translated into minimization of $\|w\|^2$ [5][12]. As we are dealing with data sets where the classes are not separable, we need to consider a soft margin method that takes the classification error into account. SVM therefore solves the minimization problem

$\min \; \|w\|^2 + C \sum_{i=1}^{n} \xi_i$,

where $\xi_i$ is a classification error term and $C$ is a parameter that controls the influence of $\xi$.

3.2 Brief Description of the Evaluation Criteria
Evaluation of the results is done with measures derived from the confusion matrix (accuracy, precision, recall, $F_1$). The domain-specific evaluation metric, approximate median significance (AMS), is defined as

$AMS = \sqrt{2 \left( (s + b + b_{reg}) \ln \left( 1 + \frac{s}{b + b_{reg}} \right) - s \right)}$,

where $s$ represents the sum of event probabilities of true positives (signal), $b$ represents the sum of event probabilities of background events that pass the selection, and $b_{reg}$ is a pre-set regularization parameter set to 10. The metric favors recall over precision. In a real setting this algorithm is used as a triggering mechanism for saving relevant data; the probability of a positive example in the real data is only around $p \approx 2 \times 10^{-5}$, therefore we do not want to lose many of them.
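A direct NumPy transcription of the AMS definition above, assuming arrays of labels, per-event weights and a boolean selection mask (the variable names are ours, not the authors'):

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    """Approximate median significance for the weighted sums s (selected
    signal) and b (selected background), with b_reg = 10 as above."""
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))

def ams_score(y, weights, selected):
    """y: 0/1 labels, weights: per-event weights, selected: boolean mask
    of events classified as signal."""
    s = weights[selected & (y == 1)].sum()
    b = weights[selected & (y == 0)].sum()
    return ams(s, b)
```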
4. EVALUATION
Experiments have been carried out in Python. Data loading and cleaning has been accomplished with the Pandas library; the implementation of SVM, scaling and other methods has been taken from the scikit-learn package [13]. Default parameters for SVM have been used. On our system the SVM learning phase took ~1 hour, so for time optimization purposes a plain evaluation with a training and a test set has been performed: the training set consisted of 225,000 and the test set of 25,000 instances.

Results of the extensive feature engineering are shown in Table 2. Linear SVM performed similarly to the linear baseline methods (logistic regression, Naïve Bayes), with an AMS score of ~2.00 on the basic feature set. The best feature sets for linear SVM were (10), (12), (13) and (14); these feature sets include two-attribute transformations, e.g., $x_i x_j$. It is interesting to notice that the filtered feature sets performed slightly worse. Extensive feature generation achieved almost 30% better AMS results (1.999 on the basic feature set compared to 2.583).

Table 2. Evaluation of different attribute sets on SVM with linear kernel.
Attribute set   Prec.   Rec.    Acc.    F1      AMS
1               0.665   0.548   0.749   0.600   1.999
3               0.748   0.655   0.805   0.698   2.526
4               0.748   0.654   0.805   0.698   2.528
5               0.740   0.657   0.802   0.696   2.478
6               0.743   0.683   0.809   0.712   2.547
7               0.734   0.690   0.807   0.711   2.516
8               0.732   0.670   0.802   0.700   2.482
10              0.744   0.705   0.815   0.724   2.582
11              0.694   0.584   0.768   0.634   2.201
12              0.744   0.705   0.815   0.724   2.583
13              0.744   0.709   0.816   0.726   2.581
14              0.744   0.705   0.815   0.724   2.583
15              0.744   0.710   0.816   0.726   2.578
16              0.740   0.684   0.809   0.711   2.553

Table 3 contains the results of the baseline, state-of-the-art and proposed SVM methods; the best feature sets for the selected methods were chosen. The baseline methods are a simple window (a cut-off technique on the candidate particle mass) and logistic regression. As state-of-the-art methods we included gradient boosting (GBC) and the current state of the art (XGBoost, gradient boosting optimized for AMS). The proposed methods are linear SVM, SVM with RBF kernel (SVM-r) and optimized SVM with RBF kernel (opt. SVM-r).

Table 3. Evaluation of different methods and attribute sets compared to baseline and state-of-the-art methods.
Method and attribute set   Prec.   Rec.    Acc.    F1      AMS
simple window              0.560   0.824   0.716   0.667   1.579
log. reg. (1)              0.668   0.535   0.749   0.594   2.015
SVM-LIN (13)               0.744   0.709   0.816   0.726   2.581
GBC (8)                    0.787   0.703   0.832   0.742   2.856
SVM-r (8)                  0.791   0.718   0.837   0.752   2.940
opt. SVM-r (8)             0.907   0.446   0.793   0.598   3.451
XGBoost (1)                0.665   0.806   0.793   0.729   3.735

The usage of kernels (RBF and polynomial kernels have been tested) improved the AMS score by another ~15%. Because of the nature of SVM kernels in this setting, 2-attribute transformations were less efficient than 1-attribute transformations; the selection of the most relevant transformations by 1 attribute (set (8)) gave the best results. The method behaved better than the gradient boosting classifier (GBC) on the same training set, although neither method was optimized to maximize the AMS score. The difference nevertheless suggests that the usage of SVM might be a promising way to proceed.

Finally, we optimized the SVM with RBF kernel for the AMS score and compared it to the XGBoost method, which implements gradient boosting optimized for AMS. The optimization has been done by tuning the threshold on the SVM confidence score. Our method performs approximately ~8% worse than the state of the art. There is, however, a big difference with XGBoost: our method yields higher precision than the other methods while still preserving a very high AMS score. The proposed method also performs ~20% better than other SVM-based methods reported in the HiggsML Challenge [3].
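The threshold optimization described above can be sketched as a scan over the SVM confidence scores: sort events from most to least signal-like, accumulate the weighted signal and background sums, and keep the cut with the highest AMS. This is our reconstruction of the idea, not the authors' published code.

```python
import numpy as np

def optimize_threshold(scores, y, weights, b_reg=10.0):
    """scores: SVM decision_function values; returns the confidence
    threshold that maximizes AMS and the AMS achieved."""
    order = np.argsort(scores)[::-1]                  # most signal-like first
    s = np.cumsum(weights[order] * (y[order] == 1))   # signal selected so far
    b = np.cumsum(weights[order] * (y[order] == 0))   # background selected so far
    ams = np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))
    best = int(np.argmax(ams))
    return scores[order][best], ams[best]

# Usage with an RBF-kernel SVM (parameters assumed, not the tuned ones):
# from sklearn.svm import SVC
# clf = SVC(kernel="rbf").fit(X_train, y_train)
# threshold, best_ams = optimize_threshold(clf.decision_function(X_test), y_test, w_test)
```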
5. CONCLUSION
In our work we have examined the potential of SVM for a triggering mechanism in the high-energy physics domain. With extensive feature engineering we have also provided an interesting input for high energy physics experts, as the most effective generated features could be analyzed through domain knowledge. Our method achieves a more than 200% better AMS score compared to cut-off techniques based on a statistical approach. Further, our method achieves a ~20% better AMS score than other SVM-based methods reported by HiggsML Challenge competitors, but performs ~8% worse than the current state of the art (XGBoost). There is, however, a significant difference between our method and the state of the art: although achieving a comparable AMS score, our method achieves much better precision. This might make SVM-based methods valuable members of an ensemble.

Beside adding SVM methods to ensembles and trying to improve the state of the art, further work could be done on adapting the SVM optimization to the AMS metric. In our work, features were selected based on weight importance; often different transformations of the same attributes have been selected, and features that could improve our models only a little have potentially been left out. This should be studied further. Optimization of the SVM parameters should also be performed.

6. ACKNOWLEDGMENTS
This work was partially supported by the Slovenian Research Agency.

7. REFERENCES
[1] G. Aad et al. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1):1–29, 2012.
[2] S. Chatrchyan et al. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B, 716(1):30–61, 2012.
[3] HiggsML challenge. https://www.kaggle.com/c/higgs-boson, 2014. [Online; accessed April 20, 2016].
[4] C. Adam-Bourdarios, G. Cowan, C. Germain, I. Guyon, B. Kégl and D. Rousseau. The Higgs boson machine learning challenge. In Workshop on High-energy Physics and Machine Learning, HEPML 2014, held at NIPS 2014, Montreal, Quebec, Canada, December 8-13, 2014, pages 19–55.
[5] K. Kenda, T. Podobnik, A. Gorišek. Uporaba metod strojnega učenja pri analizi podatkov, zajetih z detektorjem ATLAS. Diploma thesis, 2016. Faculty of Mathematics and Physics, University of Ljubljana.
[6] P. W. Higgs. Broken symmetries, massless particles and gauge fields. Physics Letters, 12:132–133, September 1964.
[7] F. Englert, R. Brout. Broken symmetry and the mass of gauge vector mesons. Physical Review Letters, 13:321–323, August 1964.
[8] P. W. Higgs. Broken symmetries and the masses of gauge bosons. Phys. Rev. Lett., 13:508–509, October 1964.
[9] G. Aad et al. Evidence for the Higgs-boson Yukawa coupling to tau leptons with the ATLAS detector. JHEP, 04:117, 2015.
[10] T. Chen, T. He. Higgs boson discovery with boosted trees. In HEPML 2014, held at NIPS 2014, pages 69–80.
[11] T. Chen, C. Guestrin. XGBoost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016.
[12] B. E. Boser, I. M. Guyon, V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, p. 144.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

A methodology to evaluate the evolution of networks using topological data analysis

Joao Pita Costa*,** and Tihana Galinac Grbac*

Abstract—Networks are important representations in computer science to communicate structural aspects of a given system of interacting components. The evolution of a network has several topological properties that can provide us information on the network itself. In this paper, we present a methodology to compare the topological characteristics of the evolution of a network, encoded into a (persistence) diagram that tracks the lifetimes of those features. This enables us to classify the evolution of networks based on the distance between the diagrams that represent such network evolution. In addition, we consider complex vectors, which bring a complementary perspective to the distance-based classification that is closer to the computational methods, aim to enhance the computational efficiency of those comparisons, and are by themselves a source of open research questions.

Fig. 1. The evolving network ScienceAtlas of the collaborations in scientific works by Slovenian researchers, evolving over a 9-year period with 3-year leaps. Each node represents an author and each edge represents a collaboration. The nodes with degrees smaller than 20 are filtered out so that the networks are not too large to be visualized.
I. INTRODUCTION

A. Comparing the topology of the evolution of networks

Networks that change as a function of time, known as evolving networks, are a natural extension of undirected graphs (i.e., standard static networks). Almost all real world networks evolve over time, either by adding or removing nodes or edges. Scientific collaboration analysis, as in the example of Figure 1, shows such a network.

The analysis of the evolution of a network is of interest transversally to many fields of knowledge, from social network analysis and scientific collaboration to computational biology. A standard example is the network dynamics of a social network such as Twitter, which should consider an evolution through time where new nodes come up as new members join, and new edges are created mirroring the new relationships between members [1]. Often all of these processes occur simultaneously in social networks.

A biological network, on the other hand, is an approximate mathematical representation of connections found in ecological, evolutionary, and physiological research, among others. An example of a relevant application of such analysis of biological networks with respect to human diseases is network medicine. It considers networks in biological systems containing many components connected within complicated relationships but organized by simple principles [1].

Fig. 2. An example of an evolving network based on an undirected graph that is growing over time by adding new edges and nodes at each step (panels t = 1 to t = 4).

Collaborative networks are a prime example of evolving networks, where nodes represent authors and edges represent scientific collaborations. This is illustrated in Figure 1, which shows the plot of three phases of an instance of the scientific community in Slovenia [13] using ScienceAtlas, a web portal available at scienceatlas.ijs.si integrating data about 35272 researchers, 5716 projects, 82905 publications and 17190 video lectures. It also allows visualizing the collaboration and competences of the researchers [14].

In this paper, we focus on the comparison of the evolution of two (or more) given networks. Our approach considers topological data analysis (TDA), allowing us to encode the topological features of the corresponding evolving networks into diagrams, and to use standard methods to compute distances between them. In that, we can classify networks according to the distance between the topology of their evolution.

* University of Rijeka, Croatia; ** Quintelligence, Ljubljana, Slovenia
Category: G.2.2: Mathematics of Computing: Discrete Mathematics applications
Keywords: Network, undirected graph, persistent homology, computational topology, persistence diagram.

The TDA approach to the study of networks is not itself new; it has had several widespread applications, from collaboration networks [3] to functional brain networks [15].
Fig. 3. The computation of persistent homology on a simplicial complex changing in time [3]. The colors correspond to the topological features whose lifetime is tracked in the persistence barcode below; the Betti numbers of the six snapshots along the threshold axis are β0 = 2, 3, 2, 1, 1, 1 and β1 = 0, 0, 0, 0, 1, 0. In general, β0 indicates the number of connected components (dimension zero), β1 the number of holes (dimension one), and β2 the number of tunnels and voids (dimension two).

There are several ways of considering a height function in a network, including: (i) considering weights on the edges of the network (a weighted network) and then having the function built by thresholding those weights [17]; (ii) measuring the distance from each node to each other node by counting the minimal number of edges between them and then building the height function based on that distance [11]; among others. This permits us to use persistent homology over such a height function. Another possibility is to consider the maximal cliques as the simplicial complexes (named clique complexes) that feed the persistence algorithm and proceed with the computation directly over them [5]. We used the latter approach to compute the persistence of the networks generated for the purpose of this paper, as sketched below.
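As a minimal sketch of the clique-complex approach, assuming the NetworkX and GUDHI Python libraries are available (the paper itself used Perseus and the R TDA package): each phase's maximal cliques are inserted into a filtration with the phase index as filtration value, so a simplex is born at the first phase in which its clique appears.

```python
import networkx as nx
import gudhi

def clique_complex_persistence(snapshots):
    """snapshots: list of networkx graphs, one per phase of the
    evolving network."""
    st = gudhi.SimplexTree()
    for t, g in enumerate(snapshots, start=1):
        for clique in nx.find_cliques(g):   # maximal cliques of phase t
            # insert() keeps the earlier filtration value of an
            # already-present simplex, so each simplex records the
            # phase of its first appearance
            st.insert(sorted(clique), filtration=t)
    return st.persistence()                 # [(dim, (birth, death)), ...]
```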
B. Basic notions in persistent homology

Topology is the field of mathematics concerned with the qualitative aspects of an object. It focuses on the properties that are preserved through deformations, twistings, and stretchings of the given continuous objects in multidimensional scenarios. Computational topology takes advantage of simplification methods (e.g. the triangulation of a space) to permit the computation of topological invariants. One of those computations is homology, which evaluates the connectedness of, e.g., a network at each dimension separately. Thus, homology is a natural choice when it comes to the study of the topology of a network. Now, if we consider a monotone function describing the time variable in, e.g., an evolving network, we can track its homology changes. This notion is known as persistent homology and is rooted in TDA, allowing for retrieving the essential topological features of an object [2].

Formally, persistent homology computes the topological features of a growing sequence of spaces $\emptyset = X_0 \subseteq X_1 \subseteq \ldots \subseteq X_n = X$, known as a filtration of the space $X$. $H_i(X)$ is the $i$-th homology group of $X$, with an associated $i$-th Betti number $\beta_i$ corresponding to the measure of connectedness in the $i$-th dimension (cf. [5]). Using the inclusion maps $X_j \to X_{j+1}$ we can identify copies of $\mathbb{Z}_2$ in the homology groups $H_i(X_j)$ and $H_i(X_{j+1})$ of a filtration and track where the homology changes. We do that by recording when a new copy appears (i.e. "is born"), and when an existing copy persists or merges into an existing one (i.e. "dies"). The persistence of a topological feature is tracked by a lifetime bar (as shown in Figure 3) that can be equivalently represented by an ordered pair $(x, y)$, where $x$ is the birth time and $y$ is the death time. The multiset of all such points lives in the subset of the plane defined by $0 < x < y$; it encodes the topology of a space and is known as a persistence diagram. Several topological features can have the same lifetimes, and therefore some of the points in the persistence diagram are repeated in the multiset; we refer to their amount as multiplicity. We consider the infinite points on the diagonal as points of the persistence diagram with null lifetime.

The standard method to compare two persistence diagrams, called the bottleneck distance, measures the cost of finding a correspondence between their points. It identifies the closest matching elements of each persistence diagram and determines the global distance as the biggest of those distances. The cost of taking a point $p = (p_1, p_2)$ to a point $q = (q_1, q_2)$ in $\mathbb{R}^2$ is given by the $L_\infty$ norm $\|p - q\|_\infty = \max(|p_1 - q_1|, |p_2 - q_2|)$. Then, the bottleneck distance between persistence diagrams $X$ and $Y$ is computed by taking the infimum over all such matchings, i.e.,

$d_B(X, Y) = \inf_{\eta} \sup_{x \in X} \|x - \eta(x)\|_\infty$,

where the infimum is taken over all bijections $\eta$ from $X$ to $Y$. Each point with multiplicity $k$ in a multiset is interpreted as $k$ individual points, and the bijection is interpreted between the resulting sets [4].

C. The motivation of EVOSOFT

Nowadays, software systems are starting to interconnect to provide new and innovative applications and services, which drives new development opportunities in all domains. These software systems have therefore gradually evolved into large scale complex systems, and we lack models for their further management and evolution. One of the key aspects of such systems is the ability to model and predict their behaviour so as to achieve the required quality of operations and fulfill human expectations in all domains. In that, the project Evolving Software Systems: Analysis and Innovative Approaches for Smart Management (EVOSOFT) aims to understand how abstract software structures can be used to model global system properties (e.g. fault distributions). Understanding how to use software structure to model fault distributions can help us improve system reliability. EVOSOFT observes software structure as networks with nodes representing various software functions that are interconnected by function calls. In particular, a software graph structure considers nodes as program functions (e.g. classes in the object oriented paradigm, functions or modules in functional programming) and edges as function calls or signals transferred in communication among these program functions. EVOSOFT aims to observe how large software systems evolve from version to version, and to understand the relationship between the change in software structure during its evolution and the change in software fault distributions across its structure. Previous empirical studies in [9], [10], [18] show that communication structures among the program functions significantly influence system fault distributions. This is what motivated us to further explore this relationship.

To compare the evolution of networks we consider the distance between the corresponding persistence diagrams, using the bottleneck distance. This permits a fast computation of the distance between the (persistence diagrams representing the) topology of two evolving networks. In the case of the persistence diagrams encoding the topology of the evolving networks A, B and C represented in Figure 5, we get d(A, B) = 0 and d(B, C) = d(A, C) = 1. This discards the points with infinite persistence, which are less relevant when considering dimension-1 diagrams. The computations were done using the TDA package available in R [7]. In this example we can explore the distance between several possible evolutions of a network; it shows how TDA can contribute to a better understanding of the behavior of a certain network.
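The same comparison can be reproduced with the bottleneck distance implementation in the GUDHI Python bindings (the paper used the R package TDA). The three toy diagrams below are assumed from the complex-polynomial running example of Section IV, with infinite-persistence points already discarded:

```python
import gudhi

diag_A = [(1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]
diag_B = [(1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]  # same evolution as A
diag_C = [(1.0, 3.0), (2.0, 4.0), (2.0, 4.0)]  # point (2, 4) with multiplicity 2

print(gudhi.bottleneck_distance(diag_A, diag_B))  # 0.0
print(gudhi.bottleneck_distance(diag_A, diag_C))  # 1.0
```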
Fig. 4. The methodology diagram to encode and compare the topology of the evolution of two (or more) networks using TDA. It considers three phases: (i) the data, where we input the evolving network represented by an adjacency matrix; (ii) the topology, where a persistence diagram encodes the topological invariants of the network evolution; and (iii) the distance, held between persistence diagrams representing the topology of the given evolving networks.

Fig. 5. The presented methodology applied to the comparison of the evolution of three networks sharing the same evolution as the network represented in Figure 2 but with differences in phase 2. Each evolving network is associated with one persistence diagram that encodes its topology. This permits us to visualize the relevant topological features of the evolution of the networks.

II. USE-CASE METHODOLOGY TO ENCODE AND COMPARE EVOLVING NETWORKS

The problem of tracking and comparing the evolution of networks can be very demanding and complex due to the combinatorial properties of networks. In the following we describe the methodology to encode and compare the topology of the evolution of networks (as illustrated in Figure 4). It considers persistent homology to encode the topological features of the evolution of a network using persistence diagrams. In that, we first provide the evolving network given by one Boolean adjacency matrix for each phase of network development. We then compute the persistent homology of the evolving network by feeding the concatenated matrices to a suitable algorithm. This encodes the topology of each evolving network, representing it by one unique persistence diagram each. Finally, we measure the bottleneck distance between persistence diagrams to identify how close the evolving networks are to each other based on their topology.
For the purpose of this paper, we used the software library Perseus [16] to compute the homology of the evolving network represented in Figure 2, given by the graph's Boolean adjacency matrix. The network is provided to Perseus as a list of cliques including the time of appearance. The output of that procedure is a persistence diagram that corresponds to the topological changes within the evolution of that network. The evolving network A has the four stages illustrated in Figure 2; the evolving networks B and C are variations of the evolution of the end network of A, with different phases at time t = 2, as represented in Figure 5.

III. THE EVOSOFT EXPERIMENTS

For the purpose of this research we use the EVOSOFT motivation to generate networks that fit that scenario and allow us to compare the evolution of networks in that context. In these preliminary experiments we consider data representing the evolution of networks based on the empirical analysis of the evolution of complex software systems [8]. We generate networks with labeled nodes (not ordered pairs in $\mathbb{R}^2$) and extract all maximal cliques from them. The maximal cliques serve to construct the clique complexes with which we are later able to compute the topology of those networks. We obtain the EVOSOFT evolving networks through their graph's Boolean adjacency matrix. Those matrices must be consistent with the evolution of the network in the sense that existing maximal cliques in phase i must be maintained or enlarged in phase i + 1 during the updates of a software version. The persistence diagrams computed by Perseus then exhibit the encoded topology of evolving networks corresponding to different pieces of software.

The comparison between the topology of a pair of evolving networks given by their adjacency matrices is given by the bottleneck distance between the corresponding diagrams. That distance can be computed using the R library [7]. When considering other evolving networks we can calculate the pairwise distance between all of them and consider single linkage clustering based on this metric (as in earlier TDA applications to gene expression data, as in [12]) to allow classification based on the topology of network evolution.

IV. COMPARISON THROUGH COMPLEX VECTORS

A possible algebraic representation of persistence diagrams is offered by complex polynomials. The method laid out in [6] can avoid tedious and less meaningful computations of the bottleneck distance, since far polynomials represent far persistence diagrams (the converse is known not to be true). A fast comparison of the coefficient vectors can reduce the size of the database to be classified by the bottleneck distance; we can then focus on close persistence diagrams for which we want to calculate precise measures. This should complement existing methods, raising the efficiency of computations for large evolving networks. Given a persistence diagram D described by its points $p_1 = (u_1, v_1), \ldots, p_s = (u_s, v_s)$ with multiplicities $r_1, \ldots, r_s$, respectively, the method considers the complex numbers $z_1 = u_1 + iv_1, \ldots, z_s = u_s + iv_s$. This allows us to associate to D the complex polynomial $f_D(t) = \prod_{j=1}^{s} (t - z_j)^{r_j}$, where $r_j$ is the multiplicity of the point $p_j$. It was shown in [6] that the first k coefficients are the ones carrying most of the relevant information and, therefore, the choice of a threshold k can reduce the computational complexity.
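Since the two-part algorithm of [6] referenced below is unpublished, the following is only our reconstruction of the encoding from the formula above: take each diagram point as a complex root with its multiplicity, expand the polynomial, and compare truncated coefficient vectors. The toy values match the running example discussed next.

```python
import numpy as np

def complex_vector(points, multiplicities, k=None):
    """Coefficients of f_D(t) = prod_j (t - z_j)^(r_j) with
    z_j = u_j + i*v_j, truncated to the first k coefficients."""
    roots = [complex(u, v)
             for (u, v), r in zip(points, multiplicities)
             for _ in range(r)]
    coeffs = np.poly(roots)            # highest-degree coefficient first
    return coeffs[:k] if k else coeffs

# Running example: p_A = p_B has roots 1+3i, 2+4i, 3+4i;
# p_C has root 1+3i and the double root 2+4i.
vec_A = complex_vector([(1, 3), (2, 4), (3, 4)], [1, 1, 1])
vec_C = complex_vector([(1, 3), (2, 4)], [1, 2])
print(np.linalg.norm(vec_A - vec_C))   # fast coefficient-space comparison
```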
Persistent homology of collaboration A = (t − 1 − 3i)(t − 2 − networks. Mathematical problems in engineering, 2013. 4i)(t − 3 − 4i) = pB and pC = (t − 1 − 3i)(t − 2 − 4i)2, not [4] H. Edelsbrunner and J. Harer. Persistent homology-a survey. Contem- considering points of infinite persistence. We then develop the porary mathematics, 453:257–282, 2008. polynomials to identify their coefficients into a complex vector. [5] H. Edelsbrunner and J. L. Harer. Computational Topology. American The distance between the three complex vectors corresponds Mathematical Society, Providence, RI, 2010. to a basic classification of the given evolving networks. This [6] B. D. Fabio and M. Ferri. Comparing persistence diagrams through is not a dense case where we would need additional tools complex vectors. In International Conference on Image Analysis and like complex vectors. Though, real life examples of evolving Processing, pages 294–305, 2015. networks are appropriate cases of such needs due to their [7] B. T. Fasy, J. Kim, F. Lecci, and C. Maria. Introduction to the R package inherent complexity. TDA. arXiv:1411.1830, 2014. [8] T. Galinac and S. Golubić. Project overlapping and its influence on the V. CONCLUSIONS AND FURTHER WORK product quality. In Proceedings of the 8th International Conference on Telecommunications, 2005. ConTEL 2005., volume 2, pages 655–662, In this paper we have discussed the topological data analysis June 2005. of evolving networks. In that we presented a method to encode [9] T. G. Grbac and D. Huljenic. On the probability distribution of faults the topology of the evolution of a given network through in complex software systems. Information & Software Technology, a persistence diagram, and its potential for a classification 58:250–258, 2015. based on a chosen distance between diagrams. The inherent [10] T. G. Grbac, P. Runeson, and D. Huljenic. A second replicated complexity of an evolving network demands for the data quantitative analysis of fault distributions in complex software systems. simplification methods to be available and appropriate to IEEE Trans. Software Eng., 39(4):462–476, 2013. the nature of the considered object. In that, the TDA-based [11] D. Horak, S. Maletić, and M. Rajković. Persistent homology of complex networks. Journal of Statistical Mechanics: Theory and Experiment, methodology in this paper can contribute to the analysis and 2009(3):P03034, 2005. interpretation of evolving networks and their behaviour. The [12] M. Juda. Topological structures in gene expression data. unpuplished experiments in real data are valuable to improve this method. work presented at the Genetic Analysis Workshop 19, 2014. In that, the collaborations with the earlier mentioned Slovenian [13] M. Karlovčec, B. Lužar, and D. Mladenić. Core-periphery dynamics Science Atlas would be welcome, allowing us to further in collaboration networks: the case study of slovenia. Scientometrics, explore the interpretation of the topology of the evolution of 109.3:1561–1578, 2016. these collaborative networks and the distance between them. [14] M. Karlovčec, D. Mladenić, M. Grobelnik, and M. Jermol. Conceptual- Further work includes the processing of EVOSOFT existing ization of science using collaboration and competences. The Electronic Library, 34.1:2–23, 2016. networks, as well as the interpretation of results in the context [15] H. Lee, M. K. Chung, H. Kang, B.-N. Kim, and D. S. Lee. Dis- of that field of knowledge. 
It can provide new challenges specific to the available data and to its role and usage in the field. In particular, the interpretation of the persistent topological features captured in the EVOSOFT experiments represents a relevant open problem that requires a deeper analysis based on the EVOSOFT expertise and the manipulation of the topological results. Lastly, the mathematical development of the complex vector method, which contributes to the study of evolving networks in general, is a rather computational method that is suitable to the application of compatible algorithms, allowing potential engineering applications. Moreover, it is itself a great source of open mathematical problems that we shall consider in further research (e.g. stability [5]).

ACKNOWLEDGMENT

The authors would like to thank Barbara di Fabio for the useful discussions on complex vectors and advice on further research, and Primož Škraba for his comments and suggestions. The first and second authors would like to thank the Croatian Science Foundation for funding the project EVOSOFT (UIP-2014-09-7945). The second author would also like to thank the University of Rijeka for support through Research Grant 13.09.2.2.16.

REFERENCES
[1] A. L. Barabási. Network medicine - from obesity to the diseasome. New England Journal of Medicine, 357(4):404–407, 2007.
[2] G. Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.
[3] C. J. Carstens and K. J. Horadam. Persistent homology of collaboration networks. Mathematical Problems in Engineering, 2013.
[4] H. Edelsbrunner and J. Harer. Persistent homology - a survey. Contemporary Mathematics, 453:257–282, 2008.
[5] H. Edelsbrunner and J. L. Harer. Computational Topology. American Mathematical Society, Providence, RI, 2010.
[6] B. Di Fabio and M. Ferri. Comparing persistence diagrams through complex vectors. In International Conference on Image Analysis and Processing, pages 294–305, 2015.
[7] B. T. Fasy, J. Kim, F. Lecci, and C. Maria. Introduction to the R package TDA. arXiv:1411.1830, 2014.
[8] T. Galinac and S. Golubić. Project overlapping and its influence on the product quality. In Proceedings of the 8th International Conference on Telecommunications, ConTEL 2005, volume 2, pages 655–662, June 2005.
[9] T. G. Grbac and D. Huljenic. On the probability distribution of faults in complex software systems. Information & Software Technology, 58:250–258, 2015.
[10] T. G. Grbac, P. Runeson, and D. Huljenic. A second replicated quantitative analysis of fault distributions in complex software systems. IEEE Trans. Software Eng., 39(4):462–476, 2013.
[11] D. Horak, S. Maletić, and M. Rajković. Persistent homology of complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2009(3):P03034, 2009.
[12] M. Juda. Topological structures in gene expression data. Unpublished work presented at the Genetic Analysis Workshop 19, 2014.
[13] M. Karlovčec, B. Lužar, and D. Mladenić. Core-periphery dynamics in collaboration networks: the case study of Slovenia. Scientometrics, 109.3:1561–1578, 2016.
[14] M. Karlovčec, D. Mladenić, M. Grobelnik, and M. Jermol. Conceptualization of science using collaboration and competences. The Electronic Library, 34.1:2–23, 2016.
[15] H. Lee, M. K. Chung, H. Kang, B.-N. Kim, and D. S. Lee. Discriminative persistent homology of brain networks. IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pages 841–844, 2011.
[16] V. Nanda. Perseus, the persistent homology software. http://www.sas.upenn.edu/~vnanda/perseus. Accessed: 2017-09-05.
[17] G. Petri, M. Scolamiero, I. Donato, and F. Vaccarino. Topological strata of weighted complex networks. PLoS ONE, 8(6):e66506, 2013.
[18] J. Petric and T. G. Grbac. Software structure evolution and relation to system defectiveness. In 18th International Conference on Evaluation and Assessment in Software Engineering, 2014, pages 34:1–34:10, 2014.

Improving mortality prediction for intensive care unit patients using text mining techniques

Primož Kocbek1, Nino Fijačko1, Milan Zorman2, Simon Kocbek1,3, Gregor Štiglic1,2
1 Univerza v Mariboru, Fakulteta za zdravstvene vede, Maribor, Slovenija, +386 2 300 47 00
2 Univerza v Mariboru, Fakulteta za elektrotehniko, računalništvo in informatiko, Maribor, Slovenija, +386 2 220 7000
3 Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, Australia, +61 (02) 9295 8100
{primoz.kocbek, nino.fijacko, milan.zorman, gregor.stiglic}@um.si, skocbek@gmail.com

ABSTRACT
Numerous severity assessment scores for the estimation of in-hospital mortality in the Intensive Care Unit (ICU) have been developed over the last 40 years. In this study, we predicted 1-month mortality in chronic kidney disease (CKD) patients using the open Medical Information Mart for Intensive Care III (MIMIC III) database. Additionally, we observed the improvement in predictive performance and interpretability from the baseline model used in ICUs to more complex models using simple features such as unigrams or bigrams, as well as advanced features extracted from textual nursing notes. For the latter, the MetaMap extraction tool was used to extract medical concepts based on the Unified Medical Language System (UMLS) terminology. We used a logistic regression based classifier, built using the Simplified Acute Physiology Score II (SAPS II), age and gender, as a baseline model. The baseline model was then compared to regularized logistic regression based classifiers built using simple and more complex additional features. The Area Under the ROC Curve (AUC) for the baseline improved from 0.761 to 0.782 when frequencies of unigrams and bigrams were used to build the model. In a similar scenario, where unigram and bigram frequency was replaced with Term Frequency-Inverse Document Frequency (TF-IDF) based feature values, the AUC further increased to 0.786.

This paper represents an opportunity for extracting new knowledge in the form of unigrams, bigrams or concepts extracted from textual notes, accompanied by regression coefficient values that can be interpreted as relations between the features and the outcome. The combination of both can provide added value in decision support systems in ICU departments, where data is collected in electronic medical records (EMRs) in real time.
Categories and subject descriptors
H.2.6 [Information Systems]: Database Machines
H.2.8 [Information Systems]: Database Applications

General Terms
Algorithms, Measurement, Documentation, Performance, Reliability, Experimentation.

Keywords
Text mining, ICU, database, machine learning, mortality prediction.

1. INTRODUCTION
Predicting the mortality of ICU patients is a complex and dynamic process. Critical illness severity assessment scores, such as the Acute Physiology and Chronic Health Evaluation I-IV (APACHE), the Sequential Organ Failure Assessment score (SOFA), the Mortality Probability Model I-III (MPM), or the Simplified Acute Physiology Score I-III (SAPS), help clinicians detect patient problems earlier, thus providing a better holistic treatment for patients and making patient care more cost-effective. The sheer number of different severity scores in use is partly due to the quality of recorded data needed to calculate them. An example of such a severity score is APACHE IV, which tends to have the best discriminative performance, but the data needed to compute the score is complex and hospitals would need to develop a good enough high-quality database for analysis of risk stratification [1, 2].

The MIMIC III database [3], a free public-access intensive care unit repository, is widely used for predicting the mortality of ICU patients. Its developers provided several severity scores for the database (e.g., OASIS, SAPS, SAPS II, SOFA), but also noted that for APACHE IV the coding of the diagnostic component is difficult and might lack accuracy [4]. Part of MIMIC III are free-text nursing notes, which represent a good candidate source of information for mortality risk prediction, as they contain a detailed and regularly updated record of the interventions performed, medications administered, vital signs, and physical examination findings, all of which carry highly specific information about the patient's dynamic physiological state and eventual outcome [5]. Because such data is unstructured, our purpose is knowledge discovery, where we observe the improvements in predictive performance and interpretability of predictive models based on additional features extracted from nursing notes collected in EMRs. More precisely, we aim to predict one-month mortality in CKD (ICD-9 code 585.x) patients and compare the improvement over the baseline model performance obtained by including simple features such as unigrams or bigrams, as well as more advanced features extracted from text in the form of medical concepts, using MetaMap [6] to define the mapping between textual notes and UMLS terminology.

2. METHODS
The data were obtained from the MIMIC III database, version 1.3, selecting 58,976 hospitalizations of 46,520 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center in Boston between 2001 and 2012. The database included 26 linked tables, which can be merged mostly by patient or hospitalization identification numbers. Our focus were the nursing notes (i.e., free-text notes from patients with a CKD diagnosis), where we excluded hospitalizations of patients that died within 24 hours of admission and nursing notes that were not fully updated, where duplication of data was likely. That left us with 10,867 nursing notes from 4,381 hospitalizations.
The first nursing note was taken on average 7.8 hours after admission (85.2% of hospitalizations have at least two notes), the second on average 14.7 hours after admission (38.1% had at least 3 taken), and the third on average 17.5 hours after admission. A slight majority of the hospitalized patients were male (59.4%), with an average age of 65.6 (standard deviation (SD) 15.2) and a 13.4% mortality rate (death during or up to one month after the hospitalization was recorded). The developers of the database also included source code for the calculation of several severity scores (i.e., OASIS, SAPS, SAPS II and SOFA), and we selected SAPS II as the main feature in the baseline model, since it is used on a daily basis in hospitals. The input features of our baseline model consisted of the SAPS II score, age and gender.

Nursing notes were initially processed using traditional text extraction algorithms with stemming and removal of stop words, which produced 51,680 unique unigrams and 363,055 unique bigrams. Both frequency and TF-IDF tables were prepared. The text from the nursing notes was also processed using the MetaMap tool from the US National Library of Medicine. MetaMap identifies and normalizes biomedical terminology from the Unified Medical Language System (UMLS). Binary representations of Bags of Phrases (BOP) identified by MetaMap, and their UMLS Concept Unique Identifiers (CUIs), were used as features for the classifier. Space characters in phrases were replaced with the underline character. Word sense disambiguation was used to distinguish similar words with the same structure.
In addition, phrases were marked according to whether the associated concepts were found in a positive or negative context. To identify the polarity of phrases (negative or positive), the NegEx module [7] of MetaMap was enabled. NegEx implements a simple algorithm that contains several regular expressions indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases.

For better understanding, we provide a short example of MetaMap-annotated phrases from part of the sentence "URINE MICROSCOPY" (Table 1). The meta candidates are all discovered mappings, ordered according to an evaluation metric described in [7], while the meta mappings represent the selected phrases which finally represent features in our model. Please note that several meta mappings may be found in a sentence. Parentheses contain the concept's preferred name, while square brackets contain the concept's semantic type.

Table 1. Example of MetaMap-annotated phrases from part of the sentence "URINE MICROSCOPY"
Meta Candidates (Score - Matched concept):
1000 - C0430397: Urine microscopy (Microscopic urinalysis) [Laboratory Procedure]
861 - C0026018: Microscopy [Laboratory Procedure]
789 - C0205288: Microscopic [Qualitative Concept]
694 - C0042036: Urine [Body Substance]
694 - C0042037: Urine (In Urine) [Functional Concept]
694 - C2963137: Urine (Portion of urine) [Body Substance]
Meta Mapping (Score - Matched concept):
1000 - C0430397: Urine microscopy (Microscopic urinalysis) [Laboratory Procedure]

One of our goals was interpretability and avoidance of over-fitting; we therefore restricted model building to regularized linear models, and further narrowed the selection to L1-regularized models, i.e., the least absolute shrinkage and selection operator (lasso), which includes the feature selection functionality needed due to the high number of features in our datasets [8]. We expanded on the work of Marafino et al. [4], who used the MIMIC II dataset and predicted mortality via stochastic gradient descent-based classifiers with TF-IDF on the extracted unigrams and bigrams for patients that died during the given ICU stay. All experiments were implemented in the R language and environment for statistical computing [9], using the glmnet package [10] to build and validate the predictive models.
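The models were fit with R's glmnet; as an illustrative equivalent, the sketch below builds unigram/bigram TF-IDF features and an L1-penalized (lasso) logistic regression in Python with scikit-learn. The regularization strength, stop-word handling and the absence of stemming are our simplifications, not the authors' settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # CountVectorizer: frequency variant
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# notes: list of nursing-note strings; y: one-month mortality labels (assumed inputs)
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(notes, y)

# The nonzero lasso coefficients are the selected unigrams/bigrams and can
# be read as relations between the features and the outcome.
coef = model.named_steps["logisticregression"].coef_.ravel()
terms = model.named_steps["tfidfvectorizer"].get_feature_names_out()
selected = [(t, c) for t, c in zip(terms, coef) if c != 0]
```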
3. RESULTS
The results presented in this section were obtained from four scenarios in which we compared different combinations of two types of extracted features (n-grams versus concepts) and two types of extracted feature values (frequency vs. TF-IDF). The baseline classifier with basic features (SAPS II score, age and gender) was used to evaluate the performance gain when more complex classifiers were built. Initially, we were interested in measuring the improvement of the baseline predictive performance when adding unigram and bigram features to SAPS II, age and gender of the patients. At the same time, we were observing the complexity of the predictive models through the number of features included in the models.

The second row in Table 2 presents the results when frequencies of unigrams and bigrams were used to build the model. It can be seen that the baseline predictive performance improved from an AUC of 0.761 to 0.782. Alongside the improvement of more than 2% in AUC, it is also important to note that in this scenario we obtained the simplest models in terms of interpretation, with only 9 features on average. A similar scenario, where unigram and bigram frequency was replaced with TF-IDF based feature values, resulted in a further improvement, with an AUC of 0.786. In the next two experiments, we replaced unigrams and bigrams with UMLS concepts extracted from the free text (nursing notes). Table 2 demonstrates a further improvement of the classification performance, as the AUC in the case of TF-IDF increased to 0.789, while an even further improvement, with a mean AUC of 0.791, was measured in the frequency-based features experiment. Tables 3 and 4 provide a more detailed overview of the selected features, along with the number of times a feature was included in a predictive model during 100 cross-validation runs. It can be seen that TF-IDF produced predictive models with a larger number of features and therefore represents a richer set of concepts that can be used to warn a medical expert of a potential threat to a patient. On the other hand, a complex model (in the case of the TF-IDF experiment, more than 35 features were used in a model on average) might represent a challenge for medical experts when interpretation of the models is needed.

Table 2. Summary of predictive performance measures for different experiments using features extracted from nursing notes (intervals in brackets)
Experiment | AUC | Sensitivity | Specificity | PPV | NPV | Selected features
Baseline (SAPS II) | 0.761 [0.757-0.766] | 0.712 [0.704-0.720] | 0.687 [0.680-0.695] | 0.283 [0.277-0.288] | 0.933 [0.931-0.935] | 1.0 [1.0-1.0]
Unigrams and bigrams (frequency) | 0.782 [0.778-0.786] | 0.727 [0.721-0.734] | 0.714 [0.707-0.722] | 0.306 [0.300-0.313] | 0.939 [0.937-0.940] | 9.1 [6.6-11.6]
UMLS concept mapping (frequency) | 0.791 [0.787-0.795] | 0.736 [0.728-0.745] | 0.716 [0.708-0.724] | 0.310 [0.304-0.316] | 0.941 [0.939-0.943] | 17.6 [14.8-20.5]
Unigrams and bigrams (TF-IDF) | 0.786 [0.782-0.790] | 0.733 [0.725-0.740] | 0.712 [0.704-0.720] | 0.306 [0.300-0.313] | 0.940 [0.938-0.941] | 25.1 [21.7-28.6]
UMLS concept mapping (TF-IDF) | 0.789 [0.785-0.793] | 0.747 [0.741-0.754] | 0.700 [0.692-0.709] | 0.302 [0.296-0.308] | 0.942 [0.940-0.943] | 35.5 [31.0-40.0]
4. DISCUSSION AND CONCLUSIONS
In this paper, we observed the improvements in predictive performance and interpretability of predictive models based on new features extracted from nursing notes collected in EMRs. More precisely, we predicted one-month mortality, at the end of 24 hours spent in the ICU, for CKD patients. The improvement of the baseline model (SAPS II, gender and age) in comparison to predictive models that included unigrams and bigrams, as well as more advanced features extracted from text in the form of medical concepts using the MetaMap extraction tool, was also observed. The results show a high level of predictive performance that can be compared to a similar study by Brabrand et al. [11], where it was shown that using the clinical intuition of the admission staff produced comparable predictions in terms of AUC when identifying patients at risk of dying. However, it has to be noted that Brabrand and colleagues did not focus on a specific group of patients.

When providing predictive models for healthcare experts to support their work in clinical practice, we should also pay attention to the interpretability of such models. As already noted in [12], in case of similar predictive performance on the training set, the simplest models often also perform best on the test set. Therefore, we should take the complexity of models with similar performance into account. In the case of our study, the complexity of the four proposed models ranges from 9 up to approximately 35 selected features. In the case of the unigram and bigram based models, it would perhaps make sense to use the simpler, frequency-based model, as it does not significantly differ in predictive performance while having significantly lower complexity.

Table 3. Frequency of specific features selected in the UMLS concept mapping (frequency) experiment
Single_count_all_CKD_all_feat - N
DNR_(DNR_-_Do_not_resuscitate)_[Finding] - 100
Map_(Functional_Map)_[Conceptual_Entity] - 100
SAPSII - 100
Meeting_(Meetings)_[Health_Care_Activity] - 100
Coccyx_(Entire_coccyx)_[Body_Part_Organ_or_Organ_Component] - 93
PICC_line_(Peripherally_inserted_central_catheter_(physical_object))_[Medical_Device] - 86
Anuria_[Disease_or_Syndrome] - 85
CMO_(Chronic_multifocal_osteomyelitis)_[Disease_or_Syndrome] - 82
Family_[Family_Group] - 70
vascular_(Blood_Vessel)_[Body_Part_Organ_or_Organ_Component] - 51
Bilirubin_[Biologically_Active_Substance,Organic_Chemical] - 45
Prognosis_(Forecast_of_outcome)_[Health_Care_Activity] - 45
error_[Qualitative_Concept] - 35
Necrotic_(Necrosis)_[Organ_or_Tissue_Function] - 34
Pleural_effusion_(Pleural_effusion_fluid)_[Body_Substance] - 28
Poor_prognosis_(Prognosis_bad)_[Finding] - 28
Thick_[Qualitative_Concept] - 28
Hypotensive_[Pathologic_Function] - 27
dysfunction_(physiopathological)_[Functional_Concept] - 25
Brain_[Body_Part_Organ_or_Organ_Component] - 23

Table 4. Frequency of specific features selected in the UMLS concept mapping (TF-IDF) experiment
Single_tfidf_all_CKD_all_feat - N
DNR_(DNR_-_Do_not_resuscitate)_[Finding] - 100
SAPSII - 100
Meeting_(Meetings)_[Health_Care_Activity] - 98
CMO_(Chronic_multifocal_osteomyelitis)_[Disease_or_Syndrome] - 98
PICC_line_(Peripherally_inserted_central_catheter_(physical_object))_[Medical_Device] - 97
Family_[Family_Group] - 96
Coccyx_(Entire_coccyx)_[Body_Part_Organ_or_Organ_Component] - 88
Heels_(Heel)_[Body_Location_or_Region] - 88
Pressors_[Pharmacologic_Substance] - 83
Map_(Functional_Map)_[Conceptual_Entity] - 74
Levophed_[Organic_Chemical,Pharmacologic_Substance] - 73
loosen_(Loosening)_[Functional_Concept] - 70
neg_DNR_(DNR_-_Do_not_resuscitate)_[Finding] - 68
Worsening_(Worse)_[Qualitative_Concept] - 68
error_[Qualitative_Concept] - 67
dysfunction_(physiopathological)_[Functional_Concept] - 64
Bilirubin_[Biologically_Active_Substance,Organic_Chemical] - 62
Anasarca_[Pathologic_Function] - 59
Coccyx_(Bone_structure_of_coccyx)_[Body_Part_Organ_or_Organ_Component] - 51
Poor_prognosis_(Prognosis_bad)_[Finding] - 50

From the most frequently selected features (Tables 3 and 4) we can observe some very general concepts, like "do not resuscitate (DNR)" and a high SAPS II score, indicating higher mortality. Also, family related concepts can be easily interpreted by the fact that physicians usually call family members to discuss the severity of the situation, especially when the situation is critical or life threatening. Additional features indicating higher mortality are concepts related to bones like "coccyx" and "heel", which could indicate specific problems related to calcification, frequently observed in CKD patients. Medical terms such as "peripherally inserted central catheter", "chronic multifocal osteomyelitis" and the use of pressors (pharmacologic substance) could be interpreted as signs of a worsening health situation.
We plan to investigate these conclusions in more detail in future work. It is also interesting to note that the UMLS frequency model selected the medical term "Anuria" (non-passage, or passage of less than 100 milliliters, of urine a day), which was not selected in the UMLS TF-IDF model.

Further development of our model will include extensions where a shorter (e.g., 6 or 12 hours) period would be used to provide an "early warning" signal to healthcare experts working in the ICU. Additional features can be extracted from MIMIC-III that would further improve the predictive performance and possibly also the interpretability of the models.

5. ACKNOWLEDGEMENT
The authors would like to acknowledge financial support from the Slovenian Research Agency (research core funding No. P2-0057 and bilateral grant ARRS-BI-US/16-17-064).

6. REFERENCES
[1] Keegan, M. T., Gajic, O., and Afessa, B. 2012. Comparison of APACHE III, APACHE IV, SAPS 3, and MPM0III and influence of resuscitation status on model performance. Chest, 142, 4 (April 2012), 851–858. DOI=10.1378/chest.11-2164.
[2] Wu, V. C., Tsai, H. B., Yeh, Y. C., Huang, T. M., Lin, Y. F., Chou, N. K., ... and Wu, M. S. 2010. Patients supported by extracorporeal membrane oxygenation and acute dialysis: acute physiology and chronic health evaluation score in predicting hospital mortality. Artificial Organs, 34, 10 (May 2010), 828–835. DOI=10.1111/j.1525-1594.2009.00920.x.
[3] Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A. and Mark, R. G., 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3.
[4] Marafino, B. J., Boscardin, W. J., and Dudley, R. A. 2015. Efficient and sparse feature selection for biomedical text classification via the elastic net: Application to ICU risk stratification from nursing notes. Journal of Biomedical Informatics, 54 (April 2015), 114–120. DOI=10.1016/j.jbi.2015.02.003.
[5] MIT Laboratory for Computational Physiology - mimic-code: 2017. https://github.com/MIT-LCP/mimic-code/tree/master/concepts/severityscores. Accessed: 2017-09-04.
[6] MetaMap: Mapping text to the UMLS Metathesaurus. 2006. https://pdfs.semanticscholar.org/e262/a22134cca0e484be1095160cc4ec8d9e7624.pdf. Accessed: 2017-09-04.
[7] Chapman, W. W., Bridewell, W., Hanbury, P., Cooper, G. F., and Buchanan, B. G. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34, 5 (October 2001), 301–310. DOI=10.1006/jbin.2001.1029.
[8] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288.
[9] R Core Team, 2013. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
[10] Friedman, J., Hastie, T. and Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), p. 1.
[11] Brabrand, M., Hallas, J., and Knudsen, T. 2014. Nurses and physicians in a medical admission unit can accurately predict mortality of acutely admitted patients: a prospective cohort study. PLoS ONE, 9, 7 (July 14), e101739. DOI=10.1371/journal.pone.0101739.
[12] Stiglic, G., Kocbek, S., Pernek, I., and Kokol, P. 2012. Comprehensive decision tree models in bioinformatics. PLoS ONE, 7, 3 (March 30), e33812. DOI=10.1371/journal.pone.0033812.

Indeks avtorjev / Author index

Belyaeva Evgenia 15
Brank Janez 7
Doyle Casey .......... 27
Fijacko Nino .......... 47
Fortuna Blaz .......... 35
Fuart Flavio .......... 15
Galinac Grbac Tihana .......... 43
Grobelnik Marko .......... 7, 15
Herga Zala .......... 27
Jovanoski Viktor .......... 35
Karlovcec Mario .......... 35
Kenda Klemen .......... 39
Kladnik Matic .......... 23
Kocbek Primoz .......... 47
Leban Gregor .......... 7, 15
Mladenic Dunja .......... 11, 23, 39
Moore Pat .......... 27
Novak Erik .......... 31
Novalija Inna .......... 31
Pita Costa Joao .......... 15, 43
Pollak Senja .......... 19
Repar Andraz .......... 19
Rupnik Jan .......... 35
Torkar Miha .......... 11

Konferenca / Conference
Odkrivanje znanja in podatkovna skladišča - SiKDD /
Data Mining and Data Warehouses - SiKDD

Uredila / Edited by
Dunja Mladenić, Marko Grobelnik