Zbornik 20. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2017, Zvezek C
Proceedings of the 20th International Multiconference INFORMATION SOCIETY – IS 2017, Volume C

Odkrivanje znanja in podatkovna skladišča – SiKDD
Data Mining and Data Warehouses – SiKDD

Uredila / Edited by: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si
9.–13. oktober 2017 / 9–13 October 2017, Ljubljana, Slovenia

Urednika:
Dunja Mladenić, Laboratorij za umetno inteligenco, Institut »Jožef Stefan«, Ljubljana
Marko Grobelnik, Laboratorij za umetno inteligenco, Institut »Jožef Stefan«, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič

Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2017

Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID=292473088
ISBN 978-961-264-114-6 (pdf)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2017

Multikonferenca Informacijska družba (http://is.ijs.si) je z dvajseto zaporedno prireditvijo osrednji srednjeevropski dogodek na področju informacijske družbe, računalništva in informatike. Letošnja prireditev je ponovno na več lokacijah, osrednji dogodki pa so na Institutu »Jožef Stefan«.

Informacijska družba, znanje in umetna inteligenca so spet na razpotju tako same zase kot glede vpliva na človeški razvoj. Se bo eksponentna rast elektronike po Moorovem zakonu nadaljevala ali stagnirala? Bo umetna inteligenca nadaljevala svoj neverjetni razvoj in premagovala ljudi na čedalje več področjih in s tem omogočila razcvet civilizacije, ali pa bo eksponentna rast prebivalstva zlasti v Afriki povzročila zadušitev rasti? Čedalje več pokazateljev kaže v oba ekstrema – da prehajamo v naslednje civilizacijsko obdobje, hkrati pa so planetarni konflikti sodobne družbe čedalje težje obvladljivi.

Letos smo v multikonferenco povezali dvanajst odličnih neodvisnih konferenc. Predstavljenih bo okoli 200 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica, ki se ponaša s 40-letno tradicijo odlične znanstvene revije. Odlične obletnice!

Multikonferenco Informacijska družba 2017 sestavljajo naslednje samostojne konference:
- Slovenska konferenca o umetni inteligenci
- Soočanje z demografskimi izzivi
- Kognitivna znanost
- Sodelovanje, programska oprema in storitve v informacijski družbi
- Izkopavanje znanja in podatkovna skladišča
- Vzgoja in izobraževanje v informacijski družbi
- Četrta študentska računalniška konferenca
- Delavnica »EM-zdravje«
- Peta mednarodna konferenca kognitonike
- Mednarodna konferenca za prenos tehnologij – ITTC
- Delavnica »AS-IT-IC«
- Robotika

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS).
V imenu organizatorjev konference se zahvaljujemo združenjem in inštitucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

V 2017 bomo petič podelili nagrado za življenjske dosežke v čast Donalda Michija in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe bo prejel prof. dr. Marjan Krisper. Priznanje za dosežek leta bo pripadlo prof. dr. Andreju Brodniku. Že šestič podeljujemo nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono je dobilo padanje slovenskih sredstev za akademsko znanost, tako da smo sedaj tretji najslabši po tem kriteriju v Evropi, jagodo pa »e-recept«. Čestitke nagrajencem!

Bojan Orel, predsednik programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD – INFORMATION SOCIETY 2017

In its 20th year, the Information Society Multiconference (http://is.ijs.si) remains one of the leading conferences in Central Europe devoted to the information society, computer science and informatics. In 2017 it is organized at various locations, with the main events at the Jožef Stefan Institute.

The pace of progress of the information society, knowledge and artificial intelligence is speeding up, and it seems we are again at a turning point. Will the progress of electronics continue according to Moore's law or will it start stagnating? Will AI continue to outperform humans at more and more activities and in this way enable the predicted unseen human progress, or will the growth of the human population, in particular in Africa, cause global decline? Both extremes seem more and more likely – fantastic human progress, and planetary decline caused by humans destroying our environment and each other.

The Multiconference runs in parallel sessions with 200 presentations of scientific papers at twelve conferences, round tables, workshops and award ceremonies. Selected papers will be published in the Informatica journal, which has a 40-year tradition as an excellent research publication. These are remarkable achievements.

The Information Society 2017 Multiconference consists of the following conferences:
- Slovenian Conference on Artificial Intelligence
- Facing Demographic Challenges
- Cognitive Science
- Collaboration, Software and Services in Information Society
- Data Mining and Data Warehouses
- Education in Information Society
- 4th Student Computer Science Research Conference
- Workshop Electronic and Mobile Health
- 5th International Conference on Cognitonics
- International Conference of Transfer of Technologies – ITTC
- Workshop »AS-IT-IC«
- Robotics

The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e. the Slovenian chapter of the ACM, SLAIS, DKZ and the second national engineering academy, the Slovenian Engineering Academy.

In the name of the conference organizers we thank all the societies and institutions, and particularly all the participants, for their valuable contributions and their interest in this event, and the reviewers for their thorough reviews. For the fifth year, the award for life-long outstanding contributions will be delivered in memory of Donald Michie and Alan Turing.
The Michie-Turing award will be given to Prof. Marjan Krisper for his life-long outstanding contribution to the development and promotion of the information society in our country. In addition, an award for current achievements will be given to Prof. Andrej Brodnik. The information lemon goes to the national funding of academic science, which has dropped Slovenia to the third worst position in Europe. The information strawberry is awarded to the medical e-prescription (»e-recept«) project. Congratulations!

Bojan Orel, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa
Heiner Benking, Germany
Se Woo Cheon, South Korea
Howie Firth, UK
Olga Fomichova, Russia
Vladimir Fomichov, Russia
Vesna Hljuz Dobric, Croatia
Alfred Inselberg, Israel
Jay Liebowitz, USA
Huan Liu, Singapore
Henz Martin, Germany
Marcin Paprzycki, USA
Karl Pribram, USA
Claude Sammut, Australia
Jiri Wiedermann, Czech Republic
Xindong Wu, USA
Yiming Ye, USA
Ning Zhong, USA
Wray Buntine, Australia
Bezalel Gavish, USA
Gal A. Kaminka, Israel
Mike Bain, Australia
Michela Milano, Italy
Derong Liu, Chicago, USA
Toby Walsh, Australia

Organizing Committee:
Matjaž Gams, chair
Mitja Luštrek
Lana Zemljak
Vesna Koricki
Mitja Lasič
Robert Blatnik
Aleš Tavčar
Blaž Mahnič
Jure Šorn
Mario Konecki

Programme Committee:
Bojan Orel, chair
Franc Solina, co-chair
Viljan Mahnič, co-chair
Cene Bavec, co-chair
Tomaž Kalin, co-chair
Jozsef Györkös, co-chair
Tadej Bajd
Jaroslav Berce
Mojca Bernik
Marko Bohanec
Ivan Bratko
Andrej Brodnik
Dušan Caf
Saša Divjak
Tomaž Erjavec
Bogdan Filipič
Andrej Gams
Matjaž Gams
Mitja Luštrek
Marko Grobelnik
Nikola Guid
Marjan Heričko
Borka Jerman Blažič Džonova
Gorazd Kandus
Urban Kordeš
Marjan Krisper
Andrej Kuščer
Jadran Lenarčič
Borut Likar
Janez Malačič
Olga Markič
Dunja Mladenič
Franc Novak
Vladislav Rajkovič
Grega Repovš
Ivan Rozman
Niko Schlamberger
Stanko Strmčnik
Jurij Šilc
Jurij Tasič
Denis Trček
Andrej Ule
Tanja Urbančič
Boštjan Vilfan
Baldomir Zajc
Blaž Zupan
Boris Žemva
Leon Žlajpah

Invited lecture

AN UPDATE FROM THE AI & MUSIC FRONT

Gerhard Widmer
Institute for Computational Perception, Johannes Kepler University Linz (JKU), and Austrian Research Institute for Artificial Intelligence (OFAI), Vienna

Abstract

Much of current research in Artificial Intelligence and Music, and particularly in the field of Music Information Retrieval (MIR), focuses on algorithms that interpret musical signals and recognize musically relevant objects and patterns at various levels (from notes to beats and rhythm, to melodic and harmonic patterns and higher-level segment structure), with the goal of supporting novel applications in the digital music world. This presentation will give the audience a glimpse of what musically "intelligent" systems can currently do with music, and what this is good for. However, we will also find that while some of these capabilities are quite impressive, they are still far from (and do not require) a deeper "understanding" of music. An ongoing project will be presented that aims to take AI & music research a bit closer to the "essence" of music, going beyond surface features and focusing on the expressive aspects of music, and how these are communicated in music. This raises a number of new research challenges for the field of AI and Music (discussed in much more detail in [Widmer, 2016]).
As a first step, we will look at recent work on computational models of expressive music performance, and will show some examples of the state of the art (including the result of a recent musical 'Turing test').

References

Widmer, G. (2016). Getting Closer to the Essence of Music: The Con Espressione Manifesto. ACM Transactions on Intelligent Systems and Technology 8(2), Article 19.

KAZALO / TABLE OF CONTENTS

Odkrivanje znanja in podatkovna skladišča – SiKDD / Data Mining and Data Warehouses – SiKDD ... 1
PREDGOVOR / FOREWORD ... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ... 5
Annotating Documents with Relevant Wikipedia Concepts / Brank Janez, Leban Gregor, Grobelnik Marko ... 7
Impact of News Events on the Financial Markets / Torkar Miha, Mladenic Dunja ... 11
Challenges in media monitoring of worldwide news sources to support public health / Pita Costa Joao, Fuart Flavio, Grobelnik Marko, Leban Gregor, Belyaeva Evgenia ... 15
Ontology-based translation memory maintenance / Repar Andraz, Pollak Senja ... 19
Audience Segmentation Based on Topic Profiles / Kladnik Matic, Mladenic Dunja ... 23
Building Client's Risk Profile Based on Call Detail Records / Herga Zala, Doyle Casey, Moore Pat ... 27
Connecting Professional Skill Demand with Supply / Novak Erik, Novalija Inna ... 31
Analyzing raw log files to find execution anomalies / Jovanoski Viktor, Rupnik Jan, Karlovcec Mario, Fortuna Blaz ... 35
Usage of SVM for a Triggering Mechanism for Higgs Boson Detection / Kenda Klemen, Mladenic Dunja ... 39
A methodology to evaluate the evolution of networks using topological data analysis / Pita Costa Joao, Galinac Grbac Tihana ... 43
Improving mortality prediction for intensive care unit patients using text mining techniques / Kocbek Primoz, Fijacko Nino ... 47
Indeks avtorjev / Author index ... 51

Zbornik 20. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2017, Zvezek C
Proceedings of the 20th International Multiconference INFORMATION SOCIETY – IS 2017, Volume C

Odkrivanje znanja in podatkovna skladišča – SiKDD
Data Mining and Data Warehouses – SiKDD

Uredila / Edited by: Dunja Mladenić, Marko Grobelnik

http://is.ijs.si
9. oktober 2017 / 9 October 2017, Ljubljana, Slovenia

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale.
Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke – pojavilo se je t.i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line Analytical Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore ter na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz. z drugimi besedami po tem, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

Dunja Mladenić, Marko Grobelnik

FOREWORD

Data-driven technologies have progressed significantly since the mid-1990s. The first phase was mainly focused on storing the data and accessing it efficiently; it resulted in the development of an industry of tools for managing large databases, related standards, supporting query languages, etc. After the initial period, when data storage was no longer a primary problem, development progressed towards analytical functionality for extracting added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line Analytical Processing entered as a usual part of a company's information system portfolio, requiring the user to pose well-defined questions about the aggregated views of the data. Data Mining is a technology developed after the year 2000, offering automatic data analysis that tries to obtain new discoveries from the existing data and to give the user new insights into the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis; Data, Text and Multimedia Mining; Semantic Technologies; Link Detection and Link Analysis; Social Network Analysis; and Data Warehouses.

Dunja Mladenić, Marko Grobelnik

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Dunja Mladenić
Marko Grobelnik

ANNOTATING DOCUMENTS WITH RELEVANT WIKIPEDIA CONCEPTS

Janez Brank, Gregor Leban, Marko Grobelnik
Artificial Intelligence Laboratory, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Tel: +386 1 4773778; fax: +386 1 4251038
e-mail: {janez.brank,gregor.leban,marko.grobelnik}@ijs.si

ABSTRACT

We describe an efficient approach for annotating a document with relevant concepts from the Wikipedia. A pagerank-based method is used to identify a coherent set of relevant concepts considering the input document as a whole. The proposed approach is suitable for parallel processing and can support any language for which a sufficiently large Wikipedia is available.

1 INTRODUCTION

Recent years have seen a growth in the use of semantic technologies. However, in many contexts we still deal with largely unstructured textual documents that lack explicit semantic information such as might be required for further processing with semantic technologies. This leads to the problem of semantic annotation or semantic enrichment as an important preparatory step before further processing of a document. Given a document and an ontology covering the domain of interest, the challenge is to identify concepts from that ontology that are relevant to the document or that are referred to by it, as well as to identify specific passages in the document where the concepts in question are mentioned.

A specific type of semantic annotation, known as wikification, involves using the Wikipedia as a source of possible semantic annotations [1][2]. In this setting, the Wikipedia is treated as a large and fairly general-purpose ontology: each page is thought of as representing a concept, while the relations between concepts are represented by internal hyperlinks between different Wikipedia pages, as well as by Wikipedia's category memberships and cross-language links.

The advantage of this approach is that the Wikipedia is a freely available source of information, it covers a wide range of topics, it has a rich internal structure, and each concept is associated with a semi-structured textual document (i.e. the contents of the corresponding Wikipedia article) which can be used to aid in the process of semantic annotation. Furthermore, the Wikipedia is available in a number of languages, with cross-language links being available to identify pages that refer to the same concept in different languages, thus making it easier to support multilingual and cross-lingual annotation.

The remainder of this paper is structured as follows. In Section 2, we present the pagerank-based approach to wikification used in our wikifier. In Section 3, we describe our implementation and present some experimental evaluation. Section 4 contains conclusions and a discussion of possible future work.
2 PAGERANK-BASED WIKIFICATION

The task of wikifying an input document can be broken down into several closely interrelated subtasks: (1) identify phrases (or words) in the input document that refer to a Wikipedia concept; (2) determine which concept exactly a phrase refers to; (3) determine which concepts are relevant enough to the document as a whole that they should be included in the output of the system (i.e. presented to the user).

We follow the approach described by Zhang and Rettinger [1]. This approach makes use of the rich internal structure of hyperlinks between Wikipedia pages. A hyperlink can be thought of as consisting of a source page, a target page, and the link text (also known as the anchor text). If a source page contains a link with the anchor text a and the target page t, this is an indication that the phrase a might be a reference to (or a representation of) the concept that corresponds to page t. Thus, if the input document that we're trying to wikify contains the phrase a, it might be the case that this occurrence of a in the input document also constitutes a mention of the concept t, and the concept t is a candidate annotation for this particular phrase.

2.1 Disambiguation

In the Wikipedia, there may be many different links with the same anchor text a, and they might not all be pointing to the same target page. For example, in the English-language Wikipedia, there are links with a = "Tesla" that variously point to pages about the inventor, the car manufacturer, the unit in physics, a band, a film, and several other concepts. Thus, when such a phrase a occurs in an input document, there are several concepts that can be regarded as candidate annotations for that particular mention, and we have to determine which of them is actually relevant. This is the problem of disambiguation, similar to that of word sense disambiguation in natural language processing.

There are broadly two approaches to disambiguation, local and global. In the local approach, each mention is disambiguated independently of the others, while the global approach aims to treat the document as a whole and disambiguate all the mentions in it as a group. The intuition behind the global approach is that the document that we're annotating is about some topic, and the concepts that we use as annotations should be about that topic as well. If the document contains many mentions that include, as some of their candidate annotations, some car-related concepts, this makes it more likely that we should treat the mention of "Tesla" as a reference to Tesla the car manufacturer as opposed to e.g. a reference to Nikola Tesla or to Tesla the rock band.
2.2 The mention-concept graph

To implement the global disambiguation approach, our Wikifier begins by constructing a mention-concept graph for the input document. (Some authors, e.g. [2], refer to this as a mention-entity graph, but we prefer the term "mention-concept graph", as some of the Wikipedia pages do not necessarily correspond to concepts that we usually think of as entities, and our wikifier does not by default try to exclude them.) This can be thought of as a bipartite graph in which the left set of vertices corresponds to mentions and the right set of vertices corresponds to concepts. A directed edge a → c exists if and only if the concept c is one of the candidate annotations for the mention a (i.e. if there exists in the Wikipedia a hyperlink with the anchor text a and the target c). A transition probability is also assigned to each such edge, defined as the ratio

P(a → c) = [number of hyperlinks in the Wikipedia having the anchor text a and the target c] / [number of hyperlinks in the Wikipedia having the anchor text a].
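As an illustration of this construction, here is a minimal Python sketch (our own, not part of the paper); the `anchor_stats` structure, which maps each anchor text to its per-target hyperlink counts, and the toy numbers in the example are hypothetical stand-ins for statistics precomputed from a Wikipedia dump:

```python
from collections import defaultdict

def build_mention_concept_graph(mentions, anchor_stats):
    """Build the mention -> concept edges with transition probabilities.

    anchor_stats: dict mapping an anchor text a to a dict
    {target page c: number of Wikipedia hyperlinks with anchor a and target c}.
    """
    edges = defaultdict(dict)  # edges[a][c] = P(a -> c)
    for a in mentions:
        targets = anchor_stats.get(a, {})
        total = sum(targets.values())  # all hyperlinks with anchor text a
        for c, n_ac in targets.items():
            edges[a][c] = n_ac / total  # ratio of link counts, as defined above
    return edges

# Toy example: "Tesla" used 70 times as a link to Nikola Tesla,
# 25 times to the car manufacturer, 5 times to the band.
stats = {"Tesla": {"Nikola_Tesla": 70, "Tesla,_Inc.": 25, "Tesla_(band)": 5}}
print(build_mention_concept_graph(["Tesla"], stats))
```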
This graph is then augmented by edges between concepts, the idea being that an edge c → c' should indicate that the concepts c and c' are "semantically related", in the sense that if one of them is relevant to a given input document, the other one is also more likely to be relevant to that document. Following [1], the internal link structure of the Wikipedia is used to calculate a measure of semantic relatedness. Informally, the idea is that if c and c' are closely related, then other Wikipedia pages that point to c are likely to also point to c', and vice versa. Let L_c be the set of Wikipedia pages that contain a hyperlink to c, and let N be the total number of concepts in the Wikipedia; then the semantic relatedness of c and c' can be defined as

SR(c, c') = 1 − [log(max{|L_c|, |L_c'|}) − log|L_c ∩ L_c'|] / [log N − log(min{|L_c|, |L_c'|})].

In the graph, we add an edge of the form c → c' wherever the semantic relatedness SR(c, c') is > 0. The transition probability of this edge is defined as proportional to the semantic relatedness: P(c → c') = SR(c, c') / Σ_c'' SR(c, c'').

This graph is then used as the basis for calculating a vector of pagerank scores, one for each vertex. This is done using the usual iterative approach where in each iteration, each vertex distributes its pagerank score to its immediate successors in the graph, in proportion to the transition probabilities on its outgoing edges:

PR_new(u) = β PR_0(u) + (1 − β) Σ_v PR_old(v) P(v → u).

The baseline distribution of pagerank, PR_0, is used both to help the process converge and also to counterbalance the fact that in our graph there are no edges pointing into the mention vertices. In our case, PR_0(u) is defined as 0 if u is a concept vertex; if u is a mention vertex, we use PR_0(u) = z · [number of Wikipedia pages containing the phrase u as the anchor text of a hyperlink] / [number of Wikipedia pages containing the phrase u], where z is a normalization constant ensuring that Σ_u PR_0(u) = 1. We used β = 0.1 as the stabilization parameter.

The intuition behind this approach is that in each iteration of the pagerank calculation process, the pagerank flows into a concept vertex c from mentions that are closely associated with the concept c and from other concepts that are semantically related to c. Thus, after a few iterations, pagerank should tend to accumulate in a set of concepts that are closely semantically related to each other and that are strongly associated with words and phrases that appear in the input document, which is exactly what we want in the context of global disambiguation.
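A minimal sketch of this iteration (ours; the dictionary-based graph encoding and the fixed iteration count are simplifications rather than the authors' implementation):

```python
def pagerank(vertices, out_edges, pr0, beta=0.1, iters=50):
    """Iterate PR_new(u) = beta*PR_0(u) + (1-beta) * sum_v PR_old(v) * P(v -> u).

    out_edges[v] is a dict {u: P(v -> u)}; pr0 is the baseline distribution,
    nonzero only on mention vertices and summing to 1.
    """
    pr = dict(pr0)  # start from the baseline distribution
    for _ in range(iters):
        new = {u: beta * pr0.get(u, 0.0) for u in vertices}
        for v, successors in out_edges.items():
            for u, p in successors.items():
                new[u] += (1 - beta) * pr.get(v, 0.0) * p
        pr = new
    return pr
```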
2.3 Using pagerank for disambiguation

Once the pagerank values of all the vertices in the graph have been calculated, we use the pagerank values of concepts to disambiguate the mentions. If there are edges from a mention a to several concepts c, we choose the concept with the highest pagerank as the one that is relevant to this particular mention a. We say that this concept is supported by the mention a. At the end of this process, concepts that are not supported by any mention are discarded as not being relevant to the input document.

The remaining concepts are then sorted in decreasing order of their pagerank. Let the i'th concept in this order be c_i and let its pagerank be PR_i, for i = 1, …, n. Concepts with a very low pagerank value are less likely to be relevant, so it makes sense to apply a further filtering step at this point and discard concepts whose pagerank is below a user-specified threshold. However, where exactly this threshold should lie depends on whether the user wants to prioritize precision or recall. Furthermore, the absolute values of pagerank can vary a lot from one document to another, e.g. depending on the length of the document, the number of mentions and candidate concepts, etc. Thus we apply the user-specified threshold in the following manner: given the user-specified threshold value δ ∈ [0, 1], we output the concepts c_1, …, c_m, where m is the least integer such that Σ_{i=1..m} PR_i² ≥ δ Σ_{i=1..n} PR_i². In other words, we report as many top-ranking concepts as are needed to cover δ of the total sum of squared pageranks of all the concepts. We use δ = 0.8 as a broadly reasonable default value, though the user can require a different threshold depending on their requirements. For each reported concept, we also output a list of the mentions that support it.
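The threshold rule translates directly into code; a small sketch under the paper's definitions (function and variable names are ours):

```python
def select_concepts(pageranks, delta=0.8):
    """Keep the top concepts c_1..c_m, with m the least integer such that
    sum_{i<=m} PR_i^2 >= delta * sum_{i<=n} PR_i^2."""
    ranked = sorted(pageranks.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(pr * pr for _, pr in ranked)
    covered, selected = 0.0, []
    for concept, pr in ranked:
        selected.append(concept)
        covered += pr * pr
        if covered >= delta * total:
            break
    return selected
```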
2.4 Treatment of highly ambiguous mentions

Our wikifier supports various minor heuristics and refinements in an effort to improve the performance of the baseline approach described in the preceding sections.

As described above, the anchor text of hyperlinks in the Wikipedia is used to identify mentions in an input document (i.e. words or phrases that may support an annotation). One downside of this approach is that some words or phrases occur as the anchor text of a very large number of hyperlinks in the Wikipedia, and these links point to a large number of different Wikipedia pages. In other words, such a phrase is highly ambiguous; it is not only unlikely to be disambiguated correctly, but also introduces noise into the mention-concept graph by introducing a large number of concept vertices, the vast majority of which will be completely irrelevant to the input document. This also slows down the annotation process by increasing the time needed to calculate the semantic relatedness between all pairs of candidate concepts.

We use several heuristics to deal with this problem. Suppose that a given mention a occurs, in the Wikipedia, as the anchor text of n hyperlinks pointing to k different target pages, and suppose that n_i of these links point to page c_i (for i = 1, …, k). We can now define the entropy of the mention a as the amount of uncertainty regarding the link target given the fact that its anchor text is a:

H(a) = − Σ_{i=1..k} (n_i/n) log(n_i/n).

If this entropy is above a user-specified threshold (e.g. 3 bits), we completely ignore the mention as being too ambiguous to be of any use. For mentions that pass this heuristic, we sort the target pages in decreasing order of n_i and use only the top few of them (e.g. the top 20) as candidates in our mention-concept graph. A third heuristic is to ignore candidates for which n_i itself is below a certain threshold (e.g. n_i < 2), the idea being that if such a phrase occurs only once as the anchor text of a link pointing to that candidate, this may well turn out to be noise and is best disregarded.

Optionally, the Wikifier can also be configured to ignore certain types of concepts based on their Wikidata class membership. This can be useful to exclude from consideration Wikipedia pages that do not really correspond to what is usually thought of as entities (e.g. "List of…" pages).

Another heuristic that we have found useful in reducing the noise in the output annotations is to ignore any mention that consists entirely of stopwords and/or very common words (the top 200 most frequent words in the Wikipedia for that particular language). For this as well as for other purposes, the text processing is done in a case-sensitive fashion, which e.g. allows us to ignore spurious links with the link text "the" while processing those that refer to the band "The The".

2.5 Miscellaneous heuristics

Semantic relatedness. As mentioned above, the definition of the semantic relatedness of two concepts, SR(c, c'), is based on the overlap between the sets L_c, L_c' of immediate predecessors of these two concepts in the Wikipedia link graph. Optionally, our Wikifier can compute semantic relatedness using immediate successors or immediate neighbours (i.e. both predecessors and successors) instead of immediate predecessors. However, our preliminary experiments indicated that these changes do not lead to improvements in performance, so they are disabled by default.

Extensions to disambiguation. Our Wikifier also supports some optional extensions of the disambiguation process. As described above, the default behavior when disambiguating a mention is to simply choose the candidate annotation with the highest pagerank value. Alternatively, after any heuristics from Section 2.4 have been applied, the remaining candidate concepts can be re-ranked using a different scoring function that takes other criteria besides pagerank into account. This is an opportunity to combine the global disambiguation approach with some local techniques. In general, a scoring function of the following type is supported:

score(c|a) = w_1 f(P(c|a)) PR(c) + w_2 S(c, d) + w_3 LS(c, a).

Here, a is the mention that we're trying to disambiguate, and c is the candidate concept that we're evaluating. P(c|a) is the probability that a hyperlink in the Wikipedia has c as its target, conditioned on the fact that it has a as its anchor text. f(x) can be either 1 (the default), x, or log(x). PR(c) is the pagerank of c's vertex in the mention-concept graph. S(c, d) is the cosine similarity between the text of the input document d and that of the Wikipedia page for the concept c. LS(c, a) is the cosine similarity between the context (e.g. the previous and next 3 words) in which a appears in the input document d, and the contexts in which hyperlinks with the target c appear in the Wikipedia. Finally, w_1, w_2, w_3 are weight constants. However, our preliminary experiments haven't shown any improvements from the addition of these heuristics, so they are disabled by default (f(x) = 1, w_2 = w_3 = 0) to save computational time and memory (storing the link contexts needed for the efficient computation of LS has turned out to be particularly memory intensive).
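To make the entropy heuristic of Section 2.4 concrete, a short sketch (ours); log base 2 is assumed because the threshold is expressed in bits:

```python
import math

def mention_entropy(target_counts):
    """H(a) = -sum_i (n_i/n) * log2(n_i/n) over the link targets of mention a."""
    n = sum(target_counts.values())
    return -sum((ni / n) * math.log2(ni / n) for ni in target_counts.values())

def candidate_targets(target_counts, max_entropy=3.0, top_k=20, min_links=2):
    """Apply the three pruning heuristics described above."""
    if mention_entropy(target_counts) > max_entropy:
        return []  # mention is too ambiguous; ignore it entirely
    kept = [(c, n) for c, n in target_counts.items() if n >= min_links]
    kept.sort(key=lambda cn: cn[1], reverse=True)
    return [c for c, _ in kept[:top_k]]  # keep only the most frequent targets
```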
3 IMPLEMENTATION AND EVALUATION

3.1 Implementation

Our implementation of the approach described in the preceding section is running as a web service and can be accessed at http://wikifier.org. The approach is suitable for parallel processing, as annotating one document is independent of annotating other documents, and any shared data used by the annotation process (e.g. the Wikipedia link graph, and a trie-based data structure that indexes the anchor text of all the hyperlinks) needs to be accessed only for reading and can thus easily be shared by an arbitrary number of worker threads. This allows for highly efficient processing of a large number of documents.

Our implementation currently processes on average more than 500,000 requests per day (the total length of the input documents averages about 1.2 GB per day), including all the documents from the JSI Newsfeed service [3]. The output is used, among other things, as a preprocessing step by the Event Registry system [4]. The wikifier currently supports all languages in which a Wikipedia with at least 1000 pages is available, amounting to a total of 134 languages. Admittedly, 1000 pages is much too small to achieve adequate coverage; however, about 60 languages have a Wikipedia with at least 100,000 pages, which is already enough for many practical applications.

Annotations are returned in JSON format and can optionally include detailed information about support (which mentions support each annotation), alternative candidate annotations (concepts that were considered as candidates during the disambiguation process but were rejected in favour of some other more highly scored concept), and the WikiData/DbPedia class membership of the proposed annotations. Thus, the caller can easily implement any desired class-based postprocessing.

3.2 Evaluation

One way to evaluate wikification is to compare the set of annotations with a manually annotated gold standard for the same document(s). Performance can then be measured using metrics from information retrieval, such as precision, recall, and the F1-measure, which is defined as the harmonic mean of precision and recall. We used a manually annotated set of 1393 news articles that was made available by the authors of the AIDA system and was originally used in their experiments [2]. This manually annotated dataset excludes, by design, any annotations that do not correspond to named entities. Since our wikifier does not by default distinguish between named entities and other Wikipedia concepts, we have explicitly excluded non-entity concepts (based on their class membership in the WikiData ontology) from the output of our Wikifier for the purposes of this experiment. In addition to our wikifier, we obtained annotations from the following systems: AIDA [2], Waikato Wikipedia Miner [6], Babelfy [7], Illinois [8], and DbPedia Spotlight [9].

           Gold   JSI    AIDA   Waikato  Babelfy  Illinois  Spotlight
Gold       1.000  0.593  0.723  0.372    0.323    0.476     0.279
JSI               1.000  0.625  0.527    0.431    0.489     0.363
AIDA                     1.000  0.372    0.352    0.434     0.356
Waikato                         1.000    0.481    0.564     0.474
Babelfy                                  1.000    0.434     0.356
Illinois                                          1.000     0.376
Spotlight                                                   1.000

Table 1: F1 measure of agreement between the various wikifiers and the gold standard.

Table 1 shows the agreement not only between each of the wikifiers and the gold standard, but also between each pair of wikifiers (the lower left triangle of the matrix is left empty as it would be just a copy of the upper right triangle, since the F1-measure is symmetric). As this experiment indicates, our wikifier ("JSI" in the table) performs slightly worse than AIDA but significantly better than the other wikifiers. Furthermore, it turns out that there is relatively little agreement between the different wikifiers, which indicates that wikification itself is in some sense a vaguely defined task where different people can have very different ideas about whether a particular Wikipedia concept is relevant to a particular input document (and should therefore be included as an annotation) or not, which types of Wikipedia concepts can be considered as annotations (e.g. only named entities or all concepts), etc. Possibly the level of agreement could be improved by fine-tuning the settings of the various wikifiers; in the experiment described above, default settings were used.
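Reading each wikifier's output as a set of (document, concept) annotations, the pairwise agreement in Table 1 can be computed as below; this is a small sketch of the standard metric, not the authors' evaluation harness:

```python
def f1_agreement(annotations_a, annotations_b):
    """Symmetric F1 between two sets of (document, concept) annotations."""
    a, b = set(annotations_a), set(annotations_b)
    if not a or not b:
        return 0.0
    precision = len(a & b) / len(a)  # fraction of A's annotations found in B
    recall = len(a & b) / len(b)     # fraction of B's annotations found in A
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```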
4 CONCLUSIONS AND FUTURE WORK

We have presented a practical and efficient approach to wikification that requires no external data except the Wikipedia itself, that can deal with documents in any language for which the Wikipedia is available, and that is suitable for a high-performance, parallelized implementation.

The approach presented here could be improved along several directions. One significant weakness of the current approach concerns the treatment of minority languages. When dealing with a document in a certain language, we need hyperlinks whose anchor text is in the same language if we are to identify mentions in this input document. Thus, if the document is in a language for which the Wikipedia is not available at all, it cannot be wikified using this approach; and similarly, if the Wikipedia is available in this language but is small, with a small amount of text, a low number of pages, and generally poor coverage, the performance of wikification based on it will be low. One idea to alleviate this problem would be to optionally allow a second stage of processing, in which Wikipedias in languages other than the language of the input document would also be used to identify mentions and provide candidate annotations. This might improve coverage especially of concepts that are referred to by the same words or phrases across multiple languages, as is the case with some types of named entities. For the purposes of pagerank-based disambiguation in this second stage, a large common link-graph would have to be constructed by merging the link-graphs of the Wikipedias for different languages. This can be done by using the cross-language links which are available in the WikiData ontology, providing information about when different pages in different languages refer to the same concept.

Another interesting direction for further work would be to try incorporating local disambiguation techniques as a way to augment the current global disambiguation approach. When evaluating whether a mention a in the input document refers to a particular concept c, the local approach would focus on comparing the context of a to either the text of the Wikipedia page for c, or to the contexts in which hyperlinks to c occur within the Wikipedia. Preliminary steps taken in this direction in Sec. 2.5 did not lead to improvements in performance, but this subject is worth exploring further. Instead of the bag-of-words representation of contexts, other vector representations of words could be used, e.g. word2vec [5].

Acknowledgments

This work was supported by the Slovenian Research Agency as well as the euBusinessGraph (ICT-732003-IA) and EW-Shopp (ICT-732590-IA) projects.

References

[1] L. Zhang, A. Rettinger. Final ontological word-sense-disambiguation prototype. Deliverable D3.2.3, xLike Project, October 2014.
[2] J. Hoffart, M. A. Yosef, I. Bordino, et al. Robust disambiguation of named entities in text. Proc. of the 2011 Conf. on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, 2011, pp. 782–792.
[3] M. Trampuš, B. Novak. Internals of an aggregated web news feed. Proc. SiKDD 2012.
[4] G. Leban, B. Fortuna, J. Brank, M. Grobelnik. Event registry: Learning about world events from news. Proc. of the 23rd Int. Conf. on the World Wide Web (WWW 2014), pp. 107–110.
[5] T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient estimation of word representations in vector space. Arxiv.org, 2013.
[6] D. Milne, I. H. Witten. An open-source toolkit for mining Wikipedia. Artificial Intelligence, 194:222–239 (January 2013).
[7] A. Moro, A. Raganato, R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Trans. of the Assoc. for Comp. Linguistics, 2:231–234 (2014).
[8] L. Ratinov, D. Roth, D. Downey, M. Anderson. Local and global algorithms for disambiguation to Wikipedia. Proc. of the 49th Annual Meeting of the Assoc. for Comp. Linguistics: Human Language Technologies (2011), pp. 1375–1384.
[9] J. Daiber, M. Jakob, C. Hokamp, P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. Proc. of the 9th Int. Conf. on Semantic Systems, 2013.
Impact of News Events on the Financial Markets

Miha Torkar
Jožef Stefan International Postgraduate School and Artificial Intelligence Laboratory, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
miha.torkar@ijs.si

Dunja Mladenič
Jožef Stefan International Postgraduate School and Artificial Intelligence Laboratory, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
dunja.mladenic@ijs.si

ABSTRACT

In this work we investigate how news events can be used to predict the financial markets. Namely, we built a time series model that includes features obtained from the news and investigated whether the changes in the volume of traded shares can be predicted more accurately with this information. The time series model that was built is of an ARMA-GARCH type, because we wanted to account for any clustering of the volatility that is normal for the financial markets. The models were evaluated with the Akaike and Bayesian Information Criterion, while also being compared to the baseline model that did not include any features from the news. Overall, our results show that there is an improvement in the model when the information from the news is used, and hence they show a promising avenue for future research work.

1. INTRODUCTION

The predictability of future movements of financial markets is a well researched area in the literature and offers many interesting theories. Financial institutions and investors have been using various sources of information in order to increase the accuracy of their predictions and consequently outperform others. There are several approaches to building the best predictive model, but in general we can divide them into two categories: technical analysis and fundamental analysis. On one side we have models that are based on the historical market data and assume that the past movements will repeat themselves. This is the so-called technical analysis approach to modelling markets; it assumes that an experienced observer can detect the repetition of a pattern from a graph of market data.
The effectiveness of this approach was tested by [9]. This approach however does not offer any reasons why market movements would repeat, and in order to incorporate more fundamental beliefs, a second approach named fundamental analysis was developed. These models use data that is available from multiple sources, ranging from a company's balance sheet data and financial market data like the company's index, through financial data about government activities, to data about political or geographical circumstances presented in the news.

From the variety of sources of fundamental data, we will focus on how news can be used in modelling financial markets. There have been various studies on this topic, which differ by the extent of the analysis of the textual data describing a news story. Initially, research focused on the impact of the frequency of news stories on the market movements (see [6] or [7]). In these papers the authors found some correlation between an increased number of published news stories and larger market movements. To extend this approach, researchers also analysed the content of the news stories, which led to determining the sentiment of the news articles and consequently determining the impact on the market on the basis of whether it is positive (upward trend predicted) or vice versa (see [3] or [4]). Our approach is similar to the second one, but instead of determining sentiment for each news event, we use the effect that past similar events had on the market as a proxy for the impact of the current event.

We define an event as a collection of news articles from different sources about the same story in the news. From this collection of articles we are able to extract the topic, date, location, social score (how trending this event was on social media) and all of the entities involved in the event, which adds to the complexity of our dataset. One possible source of such a dataset is a system called Event Registry (see [5]), which automatically extracts events from news articles. Using this type of data source is novel for this research area and hence serves as an additional contribution of this work.

In order to see the impact that these events had on the financial market, we looked at how the volume of traded shares changes on the days of the event. We obtained several features from the news events and checked whether they would allow us to improve the time series model for the volume change.

2. DATA

For the historical market data of the company, we collected the following values (prices): Open, High, Low, Close, Volume. In addition to the market values of a company we also used the value of the market volatility index VIX (closing price). Secondly, with the use of the Event Registry system we obtained all of the news events relating to the company. The general description above was intentional, as it can be applied to any publicly listed company, but for our specific example we will use data about the investment bank Goldman Sachs (GS). In Figure 1 we present the dataset we will be modelling and predicting, where it can be seen that there are large spikes at certain time periods. Moreover, the volume change graph demonstrates that we have clusters of periods with high volatility.

[Figure 1: Volume Dataset. Two panels showing the Volume series and the Volume Change series for Goldman Sachs from December 2013 to November 2016.]

2.1 Data Description

Our dataset spanned from 2.12.2013 to 30.12.2016, which offered us 777 trading days on which we were able to collect historical market data. On the other hand, the number of news events that occurred in that time period was significantly higher. When we singled out events using 50 as the relevance threshold we obtained 4336 events (details of how Event Registry calculates the weights of concepts in the event are presented in [5]).

In order to reduce noise in the dataset of events we selected only the most important ones. Namely, we set a limit which determined the lower boundary for the relevance of the events to the given company; if an event was not relevant enough we discarded it. This naturally raises the issue of selecting the appropriate boundary for the relevance of the events, and after testing several values we selected 89 out of 100. After this we were left with 424 events.

What should be noted here is that many events occurred on days when the markets were closed (weekends, holidays), so they had to be linked with the next possible trading day that followed. Another issue was that in many cases multiple events occurred on the same day, and hence we were not able to isolate the effect of a single event, but rather looked at the average effect of all the events on a specific day. Therefore we are also relying on the fact that there are no duplicates in our dataset.

For testing purposes we split our dataset into two parts, where we allocated 757 observations for training and testing, while we used the remaining 21 observations for the out-of-sample prediction.

For each event we also obtained the events related to it. These related events are obtained by computing the TF-IDF weights on the concepts present in the event; then, using the cosine similarity measure, other events with similar concept weights are found (see [5]). These past similar events formed a crucial dataset, because we could link them to the market movements and deduce their impact. On average each event had between 5 and 17 similar events in our dataset.
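The similar-events lookup described above can be sketched with standard tools; this is our own illustration (Event Registry computes this internally), with hypothetical concept strings standing in for real event profiles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each event is reduced to a space-separated string of its concept labels;
# these toy events are hypothetical stand-ins for Event Registry output.
events = [
    "Goldman_Sachs earnings bond_trading estimates",
    "Goldman_Sachs profit quarterly_results trading",
    "Federal_Reserve interest_rates monetary_policy",
]
tfidf = TfidfVectorizer().fit_transform(events)
sims = cosine_similarity(tfidf)  # sims[i, j] = similarity of events i and j

# Events most similar to event 0, excluding the event itself:
ranking = [j for j in sims[0].argsort()[::-1] if j != 0]
print(ranking)
```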
2.2 Data preprocessing

In order to measure the impact of an event we will be considering the change in volume that occurs between the closing value of today and the day after the event. Specifically, we will be analysing and predicting the value of

(Volume Change)  VC_t = (V_{t+1} − V_t) / V_t ,  (1)

where V_t is the value of the volume at time t. In this formulation we used future values, so that we can observe the impact an event has on future volume. It should be noted that this value is calculated in the same manner as the returns of shares, which we will also use in our analysis. The formula is identical except that we replace the values of volume with those of the closing price P_t. Hence we write

(Returns)  r_t = (P_{t+1} − P_t) / P_t .  (2)

Changes in the volatility index VIX were calculated with the same formula. Additionally, we also added rolling 5- and 10-day moving averages of the volume change to the feature set, so that we would have another measure of the impact an event has on value. Hence the complete list of the features that were obtained from the stock market is:

• Returns
• Volume Change
• Open
• High
• Low
• Close
• Volume
• VIX Close
• VIX Change
• Rolling mean 5 days
• Rolling mean 10 days
• Rolling EMA 5 days
• Rolling EMA 10 days
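A small pandas sketch of this preprocessing (our own; the column names are assumptions, and equations (1)-(2) are implemented with a forward shift):

```python
import pandas as pd

def market_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: daily rows with columns Open, High, Low, Close, Volume, VIX_Close."""
    out = df.copy()
    # (1) VC_t = (V_{t+1} - V_t) / V_t and (2) r_t = (P_{t+1} - P_t) / P_t
    out["VolumeChange"] = (out["Volume"].shift(-1) - out["Volume"]) / out["Volume"]
    out["Returns"] = (out["Close"].shift(-1) - out["Close"]) / out["Close"]
    out["VIXChange"] = (out["VIX_Close"].shift(-1) - out["VIX_Close"]) / out["VIX_Close"]
    # rolling means and exponential moving averages of the volume change
    out["RollMean5"] = out["VolumeChange"].rolling(5).mean()
    out["RollMean10"] = out["VolumeChange"].rolling(10).mean()
    out["RollEMA5"] = out["VolumeChange"].ewm(span=5).mean()
    out["RollEMA10"] = out["VolumeChange"].ewm(span=10).mean()
    return out
```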
3. METHODS

3.1 Predictions from Similar Events

In order to build our time series model, which uses data from news events, we examined the effects of past similar events on the market for each new event separately. This was done by first extracting the market information on the days of the similar events and combining it with the additional information about the event (social score, relevance to the company, correlation to the current event, number of articles).

An illustration of this process for an event E, "Goldman Profit Rises 74% as Bond Trading Beats Estimates", with two similar events (SE) is presented in Figure 2. On the two days of SE1 and SE2 we collect all of the available stock market values for the company and the index VIX. It should also be noted that the time difference between an event and its past similar events is not limited or predetermined. In this case we have events that are years apart.

[Figure 2: Similar Events]

Once the dataset of market impacts from past similar events was obtained, an Ordinary Least Squares (OLS) model was fit to it. In Figure 2 these would be all of the market variables available at times t_{i−k} and t_{i−j}. This dataset then allowed us to use data from the market and the current event to form a prediction of what the future market movement could be. This procedure was done for both volume change and returns. Several choices of the external regressors used in the OLS were tested, where the main issue arose when there were fewer similar events than features to be fit. Namely, we fitted two models, one with all of the market information available and one where only Returns, VIX Change and concept weight were used to predict Volume Change. Similarly, two models were fit to predict Returns, where Volume Change, VIX Change and concept weights were used as features in the second one. In addition, we also calculated the average of the volume changes and returns from the dates of similar events and used it as a feature. In order to take into account the correlation to the similar events, we created an extra feature that represented the average of the volume changes and returns multiplied by the normalized correlation (correlation/100). All together, the following features were added to the model (all of which are predictions from similar events):

• Number of Events that day
• Returns (all regressors)
• Returns (selected regressors)
• Volume Change (all regressors)
• Volume Change (selected regressors)
• Average Returns
• Average Returns × Correlation
• Average Volume Change
• Average Volume Change × Correlation

With the OLS model trained on the dataset of similar events, we were then able to predict the impact of the original event on the volume change. It should also be noted that the OLS model was selected in order to avoid overfitting an ARMA-type model. The predictions obtained in this way were then used in the next step, when we were building our regression model.

3.2 ARMA-GARCH MODEL

The main model in our analysis will be of the ARMA-GARCH type, because this formulation allows us to capture the effects of values from previous periods and account for clustering in volatility.

The ARMA model is in its general form specified as:

X_t = μ_t + Z_t,
μ_t = Σ_{i=0..p} α_i X_{t−i} + Σ_{j=1..q} ω_j Z_{t−j},  (3)
Z_t = σ ε_t, ε ∼ N(0, 1) ⇒ Z_t ∼ N(0, σ²),

where X_t is the target variable at time t, μ_t is the equation for the mean at time t, ε ∼ N(0, 1) is an iid normally distributed noise term and σ² is the variance. The first step in modelling is to determine the best values for (p, q) according to the evaluation criteria.

The above model however assumes that the variance σ² is constant. This assumption is dropped due to the clustering in the volume change (periods of high changes are followed by lower ones). This type of model is called a Generalised Autoregressive Conditional Heteroskedasticity (GARCH) model ([2]); such models have been shown across the literature to improve the fit (see [8]). They allow us to model σ = σ_t by an additional non-linear model. The general formulation of the problem is the following:

σ²_t = β_0 + Σ_{i=1..r} β_i Z²_{t−i} + Σ_{j=1..s} γ_j σ²_{t−j} .  (4)

So the current value of the variance also depends on the previous values of the variance. From this general formulation one has to determine the values of (r, s) that are most suitable for the model. Finally, we can add external regressors (features) to equation 3 as an additional sum term in the first line of the equation.
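The paper does not say which software was used to estimate the model; one common way to approximate an ARMA(2,2)-GARCH(5,1) fit with external regressors in Python is a two-step procedure with statsmodels and the arch package, sketched below (a simplification under that assumption, not the authors' estimation code):

```python
from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model

def fit_arma_garch(y, exog=None, arma=(2, 2), garch=(5, 1)):
    """Two-step approximation of ARMA(p,q)-GARCH(r,s) with external regressors:
    fit the mean equation first, then a GARCH model on its residuals."""
    p, q = arma
    mean_fit = ARIMA(y, exog=exog, order=(p, 0, q)).fit()
    r, s = garch
    # sigma^2_t = beta_0 + sum_i beta_i Z^2_{t-i} + sum_j gamma_j sigma^2_{t-j}
    vol_fit = arch_model(mean_fit.resid, mean="Zero", vol="GARCH",
                         p=r, q=s, rescale=False).fit(disp="off")
    return mean_fit, vol_fit

# y: the volume-change series; exog: a matrix of news-based features.
# mean_fit.aic and vol_fit.aic can then be inspected for model selection.
```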
3.3 Evaluation Criteria

In order to assess which model was most suitable for the given dataset, we used a variety of different tests. To be able to assume that the time series is stationary, we first ran the Augmented Dickey-Fuller test and the KPSS test; the p-values were below and above the 5% threshold, respectively. To differentiate between the various models, we analysed the performance of each model using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are the classical information criteria used across the literature and which also penalize a model for high complexity ([1]). Finally, in order to determine the significance of the features, we also built a model without the predictions from the similar news events. This serves as our baseline model, and the difference in performance is then the main measure of how significant these features are.

4. RESULTS

Our first step in determining which features are significant for our analysis was running a t-test selection procedure for all regressors. From this analysis it was determined that only the following features were relevant for our model:

• VIX Close price
• Rolling mean 5 days
• Rolling mean 10 days
• Rolling EMA 5 days
• Rolling EMA 10 days
• Prediction of Returns

Hence our first finding was that the predictions we obtained from the similar events were significant; however, instead of the predicted volume changes, the predicted returns turned out to be more significant for our model.

With this set of regressors the best ARMA model was of order (2,2), so two lagged terms in both variables were included. This model performed according to the evaluation criteria shown in Table 1. As mentioned before, in order to capture clustering in the dataset we modelled the variance term further with a GARCH-type model. Again the same evaluation criteria were used, and a grid search was performed to find the best coefficients (r, s) for the order of the model. This resulted in a GARCH(5,1) model, with the evaluation criteria presented in the same Table 1.

Comparison of the results shows what improvements have been made by including the feature; the table also serves as a comparison to the baseline model. Since the AIC is merely a heuristic, the differences between its values are what matter. The predicted feature from similar events improves the AIC value by 10.22, while the improvement of our final model is 32.69. One way of interpreting this difference is in terms of relative likelihood, which is defined as exp((AIC_min − AIC_i)/2) for model i, where AIC_min represents the lowest AIC value among all models. Hence the baseline model is 7.97 × 10⁻⁸ times as probable as the best model to minimize the information loss, while the ARMA(2,2) model with the feature is 1.06 × 10⁻⁵ times as probable. However, we can see that due to the additional complexity of our final model, the BIC criterion is actually the highest for our chosen model. Hence an optimisation by the BIC criterion would result in a different model.

Model                     AIC      BIC
Baseline                 -925.57  -874.66
ARMA(2,2)                -935.35  -879.81
ARMA(2,2)-GARCH(5,1)     -958.26  -869.96

Table 1: Evaluation criteria, train set
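As a quick check of the relative-likelihood arithmetic quoted above (our own verification of the paper's numbers):

```python
import math

aic = {"Baseline": -925.57, "ARMA(2,2)": -935.35,
       "ARMA(2,2)-GARCH(5,1)": -958.26}
aic_min = min(aic.values())

for name, value in aic.items():
    # relative likelihood of model i: exp((AIC_min - AIC_i) / 2)
    print(name, math.exp((aic_min - value) / 2))
# Baseline -> 7.97e-08, ARMA(2,2) -> 1.06e-05, best model -> 1.0
```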
It should be noted that we also tested the assumption about the distribution of the standardized residuals. Instead of the normal distribution, we fitted the model with a Student's t distribution, with and without skew. Some minor improvements in the AIC value were obtained, but they were not significant, hence we kept the normal distribution.

The final test included out-of-sample predictions, where our best chosen model was the ARMA(2,2)-GARCH(5,1). The time period for prediction was 1 month (December 2016) with 21 trading days. We also calculated the errors when making these predictions: our best model scored 0.1344944 in Mean Absolute Error and 0.1565652 in Root Mean Squared Error. Figure 3 shows how the model predicted future values, where the middle line represents the prediction of the mean value. Additionally, the plot includes the upper and lower 95% and 80% quantile intervals.

[Figure 3: Out of Sample Forecast. Forecast of 20 days of Volume Change, showing the mean prediction with 50%, 80% and 95% quantile bands.]

5. CONCLUSION
In this paper we investigated one possible approach to determining the impact that news has on financial markets. This is a growing research area, since nowadays any model that wishes to capture market dynamics has to account for the effect of world events. In order to extend previous work in this area, we demonstrated how more complex text data can be used to obtain relevant features for modelling changes in financial markets. Our data consisted of news events that were automatically extracted from news articles. For each event we then collected past similar events and observed how the market reacted to those events. On the basis of these reactions we built various feature vectors that helped us improve our model. The results show that when predicting the change in volume, the predicted returns from similar events served as a useful feature. Additionally, we tested the performance of our improved time series model on an out-of-sample dataset for a period of 1 month. Future work will continue in this direction: we will look for further similarities between news events and possibly obtain new features that could be used for modelling financial markets.

6. ACKNOWLEDGEMENTS
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 675044.

7. REFERENCES
[1] H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle, pages 199–213. Springer New York, New York, NY, 1998.
[2] T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986.
[3] X. Ding, Y. Zhang, T. Liu, and J. Duan. Deep learning for event-driven stock prediction. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 2327–2333. AAAI Press, 2015.
[4] R. Fehrer and S. Feuerriegel. Improving Decision Analytics with Deep Learning: The Case of Financial Disclosures. ArXiv e-prints, August 2015.
[5] G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Event registry: Learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14 Companion, pages 107–110, New York, NY, USA, 2014. ACM.
[6] M. L. Mitchell and J. H. Mulherin. The impact of public information on the stock market. The Journal of Finance, 49(3):923–950, 1994.
[7] D. Shen, W. Zhang, X. Xiong, X. Li, and Y. Zhang. Trading and non-trading period internet information flow and intraday return volatility. Physica A: Statistical Mechanics and its Applications, 451:519–524, 2016.
[8] R. Tsay. An Introduction to Analysis of Financial Data with R. John Wiley & Sons, New Jersey, 2013.
[9] H. Yu, G. V. Nartea, C. Gan, and L. J. Yao. Predictive ability and profitability of simple technical trading rules: Recent evidence from southeast asian stock markets. International Review of Economics & Finance, 25:356–371, 2013.
Challenges in media monitoring of worldwide news sources to support public health

Joao Pita Costa * **, Flavio Fuart *, Marko Grobelnik *, Gregor Leban * and Evgenia Belyaeva ***

Abstract—Real-time global media monitoring is nowadays an essential resource for public health. Multilingual capabilities can enrich this potential, allowing a worldwide overview based on online news sources, blog posts or social media. In this paper we propose research topics related to the exploration of text mining tools used to provide real-time global media monitoring in the context of health. We aim to understand how media can contribute to a better overview of health-related and well-being matters. With it we shall also identify open research questions that motivate further technological development to better fit the needs, interests and workflow of public health professionals.

I. INTRODUCTION
Dealing with media (written online media in particular) has several issues, one of which is the lack of common publishing standards. Another issue is related to the global nature of the service: the mere fact of a system being universal requires a system that can manage a variety of languages, possibly hundreds of them. This creates issues, as today's language technologies can deal only with words and sentences, highlighting the need to bridge the gap from simple textual representation towards semantic representation, where we would want to understand the semantic and conceptual aspects of the textual content and not just the lexical (words and phrases) and syntactical (sentences) ones.

[Fig. 1. The contribution of text mining efforts based on online multi-lingual news to public health raises several research questions. The image shows an ER visualisation module representing a real-time stream of news that permits us to explore the range of some of those questions.]

Considerable research and commercial activity in the past period has led to the development of high-performance online news monitoring systems and accompanying tools, methods and models. For example, a complete cross-lingual news processing pipeline consisting of the following components has achieved good results in both research and commercial usage scenarios: the NewsFeed system [16] to monitor, gather and produce clear-texts of online (HTML) news articles; the Sentiment Detection module [17]; the Enrycher module for text annotation [7]; Wikifier for advanced text categorization [2]; cross-lingual document linking [11]; document clustering [1]; and Event Registry [9], an advanced news visualization and analysis tool.

According to the ECDC (European Centre for Disease Prevention and Control), the objective of epidemic intelligence is to produce timely, validated and actionable intelligence on events related to communicable diseases or of unknown origin that are of interest for public health and health authorities [4]. A similar definition is provided also by the WHO [18]. Part of this effort is to gather unofficial, unstructured and unverified information about those events, which is then later verified and analyzed by public health experts. Several ICT solutions to support these efforts have emerged. The most notable solutions used by authorities are GPHIN [3] (Global Public Health Intelligence Network), developed and operated by the Canadian Government; MedISys [15], developed and operated by the Joint Research Centre of the European Commission; and Healthmap.org, a system developed by Boston Children's Hospital receiving external funding.

All those systems are multi-lingual to some extent, i.e. they monitor news in more than one language.
However, it seems they do not leverage the usefulness of cross-lingual approaches to increase the quality of detected health events. Also, they seem not to use Wikipedia, which is nowadays the biggest freely available knowledge base, to extract meaningful information from news articles. There is a relatively large offer of similar online news monitoring ("clipping") systems available, each of them with a distinctive set of features. The authors, however, are not aware of a monitoring system that would implement as rich a set of cross-lingual features as the online news processing pipeline mentioned above.

In this paper we aim to identify research questions, related to the features highlighted above (cross-linguality, wikification, among others), that can contribute to appropriate software development serving the needs of Public Health professionals. Part of the work presented here was developed in the context of the European Union research project MIDAS, under the programme Horizon 2020.

* Quintelligence, Ljubljana, Slovenia
** University of Rijeka, Croatia
*** Jozef Stefan Institute, Ljubljana, Slovenia
Category: H.4.0 Information Systems Applications: General — reporting, statistical analysis, visualisation, networks
Keywords: Text mining, public health, news, media.

II. COLLECTING THE DATA
Online news media sources represent a reliable and structured near-real-time stream of heterogeneous, multilingual text documents that describe real-world events. A range of services offers the aggregation of both social media and online news, following the uptake of news publishing through the latter channel. Several media news aggregators provide web crawlers for information extraction and media monitoring, aiming for newsworthy stories. In order to extract meaningful information for a particular application domain, automatic event detection mechanisms exist, allowing us to measure the media impact of public health awareness-raising campaigns, health-related news coverage and bias across different outlets. These also include sentiment detection and cross-lingual linking of documents applied to news sources. In the public health domain, such approaches have been used to detect disease outbreaks and other public health threats (e.g. monitoring of international social and sports events, anti-vaccination campaigns, among others).

[Fig. 2. The automatic annotation tool Wikifier used to enrich a WHO article on vaccination, taken as an example. The article is copied to the Text field and has several underlined words corresponding to the identified Wikipedia concepts. When hovering over the D icon on the disease name, we notice that this article is mostly about Yellow Fever, we see that yellow fever is a disease and we see that Kinshasa is a settlement. If we further explore the DBpedia entry, the area, number of inhabitants, country and other information is available.]

Among the available news aggregator and analysis services that provide some level of access to online news streams, we highlight the NewsFeed system (available at newsfeed.ijs.si). It is a real-time aggregated stream of semantically enriched news articles tracking RSS-enabled global media sources worldwide. In particular, NewsFeed currently monitors around 900,000 RSS news feeds (~800,000 web sites) and collects between 350,000 and 600,000 articles per day, with an article archive available since May 2008. Using the current API it is possible to get annotated articles from June 2013 onwards. Currently, about 50% of all articles are in English. All languages present in Wikipedia are included, but they are covered to differing degrees in terms of quality, volume and extent of the analysis performed [8]. For research purposes, the authors were granted access to the NewsFeed system. The data is accessed through an HTTP API and the result is provided in XML format. Additional metadata can be obtained through the API: named entities, concepts, categories, mentioned places and similar [14]. Furthermore, this technology could be used for additional data annotation and analysis tailored to public health applications (e.g. the annotation with Wikifier discussed later in Section III).
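As an illustration of how such an HTTP/XML feed might be consumed, consider the sketch below. The endpoint path, query parameters and record layout are hypothetical (the exact NewsFeed API is not specified in the text and access requires permission); only the general pattern of fetching and parsing XML article records is shown.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameters -- the real NewsFeed API
# (newsfeed.ijs.si) requires granted access and may differ.
URL = "http://newsfeed.ijs.si/stream?after=2013-06-01&format=xml"

with urllib.request.urlopen(URL) as response:
    tree = ET.parse(response)

# Assumed record layout: <article> elements carrying language and title.
for article in tree.getroot().iter("article"):
    lang = article.get("lang", "unknown")
    title = article.findtext("title", default="")
    print(lang, title[:80])
```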
The news monitoring software Event Registry (ER) (available at www.eventregistry.org) feeds on NewsFeed, tracking over 100,000 global media sources in near-real-time, operating across 100 languages, and aggregating global media content in a semantically meaningful way. It collects over 300,000 news articles on average per day and arranges them into events; events are further connected into story-lines, which enable the tracking of evolving topics [9]. Since ER covers most of the global media reporting on any topic, it can also be used to track topics like health and well-being on different levels of resolution, from small local issues up to higher-level country issues and global trends.

III. AUTOMATIC ANNOTATION
The complexity of entity identification makes automatic multilingual text analysis a difficult task. Wikifier (available at wikifier.ijs.si) takes advantage of Wikipedia, the biggest open, online and up-to-date knowledge repository on the internet, to annotate text and link it to relevant knowledge resources. Wikifier allows for annotating large quantities of free text in a very short time. The type of analysis provided permits us to identify trends (e.g. health-related lifestyles) or ask questions like: provide me all texts (articles) that are about vaccination campaigns in Africa in cities with a population of more than 2 million people. Figure 2 shows the output of the automatic annotation of the abstract of a recent article on cholera. The top ten annotated concepts include "Vibrio cholerae", "Fresh water" and "Bacteria".

Since Wikifier is constructed over the knowledge base of Wikipedia, it is essential that the Wikipedia coverage of health topics is of high quality, which is itself an open question. Another open question is whether an extension of Wikipedia can capture all of the concepts covered by MeSH.
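A sketch of calling such an annotation service follows. The endpoint and parameter names (`text`, `lang`, `userKey`) are assumptions based on the public Wikifier web service and may differ from its current interface; a registered user key is required.

```python
import json
import urllib.parse
import urllib.request

def wikify(text, lang="en", user_key="YOUR_KEY"):
    """Annotate free text with Wikipedia concepts via the Wikifier service.

    Endpoint and parameter names are assumptions about the public service;
    consult the documentation at wikifier.ijs.si before use.
    """
    data = urllib.parse.urlencode(
        {"text": text, "lang": lang, "userKey": user_key}
    ).encode("utf-8")
    req = urllib.request.Request("http://www.wikifier.org/annotate-article", data=data)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example: annotate a sentence from a health article.
result = wikify("Yellow fever cases were confirmed in Kinshasa.")
for annotation in result.get("annotations", []):
    print(annotation.get("title"), annotation.get("url"))
```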
IV. FROM GLOBAL TO LOCAL MEDIA MONITORING
In the context of Public Health, a graphical dashboard can contribute to the real-time monitoring of global health by allowing continuous observation of medical and well-being issues, limiting the presentation to articles/items related to those topics (a possible health dashboard is available at [6]). Such a dashboard presents the incoming multi-lingual health-related media content published anywhere in the world. This allows the health professional to have an overview of what is happening globally at any given moment in time (cf. [5]). It is a convenient way to observe global health by using the ER system introduced above in Section II. Its key feature is the ability to observe health issues across many languages and in temporal detail, over a variety of scales, which is where most other systems have difficulties [10]. The image in Figure 1 shows a snapshot of the dashboard with health-related events on July 20th, 2017.

The event of the Zika outbreak, for instance, is identified immediately after the collection of the news articles that report about it. With it, the health professional can explore the evolution of the news publishers' awareness of the epidemic over time by looking at the related news articles represented on a world map, as they were identified or updated during a selected period of time. ER can find articles and events related to a particular entity, topic, date, location or category, as well as measure their impact on social media (in particular, Twitter). Moreover, its cross-lingual capabilities allow it to consider events where the news appears only in, e.g., Chinese or some other language, where the news never reached the English-speaking space. This is of particular interest when considering the monitoring of rare diseases worldwide.

Another perspective is the micro view of a particular health-related event happening somewhere in the world. The event of "lead poisoning" in Nigeria on May 12th, 2015, which went mostly unnoticed at a global level, is an example that can be explored through ER. With a simple query to the system, the health professional can extract all the reports related to that event, and check the event in the context of other related events. To illustrate this, the screenshots in Figures 3 and 4 show the event itself and the story-line developed after the event, respectively.

[Fig. 3. ER screenshot showing the event of "lead poisoning" in Nigeria on May 12, 2015. It includes date, location, categories, number of languages covered and social media (Twitter) count.]
[Fig. 4. ER screenshot showing the story-line developed after the event of "lead poisoning" in Nigeria on May 12, 2015. Below the event it shows a timeline of related news.]

V. HISTORICAL PERSPECTIVE
Another view relevant to any health-related issue is a historical perspective based on the aggregation of a particular topic. The evolution of Ebola virus reporting from 2014 until 2017 is an example that can be explored through ER. Querying ER for Ebola virus disease returns over 20,000 events related to Ebola appearing after 2014. The content (over 200,000 news articles) can be analysed through several visual modules: temporal intensity, geographical spread and topical spread, among others. Figures 5 and 6 illustrate the temporal intensity (with the peak in October 2014) and the geographical spread (West Africa and the US) of Ebola-related events. We highlight that ER was able to notice the outbreak of the epidemic before it was made public. However, the question was not asked at the time, i.e. the query was not issued, because Ebola was not yet an identified issue and there was a lack of continuous attention; on the other hand, complete attention to all such epidemic-related topics could be a heavy burden on the system. This raises the research problem of predicting and alerting on high-density events like this one.

[Fig. 5. A temporal intensity visualisation in ER for the query 'Ebola virus disease', showing that the system could notice the outbreak of the epidemic before it was made public.]
[Fig. 6. ER screenshot showing a geographical spread visualisation module for the query 'Ebola virus disease'. Each mark shows the number of related news per location on the map.]
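For reproducibility, a query like the Ebola one above can be issued through Event Registry's Python SDK. The sketch below assumes the `eventregistry` package and an API key; class and method names follow that SDK's public interface but should be checked against its current documentation.

```python
from eventregistry import EventRegistry, QueryEvents, RequestEventsInfo

# An API key from eventregistry.org is assumed.
er = EventRegistry(apiKey="YOUR_API_KEY")

# Ask for events mentioning the Ebola virus disease concept.
query = QueryEvents(conceptUri=er.getConceptUri("Ebola virus disease"))
query.setRequestedResult(RequestEventsInfo(sortBy="date", count=20))

response = er.execQuery(query)
for event in response.get("events", {}).get("results", []):
    print(event.get("eventDate"), event.get("title", {}).get("eng"))
```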
ER can find articles and events related to a particular entity, topic, date, location or category, as well as measure their impact on social media (specifically, Twitter). This social media monitoring is complemented by TwitterObservatory (cf. [12]), leveraging in-house technology that uses data observation, enrichment and storage techniques for social media data presentation, search and analytics. Moreover, several successful tests have been done to extract sentiment from news based on the sentiment of the tweets associated with the news [13]. Extracting sentiment directly from news is still an open problem that remains to be tackled.

VI. CONCLUSIONS AND FURTHER WORK
In this paper we discussed the potential of several text mining tools dedicated to exploring worldwide multilingual news, focusing on matters of interest to Public Health. Further research (exploring the potential of NewsFeed, Wikifier and ER) includes: (i) the correlation of high-level concepts with low-level features; (ii) the showcase of hierarchies (e.g. in some disease) and how they can be drilled down to a variety of subtopics (e.g. different aspects of such a disease); (iii) the analysis of the impact of health-related issues on society (e.g. the impact of Ebola news on the adherence to insurance); (iv) the presence of PubMed/Medline in global news; and (v) the prediction of the consequences of a health event. Other research directions consider the dynamics of Public Health/Healthcare institutions where activities happen, affecting: (1) the decision makers, who make choices based on the legislation and on the available in-house monitoring systems operated by their own data scientists, (2) those data scientists, who explore the data and extract relevant information that contributes to the evaluation of Public Health scenarios, and (3) the technical team that deploys and maintains the data infrastructure where the data scientists are active.

[Fig. 7. Kibana screenshot showing different visualisation modules, based on queries to Elasticsearch, composing an interactive dashboard. This is an illustrative example from venzia.es to show the practical potential of this tool.]
It could be very useful to have easy-to-handle data visualisation modules (like the ones offered by Kibana and shown in Figure 7), allowing decision-makers to choose with a few clicks the representation of the data that makes sense for the problems they focus on. The data scientists could then manipulate the current workflows, maximising critical outputs and presenting data in a meaningful way, whilst minimising the resources required to drive data interrogation and presentation.

ACKNOWLEDGMENT
The authors would like to thank the EC Horizon 2020 MIDAS Project, funded by the European Union under grant agreement no. 727721, under the call SC1-PM-18-2016 - Big Data supporting Public Health policies.

REFERENCES
[1] J. Brank, G. Leban, and M. Grobelnik. A high-performance multi-threaded approach for clustering a stream of documents. In Proceedings of the 17th International Multiconference Information Society, 2014.
[2] J. Brank, G. Leban, and M. Grobelnik. Annotating documents with relevant Wikipedia concepts. Proceedings of SiKDD 2017, forthcoming, 2017.
[3] M. Dion, P. AbdelMalik, et al. Big data and the global public health intelligence network (GPHIN). CCDR, Volume 41-9, September 3, 2015: Big Data, 2015.
[4] ECDC. Epidemic intelligence. ecdc.europa.eu/en/threats-and-outbreaks/epidemic-intelligence. Accessed: 2017-09-05.
[5] M. Grobelnik. Observing global health and well-being. http://www.midasproject.eu/2017/07/24/observing-global-health-and-well-being/. MIDAS Project Blog, 2017. Accessed: 6/8/2017.
[6] M. Grobelnik and G. Leban. Event Registry's health panel. https://tinyurl.com/wits2017qnt, 2017. Accessed: 2017-07-25.
[7] M. Grobelnik and D. Mladenić. Simple classification into large topic ontology of web documents. CIT, 13(4):279–285, 2005.
[8] G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Cross-lingual detection of world events from news articles. Proceedings of the 2014 International Conference on Posters and Demonstrations Track, CEUR-WS.org, 1272:21–24, 2014.
[9] G. Leban, B. Fortuna, J. Brank, and M. Grobelnik. Event registry: learning about world events from news. Proceedings of the 23rd International Conference on World Wide Web, ACM, 2014.
[10] J. P. Linge, J. Belyaeva, R. Steinberger, M. Gemo, F. Fuart, D. Al-Khudhairy, S. Bucci, R. Yangarber, and E. van der Goot. MediSys: medical information system. In Advanced ICTs for disaster management and threat detection: collaborative and distributed frameworks, pages 131–142, 2010.
[11] A. Muhič, J. Rupnik, and P. Škraba. Cross-lingual document similarity. Proceedings of the ITI 2012 34th International Conference on Information Technology Interfaces (ITI), IEEE, pages 387–392, 2012.
[12] I. Novalija, M. Papler, and D. Mladenić. Towards social media mining: TwitterObservatory. Proceedings of the Slovenian Data Mining and Data Warehouses 2014, 2014.
[13] L. Rei, M. Grobelnik, and D. Mladenić. Event detection in twitter with an event knowledge base. Proceedings of SiKDD 2015, 2015.
[14] J. Rupnik, A. Muhič, G. Leban, P. Škraba, B. Fortuna, and M. Grobelnik. News across languages - cross-lingual document similarity and event tracking. Journal of Artificial Intelligence Research, 55:283–316, 2016.
[15] R. Steinberger, F. Fuart, B. Pouliquen, and E. van der Goot. MedISys: A multilingual media monitoring tool for medical intelligence and early warning. Proceedings of the International Disaster and Risk Conference IDRC Davos 2008 - Short and Extended Abstracts, pages 612–614, 2008.
[16] M. Trampuš and B. Novak. The internals of an aggregated web news feed. Proceedings of the 15th Multiconference on Information Society IS-2012, 2012.
[17] T. Štajner, I. Novalija, and D. Mladenić. Informal multilingual multi-domain sentiment analysis. Informatica, 37(4):373–380, 2013.
[18] WHO. Epidemic intelligence. www.who.int/csr/alertresponse/epidemicintelligence/en/. Accessed: 2017-09-05.
Ontology-based translation memory maintenance

Andraž Repar
Iolar d.o.o., Parmova 51, 1000 Ljubljana, Slovenia, and Jozef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia
repar.andraz@gmail.com

Senja Pollak
Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
senja.pollak@ijs.si

ABSTRACT
In this paper, we explore the use of text mining techniques for translation memory maintenance. Language service providers often have large databases of translations, called translation memories, which have been in use for a long time - leading to a gradual population of the translation memory with content from other domains (e.g. financial content being added to a medical translation memory). To the best of our knowledge, no tools exist that would effectively separate the content of a translation memory according to different domains. Having the ability to extract individual domains from low-quality translation memories could mean a significant benefit to language service providers looking to utilize modern translation methods, such as machine translation and automated terminology management. In the first stage, we used OntoGen, a semi-automatic ontology building tool which uses text mining techniques, to separate the segments in the translation memory according to domains. In the second stage, we wanted to test whether we could use the domains defined in the previous stage to build classification models - effectively using them as class labels in place of the costly and time-consuming manual annotation of segments.

Keywords
translation memory, language service provider, ontology, OntoGen, text classification

1. INTRODUCTION
In the translation industry, language service providers (LSPs) often offer a guarantee to their customers that they will never have to pay twice for the translation of the same text. In order to do so, they have come up with a way of saving and re-using past translations to reduce costs and offer discounts. Starting in the 1970s and 1980s, translation companies began using translation memories, which are essentially databases of bilingual segment pairs (source text – target text) along with some metadata. Whenever a new document is received for translation, it is leveraged against the translation memory for "exact" and "fuzzy" matches, and the results are used to calculate the final price of the translation. This technology really took off in the 1990s, and today virtually every language service provider on the market uses some kind of translation memory to store translations.

In theory, the translation memory concept involves the use of metadata to clearly mark the segments belonging to different domains and/or customers. However, metadata are often not added due to time pressure or other issues, and the information about the domain or customer is lost. Without this information it is difficult to reuse the translation memories for machine translation and/or terminology management. Finally, the quality of a translation memory can also degrade over the years, because segments may get accidentally stored in the wrong translation memory (the domain of the segments is not the same as the domain of the translation memory).

In this paper, we analyze one such translation memory used by the translation company Iolar to see whether we can use text mining techniques to extract domains and clean low-quality translation memories. We used the OntoGen [4] topic ontology editor to separate the dataset into distinct domains, and then used these domains for text classification in Weka [5].
Ontology learning is a well-researched area, with researchers using various techniques such as natural language processing ([10]), machine learning ([13]) and information retrieval ([3]). The same can be said of using machine learning for text classification (for example, [11], [9] and [7]). On the other hand, research into using data mining techniques for translation memory maintenance is scarce, with most authors focusing on spotting low-quality individual segments. Barbu [1] uses several machine learning algorithms to spot false segment pairs in translation memories, Sabet et al. [6] describe a system for unsupervised cleaning of translation memories without labeled training data, based on a configurable and extensible set of filters, and Nahata et al. [8] define a set of rules for a rule-based classifier which is in turn used to find low-quality segment pairs. A more recent topic that serves a similar purpose is quality estimation of machine-translated segments. For example, Specia et al. [12] describe a system that tries to predict the quality of machine-translated segments using machine learning.

2. DATA DESCRIPTION
The translation memory analyzed for this article has been in use for almost 15 years and contains parallel translation segments in English and Slovene. Initially, it was meant to store Marketing, Legal and Financial translations, but over the years various other domains have been stored in it. In addition to the three domains mentioned above, this translation memory also contains a large chunk of IT-related segments, such as user interface strings, user assistance texts and technical documentation of various IT devices (printers, scanners, monitors etc.). Given the content of these documents, we expect to see some overlap between domains – for example, a printer user manual will typically contain some legal information as well as some marketing-like language.

3. EXPERIMENTS
The most obvious way to go about this task would be to manually annotate a dataset from this translation memory and then use it to train a classifier. However, manual annotation is time-consuming and costly, so we first utilize OntoGen [4], a semi-automatic and data-driven ontology editor focused on editing topic ontologies, and then use the resulting ontology topics for building a text classifier that could be used for other translation memories and documents.

[Figure 1: Ontology visualization: 4 main domains are extracted (Financial, Marketing, IT, Legal), with two of them having additional subdomains.]

3.1 Preprocessing
The first step involved extracting the segment pairs and filtering them. The Slovene segment parts were discarded, because only one language is needed for this task; English was chosen because it is the source language in this translation memory. The TMX file contained 247,103 English-Slovene segment pairs. To cut down on the noise and remove the segments most difficult to classify, we decided to remove all segments with fewer than 8 words, leaving us with 121,593 segments.
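A preprocessing step like this can be sketched in a few lines. TMX is an XML dialect, so the sketch below parses the `<tu>` translation units, keeps the English `<seg>` text and drops short segments; the file name is a placeholder, and real TMX exports may differ in namespace or attribute details.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def english_segments(tmx_path, min_words=8):
    """Yield English source segments with at least `min_words` words."""
    tree = ET.parse(tmx_path)
    for tu in tree.getroot().iter("tu"):       # one translation unit per segment pair
        for tuv in tu.iter("tuv"):
            if tuv.get(XML_LANG, "").lower().startswith("en"):
                seg = tuv.findtext("seg", default="")
                if len(seg.split()) >= min_words:
                    yield seg

segments = list(english_segments("memory.tmx"))  # placeholder file name
print(len(segments), "segments kept")
```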
3.2 Ontology creation
The selected segments were saved in the Named Line-Documents format suitable for OntoGen. Given the size of the file, the processing in OntoGen was slow-going. We tried various approaches in OntoGen and finally settled on using the k-means clustering functionality (with k=10) to generate various sets of segments corresponding to different keywords, and then manually grouped them into meaningful domains based on our translation experience with this translation memory.

After experimenting with various ontology building techniques in OntoGen, the following topic ontology was constructed (with the number of documents in parentheses): IT (51,247), subdivided into ITGeneral and User Interface; Marketing (11,567); Financial (12,987); and Legal (42,163), subdivided into Contracts, Tenders and IT Legal¹. A graphical representation is shown in Figure 1.

¹ This group contains segments from privacy policies and license agreements of various software applications.

3.3 Classification
In the final step we exported the domains from OntoGen, attached them to their corresponding segments and loaded the data into the Weka machine learning toolkit. We tested various machine learning classification algorithms (Naïve Bayes Multinomial, SVM, J48) to find which one gives the best results. 10-fold cross-validation was used for all experiments. We applied Weka's StringToWordVector filter and used a stoplist (the 300 most frequent words from the BNC corpus [2]) to filter out the most common words.

In the first phase, we had both topics and subtopics – where a subtopic existed, we glued the topic and subtopic together to get a distinct class. This means we had 7 distinct classes: ITUserInterface, ITITGeneral, Financial, Marketing, LegalContracts, LegalTenders, LegalITLegal. In the second phase, we used only the main topics, meaning that 4 classes were used: IT, Financial, Marketing, Legal.

Because the original dataset was fairly large (more than 100,000 segments), we had to significantly reduce it in order to be able to complete the calculations in Weka in reasonable time. However, we couldn't just take the first n segments, because the different topics were not uniformly distributed across the dataset. Therefore, we took every 10th segment, leaving us with a dataset of about 10,000 segments. Tables 1 and 2 contain information about the performance of the three classifiers; for a detailed analysis see Section 4.2.
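The paper runs this step in Weka; purely as an illustration, an equivalent experiment can be sketched with scikit-learn. The stoplist, the every-10th-segment subsampling and the 10-fold cross-validation mirror the description above, while variable names and the stoplist file are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# `segments` and `labels` are assumed: one OntoGen domain label per segment.
sampled_X = segments[::10]               # every 10th segment, as in the paper
sampled_y = labels[::10]

stoplist = open("bnc_top300.txt").read().split()  # placeholder for the BNC stoplist

model = make_pipeline(
    TfidfVectorizer(stop_words=stoplist),  # roughly analogous to StringToWordVector
    MultinomialNB(),                       # Naive Bayes Multinomial
)
scores = cross_val_score(model, sampled_X, sampled_y, cv=10, scoring="accuracy")
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```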
Table 1: Classifier performance with 7 labels (accuracy of the ZeroR classifier for the majority class = 0.349)
              J48     SMO     NB Multinomial
Accuracy      0.511   0.495   0.583
Precision     0.507   0.483   0.581
Recall        0.511   0.495   0.583
F-measure     0.472   0.483   0.580

Table 2: Classifier performance with 4 labels (accuracy of the ZeroR classifier for the majority class = 0.406)
              J48     SMO     NB Multinomial
Accuracy      0.597   0.619   0.671
Precision     0.615   0.608   0.678
Recall        0.597   0.619   0.671
F-measure     0.576   0.610   0.673

4. EVALUATION AND INTERPRETATION OF RESULTS
When evaluating the results of the hierarchical clustering by OntoGen and of the classification, one should bear in mind that in many cases no clear boundaries between the domains exist. This was to be expected, on the one hand due to the short length of the documents, and on the other due to segments that are very difficult to assign to a single domain, for example:

• The system must support operation of the HSM system and the archiving of files even if the file system operates in the Windows cluster.
• The latest Windows operating systems have a firewall built in.

The first sentence comes from a tender document, while the second one comes from an IT user manual. Even for a human annotator this would be a difficult task, and we would most likely see low levels of inter-annotator agreement.

4.1 Ontology creation
To evaluate the results of the ontology creation in OntoGen, we extracted 50 random segments for each of the 7 topics/subtopics, manually annotated them and compared the results (see Table 3). Overall, a precision of 0.81 is quite good, considering that we are working with sentences, which are difficult to classify. It is also important not to lose sight of the fact that there can be some overlap between the topics and that certain sentences cannot be adequately classified into any of the available topics. The overlap between the various topics causes a certain degree of ambiguity, but we believe that the precision is high enough to use the topics extracted in OntoGen as class labels for building a classifier.

Table 3: Manual evaluation of the ontology results on 50 segments per domain
Topic             Precision
Financial         0.76
ITGeneral         0.80
ITUserInterface   0.86
LegalContracts    0.80
LegalITLegal      0.86
LegalTenders      0.78
Marketing         0.80
Average           0.81

4.2 Classification
The results of the classification with 7 labels are not promising. The performance of all classifiers does exceed the majority-class classifier significantly, but the accuracy is not high enough for production use (close to 60% for the best performing classifier). Looking at the confusion matrix in Figure 2, we can observe that the ITGeneral topic overlaps with quite a few other topics and is the largest culprit for the low performance. A significant part of the false positives originate in the ITGeneral topic for all topics apart from Financial and LegalContracts (classes c and e in Figure 2). These two topics also have the highest precision (0.688 and 0.668, respectively).

[Figure 2: Confusion matrix of the Naïve Bayes Multinomial classifier - 7 labels.]

When we focus only on the 4 main labels, the results are better. Naïve Bayes Multinomial is again the best performing classifier, with its accuracy reaching a little over 67%. Looking at the confusion matrix in Figure 3, it is evident that the first 3 labels perform significantly better than the last one. Indeed, the precision of Legal, IT and Financial is around 0.7, while that of Marketing is just a little over 0.4 (for detailed results see Table 4).

[Figure 3: Confusion matrix of the Naïve Bayes Multinomial classifier - 4 labels.]

Table 4: Detailed performance of the Naïve Bayes Multinomial classifier (4 labels)
            Precision  Recall  F-measure
Legal       0.713      0.644   0.677
IT          0.711      0.746   0.728
Financial   0.685      0.646   0.665
Marketing   0.418      0.508   0.459

The largest issue that we have not been able to overcome in this analysis is that a huge chunk of the segments in this dataset are IT-related – this is especially true of the Marketing and certain Legal segments (e.g. terms of use, privacy statements, press releases or advertising material for IT devices) – which means that it is often difficult to differentiate between a Legal/Marketing segment and a regular IT one. This issue is very clearly seen in the confusion matrix in Figure 3. In contrast, the Financial segments have no immediate relation to any IT content, making them a much more distinct category.
5. CONCLUSION AND FUTURE WORK
This paper tries to determine whether text mining techniques can be used to facilitate translation memory maintenance in a language service provider environment. Given the fast-paced nature of work in the translation industry, it is only natural that the quality of translation memories degrades over time. Even if they are perfectly designed, noise will inevitably be introduced, leading to reduced usefulness for other language applications.

At the outset, we had two questions: 1) whether OntoGen can be used to divide the content of a particular low-quality translation memory, and 2) whether the resulting topics can be used as labels to build a classifier for other translation memories and documents. The main reason was to find a shortcut around manual annotation, which is costly and time-consuming.

We successfully managed to build an ontology, but the boundaries between some topics were relatively vague. One reason for this is that we had to deal with sentences – as opposed to larger chunks of text – which are difficult to classify. The second issue was the fact that many of these topics were in fact inter-related, and some of the segments could easily have been classified into more than one domain. In particular, the Legal, IT and Marketing domains are closely related, because a lot of Legal and Marketing segments originated in IT-related translation jobs. One could argue that the IT and Marketing domains could be combined into one category, since there is so much overlap; however, from a strictly translator's point of view it makes sense to have separate categories, because different translation strategies are normally used for marketing (i.e. press releases) and general IT (i.e. user manuals, help articles) translation jobs.

The results of the ontology creation were promising, with manual evaluation (see Table 3) showing that around 4 in 5 strings were assigned a correct label. However, the picture was much less clear when it came to building a classifier. It turned out that the full ontology was too complex for the classification algorithms used in this paper (see Section 4.2). When we used only the four main topics as labels, the results started approaching acceptable levels, with an accuracy of 67% (compared to 0.406 for the majority class). We would still ideally like to see the accuracy break the 75% or 80% barrier.

In its current state, the classifier is not accurate enough to be used in production. However, when there are reasonably clear boundaries between topics in OntoGen, the resulting labels can be used successfully – as is evident from the performance of the Financial label. This is in itself a useful achievement, because there is currently no way to export just the finance-related segments from the translation memory. An obvious route to better classification performance would be to use just those topics that are clearly separated from the other parts of the dataset.

In terms of future work, we will explore text classification on manually annotated high-quality translation memories. Finally, an interesting route would be to utilize domain terminology to enhance highly domain-specific terms, assigning a higher weight to terminological features.
6. REFERENCES
[1] E. Barbu. Spotting false translation segments in translation memories, 2015.
[2] J. H. Clear. The British National Corpus. In The Digital Word, pages 163–187. MIT Press, Cambridge, MA, USA, 1993.
[3] H. Cunningham. Information extraction, automatic. Encyclopedia of Language and Linguistics, 2nd Edition, 5:665–677, 2006.
[4] B. Fortuna, M. Grobelnik, and D. Mladenic. OntoGen: Semi-automatic Ontology Editor, pages 309–318. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.
[6] M. Jalili Sabet, M. Negri, M. Turchi, J. G. C. de Souza, and M. Federico. TMop: a tool for unsupervised translation memory cleaning. In Proceedings of ACL-2016 System Demonstrations, pages 49–54. Association for Computational Linguistics, 2016.
[7] T. Joachims. Text categorization with Support Vector Machines: Learning with many relevant features, pages 137–142. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
[8] N. Nahata, T. Nayak, S. Pal, and S. Naskar. Rule based classifier for translation memory cleaning, May 2016.
[9] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103–134, May 2000.
[10] S. P. Ponzetto and M. Strube. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 2, AAAI'07, pages 1440–1445. AAAI Press, 2007.
[11] F. Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1–47, Mar. 2002.
[12] L. Specia, G. Paetzold, and C. Scarton. Multi-level translation quality prediction with QuEst++. In ACL-IJCNLP 2015 System Demonstrations, pages 115–120, Beijing, China, 2015.
[13] F. M. Suchanek, G. Ifrim, and G. Weikum. Combining linguistic and statistical analysis to extract relations from web documents. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 712–717, New York, NY, USA, 2006. ACM.

Audience Segmentation Based on Topic Profiles

Matic Kladnik, Luka Stopar, Blaz Fortuna, Dunja Mladenić
Jožef Stefan Institute and Jožef Stefan International Postgraduate School
Jamova 39, 1000 Ljubljana, Slovenia
e-mail: matic.kladnik@ijs.si

ABSTRACT
Audience segmentation is often applied on Web portals to gain insights into the audience, support targeted marketing and, in general, provide on-line recommendations to the users. We propose an approach to audience segmentation that is based on using machine learning on topic profiles of the visited content. Our preliminary experiments on a small sample of log data show that the proposed approach is promising, and that the proposed combination of features capturing short-term and long-term user interest gives better results than using only the short-term interests of the user.

1. INTRODUCTION
A large number of returning users regularly visiting the same Web portals offers an opportunity to apply audience modeling, considering descriptions of the visited content, different characteristics of the users and the user behavior. Instead of the usual audience segmentation based on user profiles, which is also commonly adopted in recommendation systems [1], we propose an audience segmentation approach that is based on the topic profiles of the visited content.
As the users potentially have interests in different topics, we allow the same user to occur in several segments.

In this paper, we describe the proposed approach and, on a small sample of real-world log data, test the hypothesis that the proposed combination of features, capturing the content of the recently visited pages and the properties of all the pages visited by the user, improves the quality of the segmentation.

The rest of this paper is structured as follows. Section 2 describes the problem setting and the dataset used in the experiments. The proposed approach is described in Section 3, together with the description of the background knowledge that we have used for mapping from the space of users into the space of topic profiles. Section 4 gives the results of the preliminary experiments, while the conclusions are presented in Section 5.

2. PROBLEM SETTING AND DATA
High-quality Web portals that offer regularly updated content, such as market data, news articles or financial data, attract many loyal users [2]. Today, vendors offer user data obtained by third-party cookies that covers a whole range of user properties, including demographics, interests and geography. The problem that we are addressing is automatic audience segmentation, where vendor data on the users is potentially available in addition to the usual Web log files and the content of the visited pages. In addition, we propose to use background knowledge in the form of a pre-trained machine learning model that classifies Web pages into a predefined custom taxonomy. In this way, each Web page is, based on its textual content, assigned a ranked list of content topics of different granularity. For instance, the assigned topics may be Business, Business/Financial_Services/Medical_Billing, Business/News_and_Media, Society/Issues/Gun_Control.

The dataset that we have used to test the proposed approach was obtained from an international media company. Almost 3,000 Web pages were crawled from the company Web site. The anonymized user data was obtained for more than half a million users that visited the Web site within one selected day. All the considered text is in the English language.

We have pre-processed the data to remove references to Web pages that have limited textual content or non-standard formatting. The Web pages were processed in a standard way to extract the textual content, remove the standard English stop-words and represent each Web page as a bag-of-words (BoW) with TFIDF weights. In addition to the content, each page has metadata, including a set of content labels manually assigned by the editorial team; for instance: brexit, Europe, money, davos, jobs, London, markets. These content labels were historically used to annotate the users visiting the pages. Each user is thus described by a set of properties, including demographics and the content labels of the pages visited over a longer period of time.

3. APPROACH DESCRIPTION
Audience segmentation is commonly based on grouping the users by their common interests and some other properties, e.g., demographic properties and behavioral similarity. However, the same user may have several interests and exhibit different behavior depending on the current focus. This may result in grouping together users that do not have much in common, except that they share some (but not the same) interests with a third user.

Thus we propose an approach to audience segmentation based on the similarity of the topics that the users are interested in. The idea is to view the problem through the topics of the visited Web pages and, based on that, obtain segments of the users.

The architecture of the proposed approach is shown in Figure 1. The whole pipeline consists of several steps (a sketch follows the list):
1. From the log file of the user visits we obtain a list of visited pages (URLs).
2. Using background knowledge in the form of a machine learning model for classifying documents into a predefined custom taxonomy, we assign a ranked list of topics to each URL.
3. For each URL we select one or a few topics with the highest rank and form a collection of URL–Topic pairs.
4. The representation of a topic is based on the list of URLs that were assigned the topic and the list of UserIds of the users that visited those URLs.
5. The topic profiles are processed by a clustering algorithm to obtain segments of the users.

[Figure 1. Architecture of the proposed approach to audience segmentation. The user visits of the Web site are combined with the visited content and the properties of the users to obtain segments of the users based on the topic clusters.]
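The following is a minimal sketch of steps 1-5 under simplifying assumptions: the page classifier is abstracted as a function, topic profiles are represented only by the words of the pages assigned to a topic, and all names are illustrative.

```python
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed inputs: visits as (user_id, url) pairs, page_text maps url -> text,
# and classify(text) returns a ranked list of taxonomy topics.
def build_segments(visits, page_text, classify, n_clusters=50):
    topic_urls, topic_users = defaultdict(set), defaultdict(set)
    for user_id, url in visits:                     # steps 1-3
        for topic in classify(page_text[url])[:1]:  # keep the top-ranked topic
            topic_urls[topic].add(url)
            topic_users[topic].add(user_id)

    # Step 4: represent each topic by the concatenated text of its URLs.
    topics = sorted(topic_urls)
    docs = [" ".join(page_text[u] for u in topic_urls[t]) for t in topics]
    profiles = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # Step 5: cluster topic profiles, then map clusters back to user sets.
    labels = KMeans(n_clusters=n_clusters, n_init=5).fit_predict(profiles)
    segments = defaultdict(set)
    for topic, cluster in zip(topics, labels):
        segments[cluster] |= topic_users[topic]     # segments may overlap
    return segments
```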
As the topic profiles are built around the URLs that were assigned the topics using background knowledge, we separately keep a list of UserIds for each URL (obtained from the log file). When constructing the features for each topic, we combine different data sources:
• the content of the Web pages visited by the users,
• the properties of the visited pages,
• the properties of the users, and
• the information about the visits.

The background knowledge that we have used in the experiments is based on a subset of a large custom taxonomy [2]. The subset was defined by the domain expert from the company running the Web portal and consists of several hundred topics.

We use the DMOZ classifier with the custom taxonomy to classify each Web page into a hierarchical content topic. Pages can be classified into topics on different levels of the hierarchy, where lower levels give a more specific classification. Upper levels also give context to the lower levels of the classification. If we compare the following two topics that mention aerospace in their hierarchy:
• Science/Technology/Aerospace,
• Business/Aerospace_and_Defense/Aeronautical,
we can see that the first content topic is put into the context of Science and Technology, whereas the second is put into the context of Business and Aerospace and Defense. This approach gives us more information about the content topic.

In our experiments, the content of the Web pages is obtained by crawling the Web portal. The user data, including the properties of the visited pages, is obtained from the vendor data.

4. EVALUATION
In the experimental evaluation we combine two sources of data: the URLs from the log file and the user data from the users' history. The two feature sets used for data representation correspond to these two data sources: the bag-of-words from the Web page corresponding to the URL, and the user interest in the form of a collection of content labels of the Web pages visited by the user over a longer period of time (see Table 1).

Table 1: Features used for audience segmentation
Source          Description                               No. of values
Web pages       BoW - Words from the Web page             59,929
User interest   Content labels of the visited Web pages    1,268

To compare the influence of the different feature sources on the obtained segmentation, we use a cluster dispersity measure. Specifically, we measure the weighted average distance between the examples and their centroid, normalized by the average distance to the global mean (i.e. the center of the data). The formula is given below:

    D = \frac{\sum_{i=1}^{k} \frac{n_i}{n} \cdot \frac{1}{n_i} \sum_{j=1}^{n_i} d(\mu_i, x_j)}{\frac{1}{n} \sum_{j=1}^{n} d(\mu, x_j)} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} d(\mu_i, x_j)}{\sum_{j=1}^{n} d(\mu, x_j)}

where D represents the dispersity, d is a distance measure (in our case the cosine distance), n_i and \mu_i are the size and centroid of the i-th cluster, x_j is the j-th example and \mu is the global mean. Intuitively, examples in more compact clusters will lie closer to their centroid and so will contribute less to the dispersity than those in more disperse clusters.
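A direct implementation of the dispersity measure, assuming a dense feature matrix and scipy's cosine distance, could look as follows:

```python
import numpy as np
from scipy.spatial.distance import cosine

def dispersity(X, labels):
    """Dispersity D: sum of cosine distances of examples to their cluster
    centroid, divided by the sum of distances to the global centroid."""
    mu = X.mean(axis=0)
    within = 0.0
    for i in np.unique(labels):
        cluster = X[labels == i]
        centroid = cluster.mean(axis=0)
        within += sum(cosine(x, centroid) for x in cluster)
    total = sum(cosine(x, mu) for x in X)
    return within / total
```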
[Figure 2: Experimental results comparing the relative mean centroid distances for the three data representations (user interests, page content, and user interest + page content) over different numbers of clusters (k ranging from 20 to 200).]

The experiments on clustering topic profiles were performed on three different data representations: one that considers only the content of the visited pages, one that considers only the user interest, and a combination of the two feature sets. We applied k-means clustering, varying the value of the parameter k (the number of clusters) from 20 to 200. As the results of the applied clustering method depend on the random seed used for choosing the initial clusters, we repeated the process five times for each value of k. Figure 2 shows the results of the experiments averaged over the five runs.

We can see that the smallest distance of the topic clustering is obtained when the data is represented only by user interests, which represent a long-term interest of the user on an aggregated level. This can be particularly attributed to the fact that the number of different content labels is much smaller than the number of different words from the page content. Combining user interest (capturing the history of the user) and page content (of the pages visited in the considered log file) gives better results than using only the page content.
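The experimental protocol (k from 20 to 200, five random restarts per k, averaged dispersity) can be sketched as follows, reusing the `dispersity` helper above; `X` stands for the topic-profile feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def averaged_dispersity(X, ks=(20, 50, 100, 200), runs=5):
    """Average the dispersity over several random k-means initializations."""
    results = {}
    for k in ks:
        scores = []
        for seed in range(runs):
            labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
            scores.append(dispersity(X, labels))
        results[k] = float(np.mean(scores))
    return results
```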
Looking at the topic clusters that we have obtained, we can see that similar topics are clustered together (see Table 2). For instance, topics such as Health/Addictions, Business/Chemicals/Wholesale_and_Distribution and Recreation/Drugs are in the same cluster.

Table 2: Illustrative example of some clusters obtained when generating 50 clusters using both feature sets. For a few selected clusters we show the topics that belong to the cluster.
Cluster 1: Business/Biotechnology_and_Pharmaceuticals, Science/Astronomy, Health/Child_Health, Health/Conditions_and_Diseases/Cancer, Health/Conditions_and_Diseases/Immune_Disorders, Health/Conditions_and_Diseases/Infectious_Diseases, Health/Pharmacy
Cluster 2: Recreation/Models
Cluster 3: Home/Personal_Finance
Cluster 4: Health/Addictions, Recreation/Drugs, Business/Chemicals/Wholesale_and_Distribution, Business/Food_and_Related_Products/Beverages, Science/Biology/Bioinformatics, Society/Issues/Gun_Control
Cluster 5: Business/Financial_Services/Venture_Capital/Regional, Business/Transportation_and_Logistics/Bus, Business/Transportation_and_Logistics/Rail, Government/Agencies, Recreation/Autos/Makes_and_Models/Honda, Science/Environment, Science/Environment/Carbon_Cycle, …

To obtain audience segments from the clustering of the topic profiles, we map the topic clusters onto sets of UserIds, based on the user visits of the URLs that are classified into each of the topics in the cluster. In this way we obtain non-mutually-exclusive audience segments. The average number of users per segment is given in Table 3; the average segment size is considerably larger than the median, indicating that the segment sizes are unevenly distributed.

Table 3: Average size of the audience segments in relation to the granularity of the segmentation
No. of segments   Average size   Median size
20                43,592.35      589.5
50                17,436.94      301
100                8,718.47      229.5
200                4,359.235     193

5. CONCLUSION
We have proposed an approach to audience segmentation based on topic profiles of the visited Web pages instead of the commonly used user profiles. The topics are obtained by classifying the visited Web pages into a custom taxonomy. The classification is performed automatically using a pre-trained machine learning model. The topic profiles are formed from the properties of the users visiting the Web pages that are classified into the topic, and from the content of the Web pages.

Preliminary experiments on a small sample of log data show that the proposed approach is promising, grouping together similar topics and, based on that, segmenting the audience into reasonably populated segments. Namely, one of the issues with audience segmentation when users have multiple interests is that many users that are not similar to each other are assigned to the same segment, due to similarity with other users from the same segment. Larger-scale experiments are needed in future work to confirm the results and provide additional insights into the other properties of the users from the same segment, such as demographics, geography and job.

6. ACKNOWLEDGMENTS
This work was partially supported by the Slovenian Research Agency.

References
[1] Adomavicius, G. and Tuzhilin, A. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 6, June 2005.
[2] Fortuna, B., Fortuna, C., Mladenic, D. Real-time news recommender system. In Proceedings of Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010. Lecture Notes in Computer Science, vol. 6323, pp. 583-586. Springer, Berlin/Heidelberg/New York, 2010.
[3] Grobelnik, M., Brank, J., Mladenic, D., Novak, B., Fortuna, B. Using DMoz for constructing ontology from data stream. In Proceedings of the 28th International Conference on Information Technology Interfaces, June 19-22, 2006, Cavtat/Dubrovnik, Croatia, pp. 439-444. University of Zagreb, SRCE University Computing Centre, 2006.

Building Client's Risk Profile Based on Call Detail Records

Zala Herga¹³, Casey Doyle², Stephen Dipple², Caleb Nasman², Gyorgy Korniss², Boleslaw Szymanski², Janez Brank¹, Jan Rupnik¹, and Dunja Mladenić¹³
¹Jožef Stefan Institute, Jamova 39, Ljubljana, Slovenia
²Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180, USA
³Jožef Stefan Postgraduate School, Jamova 39, Ljubljana, Slovenia
{zala.herga, janez.brank, jan.rupnik, dunja.mladenic}@ijs.si
{doylec3, korniss, nasmancc1, dippls, szymab}@rpi.edu

ABSTRACT
Data collected from mobile phones can be used to uncover underlying social network dynamics and individuals' behavioral patterns. Based on a Call Detail Records dataset, we build a weighted, directed network and analyze its properties. In addition to node-level network measures, we extract an extensive consumption- and mobility-based feature set. We show that the extracted network and consumption features can be used to model an individual's risk profile.

Keywords
Mobile Phone Network, CDR, Supervised learning

1. INTRODUCTION
The Call Detail Records (CDR) dataset is a relatively standard dataset obtained by mobile phone operators. One record in the CDR dataset corresponds to a communication event between two mobile phone users and includes a time stamp, the type of event (call, text), the direction (incoming or outgoing), etc. This data type reveals behavioral patterns that can be used to identify a user's personality [2], spending habits [11] or socioeconomic level [5]. Here, we are interested in using the data to build each client's risk profile; in particular, we attempt to use this data to predict user defaults. To this end, we focus our analysis around whether the client's phone number was blocked at the end of the month (indicating issues potentially related to defaulting behavior), using this information to label clients as good or defaulted. The dataset used is completely anonymised.

The structure of the rest of the paper is as follows: Section 2 presents the characteristics of the network built from the CDR, Section 3 describes the feature extraction, Section 4 presents the probability-of-default models and their evaluation, and Section 5 concludes the paper.
Wherever it is not feasible to use a weighted The Call Detail Records (CDR) dataset is a relatively stan- edge scheme, we create an unweighted graph using a cutoff dard dataset obtained by mobile phone operators. One to define how many outgoing communications from one node record in the CDR dataset corresponds to a communication to another constitutes a connection (ie we use the frequency event between two mobile phone users and includes time to define whether a connection exists at all, and all connec- stamp, type of event (call, text), direction (in- or outgoing) tions are still directed but have equal weight) [5]. Using a etc. This data type reveals behavioral patterns that can be low cutoff introduces a lot of noise into the system and is used to identify user’s personality [2], spending habits [11] less representative of a true social network as many of the or socioeconomic level [5]. Here, we are interested in using edges are too weak to accurately indicate a social connection the data to build each client’s risk profile; in particular, we between two individuals, but choosing too high of a cutoff attempt to use this data to predict user defaults. To this restricts the network and discards potentially valuable data end, we focus our analysis around whether the clients phone connecting nodes and communities together. number was blocked at the end of month (indicating issues potentially related to the defaulting behavior), using this Using these methods, our data set translates to a network data to label clients as good or defaulted. The dataset used with a giant component comprising 99.14% of the it. Of is completely anonymised. course, the number of edges and size of the giant compo- nent decreases quickly when the unweighted cutoff scheme Structure of the rest of the paper is as follows: Section 2 is used (Fig. 1). The size of the giant component decreases presents characteristics of the network built from the CDR, linearly with increases in the cutoff, while the decrease in Section 3 describes feature extraction, Section 4 presents the number of edges levels off as a power law with γ ≈ 0.75. probability of default models and their evaluation and Sec- The degree distribution also changes slightly with the cutoff tion 5 concludes the paper. than without; in both cases the distribution has a fat tail that is well approximated by a power law, but the expo- nent increases with a larger cutoff. For the general case of the weighted edges with no cutoff, the power law tail has an exponent of γ ≈ −4.3 while with a high cutoff such as thirty the power law tail exhibits an exponent of γ ≈ −6 (Fig. 2), both in general agreement with prior work on mo- bile network data [5, 7]. Similarly, the distribution of node strengths (defined as the sum of the weights of its adjacent edges) also exhibits a heavy tailed decay as expected [7]. 27 Figure 1: (a) The size of the giant component G as it decreases linearly with higher cutoff criteria to form an edge between two nodes. (b) The total number of edges in the system decreases as a power law with γ ≈ 0.75. Figure 2: (a) The degree distribution for various cutoff criteria to create an edge between two nodes. As the cutoff increases, the peak of the distribution shifts left until it peaks at zero. (b) The same distributions on a log-log scale to highlight the power law tail of the distributions. Shown here are the two extreme cutoffs tested in order to highlight the increase in the gamma value for higher cutoffs. 
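The construction just described can be sketched in Python (a minimal sketch using networkx; the record format, variable names and cutoff value are illustrative, not taken from the paper):

    # Sketch: build the weighted, directed CDR network and an unweighted variant.
    import networkx as nx

    def build_weighted_graph(cdr_records):
        # cdr_records: iterable of (source_id, target_id) outgoing communication events
        g = nx.DiGraph()
        for src, dst in cdr_records:
            if g.has_edge(src, dst):
                g[src][dst]["weight"] += 1          # frequency of outgoing communication
            else:
                g.add_edge(src, dst, weight=1)
        # Distance between nodes, as defined above: w_avg / w_ij.
        w_avg = sum(w for _, _, w in g.edges(data="weight")) / g.number_of_edges()
        for src, dst, w in g.edges(data="weight"):
            g[src][dst]["distance"] = w_avg / w
        return g

    def apply_cutoff(g, cutoff=5):
        # Unweighted scheme: keep a directed edge only if it carries at least
        # `cutoff` outgoing communications; all kept edges have equal weight.
        h = nx.DiGraph()
        h.add_nodes_from(g.nodes())
        h.add_edges_from((u, v) for u, v, w in g.edges(data="weight") if w >= cutoff)
        return h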
Additionally, we study the distribution of some higher-level node-based measures such as reciprocity [14], which is surprisingly low. Without a cutoff, only 38.65% of all links are reciprocated, while higher cutoffs increase the fraction of reciprocated links up to a maximum of only 41.57% when the cutoff is fifteen. We further measure the reciprocity under the weighted network scheme by defining the weighted reciprocity [12, 13] as $R_{ij} = |w_{ij} - w_{ji}|/(w_{ij} + w_{ji})$. Using this metric, the network shows an average weighted reciprocity of only 0.3235, further indicating the low reciprocity of the network.

Finally, we use node centrality to measure how the nodes position themselves within the communication paths across the network (nodes with high centrality are the most likely to connect communities and are therefore very important to the study of how risk patterns propagate across the network). For this purpose, we utilize the closeness centrality [1, 10, 3], a node-level measurement that uses the shortest paths across the network to identify where nodes lie in the network structure. Specifically, it is a ranking of the distance from the node in question to every other node, defined as $C_C(i) = (N-1)\left(\sum_{j \neq i} d_{ij}\right)^{-1}$, where $C_C$ is the closeness centrality and $d_{ij}$ is the length of the shortest path between nodes $i$ and $j$ (assuming a path exists). Unfortunately, this measure only works on connected graphs, so to analyze the full unconnected graph we also study the harmonic closeness centrality [9, 8], defined instead as $C_H(i) = \frac{1}{n-1}\sum_{j \neq i}\frac{1}{d_{ij}}$. As seen in Fig. 3, the closeness centrality has a tightly grouped, high density of relatively high values. This implies a very well connected graph, such that the shortest path between any two nodes is short. This can further be seen via the harmonic centrality, which also has a relatively low density of low-scoring individuals even with the inclusion of nodes not within the giant component. This implies that even the nodes that are not connected to the giant component tend to form small, tightly connected communities of their own.

Figure 3: Distribution of closeness and harmonic centrality scores across the network, where higher scores indicate a more central position in the network. The closeness centrality only considers nodes in the giant component (and the distribution is therefore only calculated for those nodes). The harmonic centrality includes all nodes in the network.
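Both measures can be computed directly from the graph built above (a minimal sketch; taking unreachable nodes to contribute zero to the harmonic sum is the convention that keeps the measure well defined on disconnected graphs):

    # Sketch: weighted reciprocity and harmonic closeness centrality.
    import networkx as nx

    def weighted_reciprocity(g, i, j):
        # R_ij = |w_ij - w_ji| / (w_ij + w_ji);
        # 0 for perfectly balanced links, 1 for one-directional links.
        w_ij = g[i][j]["weight"] if g.has_edge(i, j) else 0
        w_ji = g[j][i]["weight"] if g.has_edge(j, i) else 0
        return abs(w_ij - w_ji) / (w_ij + w_ji)

    def harmonic_centrality(g, i):
        # C_H(i) = 1/(n-1) * sum over j != i of 1/d_ij; unreachable j contribute 0.
        n = g.number_of_nodes()
        lengths = nx.shortest_path_length(g, source=i)   # hop-count distances
        return sum(1.0 / d for j, d in lengths.items() if j != i) / (n - 1)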
3. FEATURE EXTRACTION
After understanding the network dynamics, our aim was to model individuals' behavioral patterns. For that reason we extracted three types of behavior-related features from the CDR: the individual's consumption, social network and mobility. Some of the features were extracted for different time windows (e.g. per day, per week, per hour of day), separately for incoming and outgoing events and/or separately for each event type (call, text). We also added another, more technical category, which relates to the individual's position in the underlying network. Together, more than 6000 features were extracted. Each category is described in more detail below.

1. Consumption features: These features are related to the individual's usage of the mobile phone. For each individual we extracted the number of all calls, the number of all texts, the total duration of calls, the average duration of calls, and the average time between consecutive events.

2. Social network features: This type of feature focuses on the number of contacts and reciprocated events: the number of unique contacts, the number of contacts with which the individual exchanges on average at least 5 texts per week / 2 calls per week, the number of reciprocated call events, the median time between reciprocated call events, and the median time to answer a text.

3. Mobility features: These features are based on the used BTS tower locations and include the average daily radius of gyration (see the sketch after this list), the average distance traveled per day of the week, the popular cell towers that sum up to 90% of the records, and the average number of unique cell towers used per week.

4. Node-level network measures: These features all rely on the individual's location within the social network built from their usage statistics. The details of these metrics are discussed in Section 2; they represent each node's level of importance to the overall social network as well as how deeply embedded the individual is.
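As an illustration of the mobility category, the radius of gyration can be computed from tower coordinates as follows (a minimal sketch; it uses the common definition as the root-mean-square distance from the centroid of the visited locations, and the planar-coordinate assumption is our simplification):

    # Sketch: daily radius of gyration from the coordinates of visited towers.
    import numpy as np

    def radius_of_gyration(positions):
        # positions: array of shape (n, 2) with (x, y) coordinates of the towers
        # used during one day; RMS distance from the centroid of the visits.
        pts = np.asarray(positions, dtype=float)
        centroid = pts.mean(axis=0)
        return np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean())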
3.1 Geographic analysis
Geographic analysis was performed to help us with the specific goal of building an individual's risk profile. Analyzing geographic features requires a definition of an individual's location that takes into account that most people connect to many different cell towers over the course of a three-month period. For our purposes, we use each user's top two most used cell towers. We assign an individual to both their most used and second most used towers to account for the likelihood that a user will spend large amounts of time both at their residence and at their workplace. From there we analyze the number of people that exhibit default behavior for each tower or district and identify high-risk geographic regions. Based on that analysis we calculated the empirical probability of default for each cell tower. These probabilities were used as two additional features, one for each of the two most commonly used towers of each user.

4. MODELING
Our aim was to model the probability of default for each client based on the extracted phone usage patterns and node-level network measures. We present the results of fitting several linear regression models with varying parameters. The features described in the previous sections were used for modeling, and all features were normalized to the standard score (z-score). We divided our dataset into a train and a test set in a 70:30 ratio.

We started with a linear model (labeled as glm-6 in Figure 4) that was based on only six predictor variables: frequency (corresponding to the node's strength), duration (the sum over the user's call events), degree, harmonic centrality and the two geographic (cell-tower probability-of-default) variables. We chose these features because we believed that they carry much stronger signals than the other 6048 features. The p-value is < 0.01 for all of them, except for duration, which has a p-value of 0.97. Overall, this implies that network measures are good predictors of default behavior.

4.1 Principal Component Analysis
Further, when dealing with the larger set of features (6048), principal component analysis (PCA) was performed on the train set for feature space reduction. PCA is a method that decomposes the feature space into principal components (eigenvectors) and also provides information about how much variance in the data each component explains. Selecting a subset of PCA components reflects a trade-off between 1) model simplicity (we want to include a moderate number of features in our models) and 2) the total variance explained by the component subset. All features were subjected to PCA, except for the 6 features that were used in the glm-6 model described above; those were added to the models in their original (but standardized) form.

We ordered the obtained PCA components by decreasing explained variance of the data. The first component explains 20% of the variance, the second 7%; the first ten components together explain 37%, the first thirty together 42%, and the first five hundred sum up to 66%. We then created two linear models based on reduced feature subsets: the first using 30 PCA components and the second using 500 components (pca-30 and pca-500 in Figure 4). Because many variables in pca-500 have large p-values, we fitted another model that excluded the variables with p-value ≥ 0.5 (pval-05).

4.2 Oversampling
Only about 0.25% of the users in the underlying dataset exhibited default behavior, which makes the dataset very unbalanced. For that reason, we implemented a simple oversampling method on the train set: we multiplied the defaulted users (and their features) by 20. The model using this method, oversampled-20, is also presented in Figure 4. Surprisingly, the oversampled dataset not only does not improve performance, but can be seen to give slightly worse results than the originally unbalanced dataset.

4.3 Evaluation
The model comparison is presented in Table 1. We can see that at the 95% level, recall is high, up to 0.91 for both models based on 500 PCA components. Precision is low for all models due to the unbalanced dataset, but even with that drawback, our models still perform far better than a random model.

Table 1: Recall and precision at the 95% level for each of the models presented in Figure 4.

Model | random | glm-6 | pca-30 | pca-500 | pval-05 | oversampled-20
Recall | 0.05 | 0.13 | 0.79 | 0.90 | 0.91 | 0.86
Precision | 0.003 | 0.007 | 0.042 | 0.049 | 0.049 | 0.046

Figure 4: Prediction results on the test set for the fitted models (false positives vs. false negatives). The y-axis corresponds to the false negatives (clients that were labeled as good but actually defaulted), while the x-axis corresponds to the false positives (good clients labeled as defaulted). Results are shown for probability thresholds 0-1 with step 0.01. pca-500 (purple) is to a great extent covered by pval-05 (green), since both models give very similar results.
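The modeling pipeline of this section can be sketched as follows (a minimal scikit-learn sketch; logistic regression stands in for the paper's linear models, and the variable names are illustrative):

    # Sketch: z-score normalization, PCA on the bulk features, and a linear model.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    def fit_default_model(X_bulk, X_six, y, n_components=30):
        # X_bulk: the ~6048 extracted features; X_six: the six glm-6 predictors,
        # kept in their original (standardized) form; y: 0/1 default label.
        Xb_tr, Xb_te, Xs_tr, Xs_te, y_tr, y_te = train_test_split(
            X_bulk, X_six, y, test_size=0.30, random_state=0)

        scaler_b, scaler_s = StandardScaler().fit(Xb_tr), StandardScaler().fit(Xs_tr)
        pca = PCA(n_components=n_components).fit(scaler_b.transform(Xb_tr))

        def features(Xb, Xs):
            # PCA components plus the standardized glm-6 predictors.
            return np.hstack([pca.transform(scaler_b.transform(Xb)),
                              scaler_s.transform(Xs)])

        model = LogisticRegression(max_iter=1000)
        model.fit(features(Xb_tr, Xs_tr), y_tr)
        return model, model.score(features(Xb_te, Xs_te), y_te)  # test accuracy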
5. CONCLUSION
This paper presents an analysis of mobile phone data using a social network representation and various prediction models to understand default patterns. The analysis of the underlying network reveals a large giant component, such that most nodes have at least some path to any other node in the network. Further, both the nodes within and outside the giant component exhibit relatively high centrality scores, meaning that nodes form tightly connected communities such that the path between nodes is generally quite short. Further, many nodes have a high degree, and the degree distribution exhibits a heavy, power-law-like tail. Using many of these properties as features, we were able to make even more accurate predictive models of default.

Our model evaluation shows that there are many variables that carry weak signals about user behavioral patterns but have strong predictive power when aggregated together. The unbalanced nature of the dataset makes the fitted models have high recall but low precision, yet they strongly outperform the random model in both measures.

There is still a lot of room for improvement in the modeling, including testing more complex oversampling methods, fitting additional models (SVM, LASSO, ANN), and including additional node-level network measures and community detection analysis.

6. ACKNOWLEDGEMENTS
This work was supported by the RENOIR EU H2020 project under Marie Skłodowska-Curie Grant Agreement No. 691152, as well as in part by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (the Network Science CTA), by the Office of Naval Research (ONR) Grant No. N00014-15-1-2640, and by NSF Grant No. DMR-1560266 under the REU Program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.

7. REFERENCES
[1] A. Bavelas. Communication patterns in task-oriented groups. J. of the Acoustical Society of America, 22(6):725–730, Nov 1950.
[2] Y.-A. de Montjoye, J. Quoidbach, F. Robic, and A. Pentland. Predicting personality using novel mobile phone-based metrics. In SBP, pages 48–55, 2013.
[3] L. C. Freeman. Centrality in social networks: conceptual clarification. Social Networks, 1(3):215–239, Jan 1978.
[4] M. S. Granovetter. The strength of weak ties. American J. of Sociology, 78(6):1360–1380, May 1973.
[5] S. Luo, F. Morone, C. Sarraute, M. Travizano, and H. A. Makse. Inferring personal economic status from social network location. Nature Communications, 8:15227, May 2017.
[6] M. E. J. Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E, 64(1):016132, Jun 2001.
[7] J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási. Structure and tie strengths in mobile communication networks. PNAS, 104(18):7332–7336, May 2007.
[8] T. Opsahl, F. Agneessens, and J. Skvoretz. Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3):245–251, Jul 2010.
[9] Y. Rochat. Closeness centrality extended to unconnected graphs: The harmonic centrality index. In ASNA, number EPFL-CONF-200525, 2009.
[10] G. Sabidussi. The centrality index of a graph. Psychometrika, 31(4):581–603, Dec 1966.
[11] V. K. Singh, L. Freeman, B. Lepri, and A. S. Pentland. Predicting spending behavior using socio-mobile features. In SocialCom, pages 174–179. IEEE, 2013.
[12] T. Squartini, F. Picciolo, F. Ruzzenenti, and D. Garlaschelli. Reciprocity of weighted networks. Scientific Reports, 3(1):2729, Dec 2013.
[13] C. Wang, A. Strathman, O. Lizardo, D. Hachen, Z. Toroczkai, and N. V. Chawla. Weighted reciprocity in human communication networks. Aug 2011.
[14] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

Connecting Professional Skill Demand with Supply

Erik Novak (Jožef Stefan Institute; Jožef Stefan International Postgraduate School; Jamova cesta 39, 1000 Ljubljana; erik.novak@ijs.si)
Inna Novalija (Jožef Stefan Institute; Jamova cesta 39, 1000 Ljubljana; inna.novalija@ijs.si)

ABSTRACT
Today's job market demands that job seekers continuously learn new skills. When applying for a job position one must have the required skill set.
If the applicant is missing a skill, it can be learned by attending a course. Finding the appropriate courses can be tedious but necessary work to keep up to date with the job market demand. In this paper, we present a dashboard which connects the job market skill demand with the courses that provide the required skill knowledge. We developed a pipeline for continuous crawling of job postings and courses which feeds the dashboard with the appropriate data. The dashboard allows searching by keywords and returns relevant job postings, courses and basic statistics relevant to the given search query.

General Terms
Job Market, Skill Set, Courses, Lectures, Design

Keywords
Information Retrieval, Data Mining, Analysis, Wikifier, VideoLectures.NET

1. INTRODUCTION
In today's job market the required skills are constantly evolving. This can be seen in the more technical fields such as web development and data science, where new tools and libraries are developed and made available to the public at an increasing rate. This is visible in both the research and industry sectors, where a job position might require a previously unseen skill and the applicant needs to learn it to be qualified. Finding the courses that would provide the skill knowledge can be tedious and does not guarantee its sufficiency.

To this end, we developed a dashboard which connects the job market skill demand with the courses that provide the required skill knowledge. We focused on job positions that require data science skills and on courses that are offered by acknowledged course providers.

Our contributions are a) creating a sizable data set of data-science-related job postings containing the job posting title, description, locations and other information, and b) developing a dashboard which for a given query shows relevant job postings as well as courses and lectures which provide the appropriate skills. The dashboard is updated daily with new job postings, showing the most recent changes. Basic statistics such as the most popular job locations and skills are also shown.

The remainder of the paper is structured as follows. In Section 2 we present related work. Next, data acquisition is explained in Section 3, followed by the presentation of the dashboard in Section 4. Finally, we discuss and conclude our work in Section 5.

2. RELATED WORK
There are multiple blogs that write about the top skills needed for getting a job in data science. One such blog is [14], which lists both the non-technical and technical skills a data scientist should have in the coming years. Another blog [15] lists the top data science skills and the courses where they can be learned. Many of these blogs are not up to date and do not reflect the current state.

A research report [11] writes about connecting supply and demand in Canada's youth labor market. The authors were interested in finding what skills young adults acquired during their education, how employers' demand is conveyed to students and those who support them, and how well the acquired skills are utilized on the job. They presented their results but did not develop an application that would help to narrow the gap between skill demand and supply.

Another report [13] discusses the mismatch between the skills young adults acquire during their education and the skills the companies demand. The authors found that skills are a critical asset for individuals, businesses and societies and that many employers report difficulties in finding suitably skilled workers. Additionally, they find that a sizable qualification mismatch is one of the biggest problems.

The company Year Up [9] helps young adults get the appropriate skills and the needed work experience.
They identify motivated individuals and companies that are prepared to help, send the individuals to learn new skills and afterwards apply the newfound skills at the companies, gaining the critical work experience for their career. The work is done manually, which can be expensive and time consuming.

3. DATA ACQUISITION
Open job positions can be found using job search services. These services aggregate job postings by location, sector, applicant qualifications and skill set or type. One such service is Adzuna [6], a search engine for job ads which mostly covers English-speaking countries. Another service is Trovit [7], a leading search engine for classified ads in Europe and Latin America. The service is available in 13 different languages and provides listings of jobs as well as cars, real estate and other products.

When applying for a job position, the applicant is required to have a certain skill set. If the requirements are not fulfilled, he can enroll in courses to acquire the missing skills. Additionally, watching certain lectures can give a deeper understanding of a particular problem, which can increase the probability of being accepted for a job position. VideoLectures.NET [8] is an award-winning free and open-access educational video lectures repository. It contains videos of individual lectures as well as lectures given at renowned conferences.

Crawling. Since we needed a continuous flow of data, we developed a pipeline for acquiring job postings, courses and lectures. This allows us to provide the dashboard, presented in Section 4, with the most recent data. For job postings we targeted portals like Adzuna, with an emphasis on positions in Data Science, and for courses we targeted different course providers, including Coursera [2], providing courses from top universities, and Hackr.io [4], a service which finds the best online programming courses and tutorials. We also targeted VideoLectures.NET to acquire video lectures containing the Data Science tag. The tags are assigned manually by the VideoLectures team.

For data acquisition and enrichment, we collected data either using dedicated APIs, including the Adzuna API [1], or using custom web crawlers. The data was formatted as JSON to aid further processing and enrichment.
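A single acquisition step against the Adzuna search endpoint can be sketched as follows (the endpoint shape and parameter names follow Adzuna's public API documentation; the credentials, query and pagination values are illustrative):

    # Sketch: fetch one page of job postings from the Adzuna search API as JSON.
    import requests

    def fetch_postings(app_id, app_key, country="gb", page=1, what="data science"):
        url = f"https://api.adzuna.com/v1/api/jobs/{country}/search/{page}"
        resp = requests.get(url, params={
            "app_id": app_id,                  # credentials issued by Adzuna
            "app_key": app_key,
            "what": what,                      # free-text query, e.g. a job title
            "results_per_page": 50,
        })
        resp.raise_for_status()
        return resp.json().get("results", [])  # one JSON document per job posting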
Enriching. The next step of data preprocessing is wikification - identifying and linking textual components to the corresponding Wikipedia pages [16]. This is done using Wikifier [10], which also supports cross- and multi-linguality, enabling the extraction and annotation of relevant information from job postings, courses and video lectures in different languages. Wikification allows us to search for job postings, courses and lectures in multiple languages.

Next, we use the Skill and Recruitment Ontology (SARO) [17] to extract Data Science skills from job postings. For each job posting we match the Wikipedia concepts with the skills found in the SARO ontology and declare the matched concepts as Data Science skills. These skills are then added to the job posting profile.

Finally, to allow searching by locations and countries, the job postings were further enriched using the GeoNames ontology [5] to include the latitude and longitude and the corresponding GeoNames ID and location name.
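The enrichment step can be sketched as follows (a minimal sketch; the Wikifier endpoint and parameters follow the service's public documentation, while the SARO lookup is reduced to a plain set of skill labels for illustration):

    # Sketch: annotate a job posting with Wikifier and keep the concepts that
    # match a set of skill labels taken from the SARO ontology.
    import requests

    WIKIFIER_URL = "http://www.wikifier.org/annotate-article"

    def extract_skills(text, user_key, saro_skills, lang="en"):
        resp = requests.post(WIKIFIER_URL, data={
            "userKey": user_key,   # personal key issued by wikifier.org
            "text": text,
            "lang": lang,
        })
        resp.raise_for_status()
        annotations = resp.json().get("annotations", [])
        concepts = {a["title"] for a in annotations}
        # Declare the Wikipedia concepts that appear in SARO as Data Science skills.
        return concepts & saro_skills

    # skills = extract_skills(posting["description"], "MY_KEY",
    #                         {"Machine learning", "SQL", "Java"})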
Data Set Statistics. The job postings data set contains almost 3.3M job postings acquired over a period of 18 months. Job postings were located in 144 different countries, the majority of them in Europe. Figure 1 shows the fifteen countries with the most found job postings. The UK dominates the other countries with 906k job postings, followed by France with almost 539k.

Figure 1: The fifteen countries with the most found job postings. The greatest number of job postings was found for the UK, followed by France and Germany.

There were 650 unique Data Science skills extracted from the data set. These include soft skills, such as leadership and management, knowledge of a particular domain, such as machine learning and artificial intelligence, and programming languages. Figure 2 shows the most demanded skills in the data set.

Figure 2: The fifteen most demanded skills. They are mostly comprised of high-level skills, such as "database" and "computer science", and programming languages.

The course data set contains over 63k course records, including their titles, descriptions and course providers. The data set is comprised of over 8k courses available online and 55k offline courses. Figure 3 shows the distribution of online courses by course provider. The most courses were acquired from Coursera with above 4k, followed by Hackr.io at 2k.

Figure 3: The distribution of online courses by course provider. The most courses were acquired from Coursera, followed by Hackr.io.

Finally, we acquired a data set of over 20k lectures published on VideoLectures.NET. It contains information about the lectures available on the video repository, including the title, the description and a link to the lecture.

4. DASHBOARD
Our objective is to automatically connect Data Science skill demand with the provided courses. To this end, we developed a dashboard [3] which enables its users to search for their desired job position, find out what the required skill set is, and find the appropriate learning materials and courses to acquire the missing skills. Additionally, the dashboard shows the most demanded skills and hiring locations for the given results. In this section, we present the content retrieval methodology and describe the different components of the dashboard.

Methodology. Here we present the methodology used for retrieving the demand and supply content. The content is retrieved by inserting a query text in the search input. The user may add additional query conditions by selecting Data Science skills, locations, countries and a time interval in which the job postings were published. Upon submission, the query is used to fetch the content that matches the conditions. While all query values are used for retrieving job postings, only the input text and skills are used for retrieving the course and video lecture content. Since courses and video lectures are available online, the location and time interval are irrelevant for retrieving the supply content. To retrieve the content, we first need to set an appropriate index. The job posting data set is indexed by Wikipedia concepts, Data Science skills, locations, countries and published date, while the course and lecture data sets are indexed only by Wikipedia concepts. The query text is sent through wikification to acquire the Wikipedia concepts, which are used for retrieving the relevant content. Next, the additional query conditions are used to filter the content. The remaining content is used to calculate the most demanded skills and hiring locations. Finally, the query results are returned and used to update the dashboard components. This process is implemented using QMiner [12], a data analytics platform for processing large-scale real-time streams containing structured and unstructured data.
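A minimal sketch of this retrieval flow over plain Python data structures (the actual implementation relies on QMiner's indices; the field names are illustrative):

    # Sketch: retrieve job postings matching a wikified query, then aggregate
    # the most demanded skills from the matches.
    from collections import Counter

    def search_postings(postings, query_concepts, skills=None, countries=None):
        # postings: list of dicts with "concepts", "skills" (sets of strings)
        # and "country" keys; query_concepts: set of wikified query concepts.
        hits = []
        for p in postings:
            if not query_concepts & p["concepts"]:
                continue                       # no overlap with the query concepts
            if skills and not skills <= p["skills"]:
                continue                       # all selected skills must be present
            if countries and p["country"] not in countries:
                continue
            hits.append(p)
        top_skills = Counter(s for p in hits for s in p["skills"]).most_common(10)
        return hits, top_skills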
Components. The dashboard is composed of different components. The largest component is a list of job postings. Each job posting is presented by its extracted information, including the Data Science skills extracted from the title and description. Figure 4 shows an example of a job posting in the list. Since Wikifier supports cross- and multi-linguality, the list consists of job postings written in different languages.

Figure 4: Example of a job posting returned by the query "machine learning". Even though the job posting is written in Spanish, the methodology finds it relevant.

If the user does not have the required skill set, it can be acquired by enrolling in the courses shown in the course list. The list shows courses offered by different online course providers that are relevant to the user's input query. Figure 5 shows the component containing the course list. Left and right arrows are used to navigate through the list, where each course is presented by its name and course provider.

Figure 5: A sample of recommended courses for the query "machine learning". Clicking on a course redirects the user to the course provider, where he can enroll.

Additionally, the user can watch lectures to get a deeper understanding of a problem. Similarly to the courses, the video lectures list shows relevant content found on VideoLectures.NET. Clicking a lecture redirects the user to the video lecture homepage.

The dashboard also shows the most demanded skills and a job posting timeline. The timeline shows how the ratio between the queried and all job postings has changed since the start of the year 2016. This also shows the trend of the skill demand in the queried job posting subset. Figure 6 shows the visualizations used to present the skill demand and the timeline.

Figure 6: On the left, the histogram of the ten most demanded skills; on the right, the timeline of the number of job positions, for the query "machine learning". Hovering over a histogram column shows the number of queried jobs demanding the skill.

Finally, a world map shows the most popular hiring locations extracted from the queried job postings. The locations are at first clustered; upon zooming, the clusters divide and the individual locations are shown. Figure 7 shows an example of clustered locations.

Figure 7: The top hundred hiring locations for the query "machine learning". The clusters show the number of locations they contain.

5. CONCLUSION AND FUTURE WORK
In this paper we presented a methodology for automatically connecting skill demand and supply. We acquired sizable job posting and course data sets, developed a methodology for retrieving job postings, courses and lectures relevant to the user query, and created a dashboard for showing the retrieved content.

In the future we wish to improve the data enrichment process by handling skills that are not in the SARO ontology and to add new features and improvements to the dashboard.

6. ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency, the EDSA EU H2020 project (Contract no: H2020-ICT-643937) and the RENOIR EU H2020 project under Marie Skłodowska-Curie Grant Agreement No. 691152.

7. REFERENCES
[1] Adzuna API. https://developer.adzuna.com/. Accessed: 2016-09-07.
[2] Coursera | Online courses from top universities. https://www.coursera.org/. Accessed: 2017-08-22.
[3] European Data Science Academy dashboard. http://jobs.videolectures.net/. Accessed: 2017-08-23.
[4] Find the best online programming courses & tutorials - Hackr.io. https://hackr.io/. Accessed: 2017-08-29.
[5] GeoNames. http://www.geonames.org/. Accessed: 2017-08-23.
[6] Job search - find every job, everywhere with Adzuna. https://www.adzuna.com/. Accessed: 2017-08-23.
[7] Trovit - a search engine for classified ads of real estate, jobs and cars. https://www.trovit.com/. Accessed: 2017-08-23.
[8] VideoLectures.NET. http://videolectures.net/. Accessed: 2017-08-22.
[9] Year Up - closing the opportunity divide. http://www.yearup.org/. Accessed: 2017-08-22.
[10] J. Brank. Wikifier. http://wikifier.org/. Accessed: 2017-08-23.
[11] R. Brisbois, L. Orton, and R. Saunders. Connecting Supply and Demand in Canada's Youth Labour Market. 2008.
[12] B. Fortuna, J. Rupnik, J. Brank, C. Fortuna, V. Jovanoski, M. Karlovcec, B. Kazic, K. Kenda, G. Leban, A. Muhic, et al. QMiner: Data analytics platform for processing streams of structured and unstructured data. In Software Engineering for Machine Learning Workshop, Neural Information Processing Systems, 2014.
[13] World Economic Forum. Matching skills and labour market needs: Building social partnerships for better skills and better jobs. http://www3.weforum.org/docs/GAC/2014/WEF_GAC_Employment_MatchingSkillsLabourMarket_Report_2014.pdf, 2014.
[14] M. Mayo. KDnuggets analytics big data data mining and data science. www.kdnuggets.com/2016/05/10-must-have-skills-data-scientist.html. Accessed: 2017-08-22.
[15] E. Mcnulty. Top 10 data science skills, and how to learn them. http://dataconomy.com/2014/12/top-10-data-science-skills-and-how-to-learn-them/. Accessed: 2017-08-22.
[16] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1375–1384. Association for Computational Linguistics, 2011.
[17] E. Sibarani, S. Scerri, N. Mousavi, and S. Auer. Ontology-based skills demand and trend analysis, July 2016.

Analyzing Raw Log Files to Find Execution Anomalies
A novel approach to analyzing text-based log files to find changes in the execution profile of complex IT systems

Viktor Jovanoski (Carvic d.o.o., Kotnikova 5, Ljubljana, Slovenia; viktor@carvic.si)
Jan Rupnik (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia; jan.rupnik@ijs.si)
Mario Karlovčec (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia; mario.karlovcec@ijs.si)
Blaž Fortuna (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia; blaz.fortuna@ijs.si)

ABSTRACT
Anomaly detection (a.k.a. outlier detection) is the identification of events that do not conform to an expected pattern in a dataset. When applied to the monitoring of modern, complex IT systems, it keeps track of a plethora of incoming data streams. This paper provides an approach that uses the lowest and most unstructured source of data related to an IT system - the raw system log files. Several versions and parametrizations of the basic building blocks are presented to show how different types of anomalies can be extracted from the data. Several experiments on synthetic as well as real-world data show the effectiveness of the algorithm. Special care is taken to keep the model and the resulting alerts interpretable, since detecting an error without a meaningful explanation of its details is of limited use to the end user (the results need to be actionable).

Keywords
Anomaly detection, Outlier detection, Infrastructure monitoring

1. INTRODUCTION
Modern IT systems are getting increasingly complex and distributed. On-line monitoring of their health is becoming critical for their normal operation. The data that is being collected about these systems comes in diverse time series (numerical, categorical, text), is potentially huge in volume and arrives with different latencies. For instance, complex systems often expose metrics about their performance, such as the number of served requests or the size of internal data structures. One can also monitor a system's performance indirectly, by inspecting the CPU load, the network load or database communication patterns.
Quality assurance of input and output data can also be performed, for instance on the number and the size of the data records. Spotting unusual behavior of such systems is crucial, as unhandled execution problems can lead to catastrophic results. The anomalies can be very different in nature, from abrupt changes that occur within a second to a gradual decay of performance that is only observable on a weekly or monthly scale.

Most of the research on anomaly detection has concentrated on outlier detection in numerical time series (e.g. financial time series as in [5], or time series arising from infrastructure monitoring) and discrete-event sequences (e.g. fraud detection as in [4], or cyber-security applications as in [3]). Data representation (the way of encoding the relevant information) is vital to the practical performance of an anomaly detection approach (does it capture relevant events or just insignificant variation?).

However, all of these numeric data series have to be "prepared" in advance. A developer had to think ahead about the potential problems that could arise and expose appropriate measurements. When such data is available, it can clearly signal the problem to the operators. But what happens when a new type of outage occurs and no measure to detect it was put in place? In such cases the infrastructure maintainers typically resort to inspection of raw log files. Writing a line of text to console or file output is often the only indication that something happened, or has not happened when it should have. In our experience, all complex IT systems produce such files, and they are still used to solve the hardest problems and errors.

The log files may be very unstructured in practice and may contain extremely diverse information: errors, warnings, database calls and initialization steps, as well as casual counters and observations. The log lines themselves may be unstructured: their content might be unordered and they may contain text written in natural language (e.g. error messages). Even then, these bits of information may or may not be written in an easy-to-parse form (e.g. JSON format). The only thing that can be expected of each line is a timestamp - a clear indication of the time on the server when this particular line was written to the log. Even if this data is missing from some lines, the sequential order of writing to the file helps us narrow down the possible timestamp for each line: we can simply choose to re-use the last timestamp before that line.
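This timestamp-recovery rule can be sketched as follows (a minimal sketch; the timestamp format is illustrative, as real log files vary):

    # Sketch: attach a timestamp to every log line, reusing the last seen
    # timestamp for lines that do not carry one.
    import re
    from datetime import datetime

    TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")  # illustrative

    def timestamped_lines(lines):
        last_ts = None
        for line in lines:
            m = TS_RE.match(line)
            if m:
                last_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if last_ts is not None:   # lines before the first timestamp are skipped
                yield last_ts, line.rstrip("\n")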
Many algorithms and approaches to anomaly detection have been proposed in the literature; [1] provides an excellent overview of the field. New approaches are getting increasingly more sophisticated at dealing with multidimensional numeric data and with discrete, sequential or even spatial data. The unstructured nature of raw log files makes the problem amenable to text-mining-based approaches. This article presents work in that direction.

2. ANOMALY-DETECTION
End users in charge of maintaining a large IT system will typically be concerned with two types of anomalies. Either they will want to avoid a sudden, critical degradation of performance, or they will want to know whether the performance has been slowly degrading and attempt to prevent that.

Abrupt change - In this scenario, the system "falls off the cliff": performance plummets and multiple parts of the system typically experience severe problems. Users will most often be notified of the problem via many channels. Hence, the raw log files are not the first place they will start looking at. However, if the cause of the problem cannot be determined or the exact timeline of the events of the disaster cannot be established, the information from the log files will also be used as a part of the forensic analysis.

Gradual deterioration - IT systems that have been in a production environment for a long time can experience gradual changes. These changes do not necessarily cause a catastrophic failure overnight, but degrade the performance over a longer period of time. Such changes are very difficult for the programmer to detect, as they are very subtle, e.g. they are only observable on a monthly scale. And often the only place where this can be detected is with a long-term analysis based on the raw log files.

2.1 General pipeline
Algorithm 1 shows the general structure of a typical anomaly-detection system that operates on a stream.

Algorithm 1: General anomaly pipeline
    while input data available do
        parse data and extract record
        calculate anomaly score
        if score above threshold then
            report anomaly
        end if
        add record to model
    end while

First, a data record is extracted from the incoming data. As stated before, the timestamp is always defined. The record can also incorporate recent past data (previous lines of the log file) and contextual data (e.g. overall server status, a holidays indicator).

In the second step a score is calculated that should reflect the record's anomalousness (non-normality, novelty). The simpler the score, the easier the subsequent steps are. Additionally, we prefer models and scoring functions that can easily be explained to the end user, since he is supposed to act on them. Lastly, the score should be constructed in such a manner that the anomalous examples fall on one edge of the spectrum (e.g. the higher the score, the bigger the anomaly).

The score value is then used to decide whether the incoming data record is an anomaly - either by comparing it against some manually predefined static threshold or against a dynamic one, which uses historical score values to determine the threshold autonomously. Finally, in the case of an anomaly, an alert is created that contains enough data to explain what was observed and why it was tagged as an anomaly.

Figure 1: Text-processing anomaly pipeline.

Only when all of the above steps are done do we add the new data point to the model. We can store it internally in the model for some time, but we have no guarantee of ever again being able to access all the historical data (e.g. for retraining the algorithm).
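Algorithm 1 translates directly into a streaming skeleton (a minimal sketch; the parser, the scoring function and the model are deliberately left abstract, and the static threshold stands in for the dynamic variant):

    # Sketch: the general anomaly pipeline of Algorithm 1 as a streaming loop.
    def run_pipeline(stream, parse, score, model, threshold, report):
        for raw in stream:
            record = parse(raw)          # parse data and extract record
            s = score(model, record)     # calculate anomaly score
            if s > threshold:            # static threshold for simplicity; a
                report(record, s)        # dynamic one would use past scores
            model.add(record)            # add the record to the model last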
2.2 Defining normality
It is crucial for any anomaly-detection system to be able to tell what normal is, so that based on this notion it can say that something is not normal, i.e. anomalous. When available, domain knowledge and experience from human experts can be used to capture the appropriate aspect of normality in the data, which greatly improves the quality of the reported anomalies.

In general, however, the system needs the ability to autonomously define normality. This ability greatly depends on the scoring function, and it is common practice to define it in such a way that only the values on the border of the observed values are indicators of an anomaly.

2.3 Detecting anomalies on raw log files
We will now present our approach to anomaly detection based on text-processing techniques. The steps of the approach are given in Algorithm 2 and graphically presented in Figure 1.

Algorithm 2: Log-file anomaly pipeline
    while input data available do
        read data from log file
        parse data and extract BOW
        insert into time windows
        calculate distance in kNN
        if score quantile above threshold then
            create explanation using distance
            report anomaly
        end if
        add record to model
    end while

The log parser processes a single line at a time and each time emits a record with several features. In the second step we extract standard BOW (bag-of-words) feature vectors. There are several possible ways to extract features and to weight each feature dimension. We chose to use a simple representation where we extract tags, such as server=x and process=mytask.tsk. We collect all these tags from the record and assign them weight 1. This means that individual lines can produce vectors of varying lengths, as they are not normalized. We could also normalize them or re-weight them using the TFIDF weighting scheme [6] to down-weight frequent tags.

To capture the wider context of each record, we aggregate the set of records within a time window of predefined length and emit a combined record (a simple normalized sum of all the vectors) that represents that window.
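The tag extraction and window aggregation can be sketched as follows (a minimal sketch; the tag pattern is illustrative):

    # Sketch: extract key=value tags as BOW features and aggregate them into
    # fixed-length time windows (normalized sum of the per-line vectors).
    import re
    from collections import Counter

    TAG_RE = re.compile(r"\b([A-Za-z_]+=[^\s]+)")  # e.g. server=x, process=mytask.tsk

    def line_vector(line):
        return Counter(TAG_RE.findall(line))        # each tag gets weight 1

    def window_vector(lines):
        total = Counter()
        for line in lines:
            total.update(line_vector(line))
        norm = sum(v * v for v in total.values()) ** 0.5
        return {k: v / norm for k, v in total.items()} if norm else {}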
In the fourth step, we use the k-nearest-neighbors algorithm (kNN) to find which k windows from the past are the closest to the current window. This unsupervised algorithm was chosen because it can handle skewed distributions very well.

The average distance to the k neighbors is used as the anomaly score: the further away an instance is from its nearest-neighbor set, the more anomalous it is. The value of the parameter k is usually small, between 1 and 5. The best value depends on the data domain and should be determined by experimenting.

The absolute scale of the score may vary from problem to problem and also depends on the feature representation. For that reason we used a quantile-based approach: we compare the anomaly score of a new instance with the scores of recently observed data points (the size of the set of recently observed data points is controlled with a parameter, the learning window). If the score is higher than a large fraction of the scores (for example 0.999, controlled by a parameter nn_rate), then we classify the instance as an anomaly. The quantile (1 - nn_rate) directly controls the detection rate (0.001 corresponds to classifying 0.1% of instances as anomalies) under the assumption that the data distribution is stationary.

2.3.1 Tokenization
When tokenizing the input texts we have several options. If we know that some common patterns exist for how special entities are marked inside the text, we can extract them and create useful features for the BOW step. For instance, we can use the message text directly to create the BOW. Alternatively, we can just find identifiers of the origin of the message (e.g. process name, method name, class name, page name etc.) and use these instead of the whole texts.

2.3.2 Anomaly explanation
We construct the explanation for an individual alert by finding its nearest neighbor, subtracting one vector from the other and squaring each element of the resulting vector. Each dimension is thus attributed an "anomalousness" score, and the highest-scoring dimensions contributed the most to the distance to the nearest neighbor. The explanation shown to the user contains a sorted list of the highest-scoring dimensions (clipped to avoid information overload).

3. EXPERIMENTAL RESULTS
3.1 Evaluation
In most real-world anomaly-detection cases we receive a dataset that is largely unlabelled. The labels we have usually denote some special catastrophic situations that users experienced and want to avoid in the future. The rest of the data can be assumed to be mostly "normal", but unknown anomalies may also remain in the dataset. The standard metrics for evaluating the performance of an anomaly detector are precision (#true anomalies / #predicted anomalies) and recall (#predicted true anomalies / #true anomalies). Low precision translates into a higher burden on the user who inspects the anomalies (each inspection has some cost), and low recall translates into missing many anomalies and increased risk. If we suspect that the dataset is not labelled completely (unlabelled anomalies are present), the precision might be estimated pessimistically and the recall might be measured optimistically. In such cases, manual inspection of the false positives might lead to discovering new relevant types of anomalies present in the dataset.

3.2 Synthetic data
We generated log files by simulating the parallel execution of several processes, each having a specific pattern of writing to the log. We then manually injected 8 instances of anomalous entries, 2 per week, each pair occurring within one minute. We used 10 days for the kNN learning window length, so there are no anomalies in the first 10 days. For the length of the input-grouping window, we experimented with two settings: 1 minute and 1 hour. In the former setting we have 6 original anomalies, but in the latter we only have 3, since each pair of adjacent anomalies (they are 1 minute apart) gets collapsed into the same hour. We set the parameter k to 1, thus making the explanation of the anomaly very simple.
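The scoring, thresholding and explanation steps can be sketched as follows (a minimal sketch over the sparse window vectors from the previous step, with k defaulting to 1):

    # Sketch: kNN anomaly score with a quantile threshold and a per-dimension
    # explanation against the nearest past window.
    def distance(a, b):
        keys = set(a) | set(b)
        return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys) ** 0.5

    def knn_score(window, history, k=1):
        dists = sorted(distance(window, h) for h in history)
        return sum(dists[:k]) / k          # average distance to the k nearest windows

    def is_anomaly(score, past_scores, nn_rate=0.001):
        # Anomalous if the score exceeds the (1 - nn_rate) quantile of past scores.
        past = sorted(past_scores)
        cut = past[int((1.0 - nn_rate) * (len(past) - 1))]
        return score > cut

    def explain(window, nearest, top=10):
        # Squared per-dimension differences, largest contributors first (clipped).
        keys = set(window) | set(nearest)
        contrib = {k: (window.get(k, 0.0) - nearest.get(k, 0.0)) ** 2 for k in keys}
        return sorted(contrib.items(), key=lambda kv: -kv[1])[:top]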
Table 1: Synthetic data - window = 1 min.

nn_rate | Precision | Recall
0.003 | 0.06 | 1.00
0.001 | 0.15 | 0.66
0.0005 | 0.23 | 0.66

Table 2: Synthetic data - window = 1 h.

nn_rate | Precision | Recall
0.05 | 0.07 | 0.66
0.03 | 0.14 | 0.66
0.01 | 0.33 | 0.33

Table 1 shows the results for windows of length 1 minute. When the parameter nn_rate, which controls the sensitivity to outliers, was set to 0.003, the method correctly detected all 6 manually inserted anomalies. Table 2 shows the results for windows of length 1 hour. This granularity of the data is too coarse and hides certain anomalies, which remain undetected.

3.3 Web-server logs
We analyzed browsing-pattern logs from a production web server, where the logs contained information on web-page and file requests. The specific feature of this web site is that it is almost completely static - there are almost no new pages being added to it, so the browsing patterns should be relatively constant. The dataset spanned a period of one month. We set the summarization window length to 5 minutes and the kNN comparison window length to 10 days. The parameter k was again set to 1 and the anomaly rate was set to 0.001. The parameters were hand-tuned.

When analyzing such data, the anomalies that we are interested in should capture both system failures (malfunctioning software, infrastructure failure) as well as malevolent behavior (a denial-of-service (DoS) attack, hacker-induced scanning for exploits). In these cases the system should ideally produce anomalies with strong dimensional outliers.

We manually inspected the anomalies where the strongest dimension contributed more than 30% of the total anomaly score. An example is shown in Table 3, where the column Val means the value in the current record, Near means the value in the nearest record, and Contr means the contribution to the total distance. It turns out that these anomalies corresponded to the rare events when new content was added to the web page. No DoS or hacker attacks were detected in the observed time period. For example, the vector dimension for the file mission-sts-68.html has Val = 0.707; this file was not present in the nearest record (Near = 0) and this dimension contributed 0.354 of the total distance between this record and the nearest one.

Table 3: Web logs - an anomaly explanation example.

File | Val | Near | Contr
/shuttle/missions/sts-68/mission-sts-68.html | 0.707 | 0 | 0.354
/images/NASA-logosmall.gif | 0 | 0.505 | 0.180
/htbin/cdt_main.pl | 0 | 0.416 | 0.122
4. CONCLUSIONS AND FUTURE WORK
We presented a novel combination of known approaches to anomaly detection using techniques developed in the field of text mining. Using these, we were able to extract valuable information from raw textual log files that are normally only used for manual inspection and forensic analysis.

Our algorithm is based on processing log files, but many other sources of information can be used to extract anomalies in modern IT systems. Our long-term goal is to design what we call a Full-spectrum anomaly detection system (FSADS) that will be able to import many different types of data streams, covering a wide range of aspects of an IT system (inputs, outputs, internal performance, database performance, network communications etc.). After each single source of data is analyzed and the anomalies are extracted, the next step in FSADS will be to correlate them, determine the critical signals (which anomalies have a high impact on the system?), indicate possible root causes (what might have caused a particular anomaly?) and give predictions (what may follow after detecting a particular type of anomaly?).

The popularity of deep learning techniques ([7]) is also felt in the anomaly-detection field, and we plan to study their application to multivariate analysis. Most often, autoencoders are used to create compact descriptions of the data and may also be used to highlight the dimensions with a high reconstruction error. Another interesting approach is to use generative adversarial networks ([2]), where two neural networks contest in a zero-sum game: one network generates "normal" candidates and the other discriminates between generated examples and true examples. We currently see two major challenges for the broader use of deep neural networks in anomaly detection systems. The first one is the ability to explain the results to the end user in an actionable way. The second one is the processing of stream data and updating the model on the fly. Currently, existing and simpler techniques (e.g. kNN, clustering, statistical tests) provide much better support for these requirements. However, the linearity in describing the feature space is an issue that we would like to address in the future, and deep neural networks present a promising approach.

5. REFERENCES
[1] C. C. Aggarwal. Outlier Analysis. Springer, New York, 2013.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[3] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM '04, pages 219–230, New York, NY, USA, 2004. ACM.
[4] X. Liu, P. Zhang, and D. Zeng. Sequence matching for suspicious activity detection in anti-money laundering. In C. C. Yang, H. Chen, M. Chau, K. Chang, S.-D. Lang, P. S. Chen, R. Hsieh, D. Zeng, F.-Y. Wang, K. M. Carley, W. Mao, and J. Zhan, editors, ISI Workshops, volume 5075 of Lecture Notes in Computer Science, pages 50–61. Springer, 2008.
[5] C. Phua, V. C. S. Lee, K. Smith-Miles, and R. W. Gayler. A comprehensive survey of data mining-based fraud detection research. CoRR, abs/1009.6119, 2010.
[6] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, Nov. 1975.
[7] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang. Deep structured energy based models for anomaly detection. ArXiv e-prints, May 2016.

Usage of SVM for a Triggering Mechanism for Higgs Boson Detection

Klemen Kenda (Jožef Stefan Institute, Artificial Intelligence Laboratory; Jožef Stefan International Postgraduate School; Jamova 39, 1000 Ljubljana, Slovenia; klemen.kenda@ijs.si)
Dunja Mladenić (Jožef Stefan Institute, Artificial Intelligence Laboratory; Jožef Stefan International Postgraduate School; Jamova 39, 1000 Ljubljana, Slovenia; dunja.mladenic@ijs.si)

ABSTRACT
Real-time classification of events in high energy physics is essential to deal with the huge amounts of data produced by proton-proton collisions in the ATLAS detector at the Large Hadron Collider at CERN. With this work we have implemented a machine-learning-based triggering mechanism for saving relevant data.
In comparison with the state-of-the-art machine learning methods (gradient boosting and deep neural networks), the shortcomings of Support Vector Machines (SVM) have been compensated for with extensive feature engineering. The method has been evaluated with a special metric (approximate median significance) suggested by the domain experts. Our method achieves significantly higher precision and an 8% lower approximate median significance than the current state-of-the-art method used at the ATLAS detector (XGBoost).

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining, scientific databases

General Terms
Algorithms, Measurement, Experimentation.

Keywords
Support Vector Machine, Gradient Boosting, Classification, High Energy Physics, Higgs Boson

1. INTRODUCTION
The ATLAS and CMS experiments announced the discovery of the Higgs boson in 2012 [1]. The experiments were conducted at the Large Hadron Collider (LHC) at CERN in Geneva. The discovery was followed by a Nobel Prize in Physics, awarded to François Englert and Peter Higgs. The existence of the particle, which gives mass to other elementary particles, had been predicted around 50 years earlier [6][7][8].

The Higgs boson decays almost instantly and can be observed only through its decay products. Initially the particle was observed through the $H \to \gamma\gamma$, $H \to Z^0 Z^0$ and $H \to W^+ W^-$ decays. These decays leave a signature that is relatively easy to interpret. The next steps required the analysis of Higgs boson decays into fermion pairs: τ leptons or b quarks.

In this paper we focus on a special topology of the $H \to \tau^+ \tau^-$ decay [9]. Due to similarities with other decays, this particular decay is very difficult to classify. Distinguishing background (events that do not belong to the $H \to \tau^+ \tau^-$ decay) from signal (events that belong to the Higgs boson decay) requires the use of state-of-the-art machine learning methods.

In the past the task has often been solved with simple cut-off techniques based on statistical analysis, performed by expert users. Today, advanced classification methods based on machine learning are used regularly. State-of-the-art methods for this type of problem include deep neural networks and gradient boosting [10][11][12]. Experiments at CERN prefer the usage of gradient boosting classifiers, as they are able to evaluate large amounts of data (more than 20 × 10^6 events/s) [4].

The success of both methods is based on their intrinsic property of introducing non-linearity into the system. In our work we want to compare basic linear methods and Support Vector Machines (SVM) with different kernels to the state-of-the-art models. Additionally, we want to enrich the data by intensive feature engineering. The results of the feature engineering can be used for further physical interpretation of the relevant physical phenomena.

2. DATA
The dataset was made public by the ATLAS collaboration for the Higgs Boson Machine Learning Challenge on Kaggle in 2014 [3]. It contains data from the ATLAS detector simulator (real labelled data would be impossible to obtain). The winning method from the challenge is being used in the ATLAS experiment today [4].

Figure 1. Distribution of signal (yellow) and background (green) according to the most informative attribute, DER_mass_MMC (mass of the Higgs boson candidate) [5].

2.1 Data Description
The dataset consists of 250,000 instances: 85,667 represent signal and 164,333 represent background. Each instance consists of 32 attributes and 1 target variable. All the attributes are numerical (continuous); the target variable is nominal (binary). Two of the attributes should not be used for classification purposes, as they represent the id of the instance and the probability of such an event happening in the experiment [4]. There are missing values in the data: 11 attributes could not always be measured due to the characteristics of the detector. The distribution of the missing values is different for signal and for background.
The signal is limited to the events representing only one possibility for the $\tau^+ \tau^-$ pair decay [4].

2.2 Data Understanding
The main task of our method is to separate the signal from the background, based on the ATLAS detector measurements. As vast amounts of data (a few terabytes per day) are generated within the process, it is crucial that only the relevant events are detected and stored [4].

Exploratory analysis has shown (see Figure 1) that this task cannot be successfully accomplished with simple cut-off techniques based on a single attribute. Figure 2, depicting a plot of PCA components, is a bit more promising, as parts of the phase space can clearly be assigned to one of the classes.

Figure 2. Plot of the 1st PCA component against the 3rd. Red dots represent signal instances, green dots represent background instances [5].

The attributes are divided into 3 groups: the first group contains 18 primary attributes (measured in the detector), the second group contains 12 derived attributes (relevant physical quantities calculated from the primary attributes), and the last group contains 2 metadata values (weight and event id). A detailed exploratory data analysis can be found in [5].

2.3 Data Preprocessing
The ATLAS detector enables good precision of all measurements, therefore the expected noise in the data is very small and cannot be further filtered. Missing values have been dealt with in two different ways: firstly, we used a "replacement with average" strategy to fill in the missing data, and secondly, we generated additional binary features representing missing attribute values. SVM expects input data to be normalized, therefore the features have been standardized to zero mean and unit standard deviation.

2.4 Feature Engineering
The main task of our work consisted of extensive feature engineering, where non-linear combinations of features were introduced to overcome the shortcomings of linear SVM in comparison with gradient boosting or deep neural networks. We built new features from the original attributes by transforming them with some common functions like $e^x$, $x^2$, $x^3$, $\sqrt{x}$ and $\log(x)$. Additionally, we used k-means clustering to generate an additional attribute (cluster id). All the generated feature sets are shown in Table 1.

Table 1. Attribute sets used for SVM.
Set  Description
1    Original feature set.
2    Added missing-value features.
3    Filtered missing values and all $e^x$ derivatives.
4    Filtered missing values, $e^x$ and all $x^2$ derivatives.
5    Filtered missing values, $e^x$, $x^2$ and all $x^3$ derivatives.
6    Filtered missing values, $e^x$, $x^2$, $x^3$ and all $\sqrt{x}$ derivatives.
7    Filtered missing values, $e^x$, $x^2$, $x^3$, $\sqrt{x}$ and all $\log(x)$ derivatives.
8    Selection of the most relevant transformations by one attribute.
9    Unfiltered set of transformations by one attribute.
10   Unfiltered set of $x_i x_j$.
11   Set of attributes by one of the HiggsML winners (Tim Salimans, DNN).
12   Unfiltered set of $x_i^2 + x_j^2$.
13   Unfiltered set of $e^{x_i^2 + x_j^2}$.
14   Unfiltered set of $\sqrt{x_i^2 + x_j^2}$.
15   Unfiltered set of $(1 + x_i x_j)^2$.
16   Filtered set of transformations by 1 and 2 attributes.
17   (8) with k-means cluster id.

Filtering of the features has been done manually, with a simple cut-off technique based on feature importance as obtained from the linear SVM model. A sketch of the feature construction is given below.
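To make the construction concrete, the following is a minimal sketch, assuming hypothetical column names, of how the one-attribute derivatives, the missing-value indicator features and the k-means cluster id from Table 1 could be produced with Pandas, NumPy and scikit-learn (the libraries named in Sections 3 and 4). The guards for $\sqrt{x}$ and $\log(x)$ on non-positive values and the number of clusters are our assumptions, not the authors' exact choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def engineer_features(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """One-attribute transformations (e^x, x^2, x^3, sqrt(x), log(x))
    plus binary missing-value indicators for each attribute."""
    out = df.copy()
    for c in cols:
        x = df[c]
        out[c + "_missing"] = x.isna().astype(int)    # set (2): missing-value features
        out[c + "_exp"] = np.exp(x)
        out[c + "_sq"] = x ** 2
        out[c + "_cube"] = x ** 3
        out[c + "_sqrt"] = np.sqrt(x.clip(lower=0))   # guard against negative values
        out[c + "_log"] = np.log(x.where(x > 0))      # guard against non-positive values
    return out.fillna(out.mean(numeric_only=True))    # "replacement with average"

# Standardize (SVM expects normalized inputs) and add a cluster-id attribute.
features = engineer_features(df, numeric_cols)        # df, numeric_cols: assumed inputs
X = StandardScaler().fit_transform(features)
features["cluster_id"] = KMeans(n_clusters=10, random_state=0).fit_predict(X)
```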
3. MACHINE LEARNING METHODS USED
Baseline experiments have been carried out with simple cut-off techniques and linear methods like logistic regression and the Naïve Bayes classifier. As state-of-the-art methods we included gradient boosting and gradient boosting adjusted for the approximate median significance metric (see Section 3.2) [11].

We propose to use the SVM method [12]. Linear SVM can also be used for feature selection with a large number of attributes, as it can discover the most relevant features in a large feature set. Data transformation has been handled with the Pandas library in Python.

3.1 Brief Description of SVM
In our setting we are solving a binary classification problem. Let us assume that the classes are linearly separable in our space. In general, there are many different hyperplanes that can separate the two classes. The support vector machine (SVM) method is also called a maximum margin classifier: there exists exactly one hyperplane that maximizes the margin between the two classes [12].

Figure 3. Maximum margin of the dividing hyperplane in SVM [5].

The SVM classifier is derived from maximization of the margin, which can be translated into minimization of $\|w\|^2$ [5][12]. As we are dealing with data sets where the classes are not separable, we need to consider a soft margin method that takes the classification error into account. SVM therefore solves the minimization problem

$\min \; \|w\|^2 + C \sum_{i=1}^{n} \xi_i$,

where $\xi_i$ is a classification error term and $C$ is a parameter that controls the influence of $\xi$.

3.2 Brief Description of the Evaluation Criteria
Evaluation of the results is done with measures derived from the confusion matrix (accuracy, precision, recall, $F_1$). The domain-specific evaluation metric, approximate median significance (AMS), is defined as

$AMS = \sqrt{2 \left( (s + b + b_{reg}) \ln \left( 1 + \frac{s}{b + b_{reg}} \right) - s \right)}$,

where $s$ represents the sum of event probabilities of true positives (signal), $b$ represents the sum of event probabilities of background events that pass the selection, and $b_{reg}$ is a pre-set regularization parameter set to 10. The metric favors recall over precision. In a real setting this algorithm is used as a triggering mechanism for saving relevant data; the probability of a positive example in the real data is only around $p \approx 2 \times 10^{-5}$, therefore we do not want to lose many of them.
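A direct NumPy transcription of the AMS definition above, assuming arrays of labels, per-event weights and a boolean selection mask (the variable names are ours, not the authors'):

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    """Approximate median significance for the weighted sums s (selected
    signal) and b (selected background), with b_reg = 10 as above."""
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))

def ams_score(y, weights, selected):
    """y: 0/1 labels, weights: per-event weights, selected: boolean mask
    of events classified as signal."""
    s = weights[selected & (y == 1)].sum()
    b = weights[selected & (y == 0)].sum()
    return ams(s, b)
```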
4. EVALUATION
Experiments have been carried out in Python. Data loading and cleaning has been accomplished with the Pandas library; the implementation of SVM, scaling and other methods has been taken from the scikit-learn package [13]. Default parameters for SVM have been used. On our system the SVM learning phase took ~1 hour, so for time optimization purposes a plain evaluation with a training and a test set has been performed: the training set consisted of 225,000 and the test set of 25,000 instances.

Results of the extensive feature engineering are shown in Table 2. Linear SVM performed similarly to the linear baseline methods (logistic regression, Naïve Bayes), with an AMS score of ~2.00 on the basic feature set. The best feature sets for linear SVM were (10), (12), (13) and (14); these feature sets include two-attribute transformations, e.g., $x_i x_j$. It is interesting to notice that the filtered feature sets performed slightly worse. Extensive feature generation achieved almost 30% better AMS results (1.999 on the basic feature set compared to 2.583).

Table 2. Evaluation of different attribute sets on SVM with linear kernel.
Attribute set   Prec.   Rec.    Acc.    F1      AMS
1               0.665   0.548   0.749   0.600   1.999
3               0.748   0.655   0.805   0.698   2.526
4               0.748   0.654   0.805   0.698   2.528
5               0.740   0.657   0.802   0.696   2.478
6               0.743   0.683   0.809   0.712   2.547
7               0.734   0.690   0.807   0.711   2.516
8               0.732   0.670   0.802   0.700   2.482
10              0.744   0.705   0.815   0.724   2.582
11              0.694   0.584   0.768   0.634   2.201
12              0.744   0.705   0.815   0.724   2.583
13              0.744   0.709   0.816   0.726   2.581
14              0.744   0.705   0.815   0.724   2.583
15              0.744   0.710   0.816   0.726   2.578
16              0.740   0.684   0.809   0.711   2.553

Table 3 contains the results of the baseline, state-of-the-art and proposed SVM methods; the best feature sets for the selected methods were chosen. The baseline methods are a simple window (a cut-off technique on the candidate particle mass) and logistic regression. As state-of-the-art methods we included gradient boosting (GBC) and the current state of the art (XGBoost, gradient boosting optimized for AMS). The proposed methods are linear SVM, SVM with RBF kernel (SVM-r) and optimized SVM with RBF kernel (opt. SVM-r).

Table 3. Evaluation of different methods and attribute sets compared to baseline and state-of-the-art methods.
Method and attribute set   Prec.   Rec.    Acc.    F1      AMS
simple window              0.560   0.824   0.716   0.667   1.579
log. reg. (1)              0.668   0.535   0.749   0.594   2.015
SVM-LIN (13)               0.744   0.709   0.816   0.726   2.581
GBC (8)                    0.787   0.703   0.832   0.742   2.856
SVM-r (8)                  0.791   0.718   0.837   0.752   2.940
opt. SVM-r (8)             0.907   0.446   0.793   0.598   3.451
XGBoost (1)                0.665   0.806   0.793   0.729   3.735

The usage of kernels (RBF and polynomial kernels have been tested) improved the AMS score by another ~15%. Because of the nature of SVM kernels in this setting, 2-attribute transformations were less efficient than 1-attribute transformations; the selection of the most relevant transformations by 1 attribute (set (8)) gave the best results. The method behaved better than the gradient boosting classifier (GBC) on the same training set, although neither method was optimized to maximize the AMS score. The difference nevertheless suggests that the usage of SVM might be a promising way to proceed.

Finally, we optimized the SVM with RBF kernel for the AMS score and compared it to the XGBoost method, which implements gradient boosting optimized for AMS. The optimization has been done by tuning the threshold on the SVM confidence score. Our method performs approximately ~8% worse than the state of the art. There is, however, a big difference with XGBoost: our method yields higher precision than the other methods while still preserving a very high AMS score. The proposed method also performs ~20% better than other SVM-based methods reported in the HiggsML Challenge [3].
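The threshold optimization described above can be sketched as a scan over the SVM confidence scores: sort events from most to least signal-like, accumulate the weighted signal and background sums, and keep the cut with the highest AMS. This is our reconstruction of the idea, not the authors' published code.

```python
import numpy as np

def optimize_threshold(scores, y, weights, b_reg=10.0):
    """scores: SVM decision_function values; returns the confidence
    threshold that maximizes AMS and the AMS achieved."""
    order = np.argsort(scores)[::-1]                  # most signal-like first
    s = np.cumsum(weights[order] * (y[order] == 1))   # signal selected so far
    b = np.cumsum(weights[order] * (y[order] == 0))   # background selected so far
    ams = np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))
    best = int(np.argmax(ams))
    return scores[order][best], ams[best]

# Usage with an RBF-kernel SVM (parameters assumed, not the tuned ones):
# from sklearn.svm import SVC
# clf = SVC(kernel="rbf").fit(X_train, y_train)
# threshold, best_ams = optimize_threshold(clf.decision_function(X_test), y_test, w_test)
```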
5. CONCLUSION
In our work we have examined the potential of SVM for a triggering mechanism in the high-energy physics domain. With extensive feature engineering we have also provided an interesting input for high energy physics experts, as the most effective generated features could be analyzed through domain knowledge. Our method achieves a more than 200% better AMS score compared to cut-off techniques based on a statistical approach. Further, our method achieves a ~20% better AMS score than other SVM-based methods reported by HiggsML Challenge competitors, but performs ~8% worse than the current state of the art (XGBoost). There is, however, a significant difference between our method and the state of the art: although achieving a comparable AMS score, our method achieves much better precision. This might make SVM-based methods valuable members of an ensemble.

Beside adding SVM methods to ensembles and trying to improve the state of the art, further work could be done on adapting the SVM optimization to the AMS metric. In our work, features were selected based on weight importance; often different transformations of the same attributes have been selected, and features that could improve our models only a little have potentially been left out. This should be studied further. Optimization of the SVM parameters should also be performed.

6. ACKNOWLEDGMENTS
This work was partially supported by the Slovenian Research Agency.

7. REFERENCES
[1] G. Aad et al. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1):1–29, 2012.
[2] S. Chatrchyan et al. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B, 716(1):30–61, 2012.
[3] HiggsML challenge. https://www.kaggle.com/c/higgs-boson, 2014. [Online; accessed April 20, 2016].
[4] C. Adam-Bourdarios, G. Cowan, C. Germain, I. Guyon, B. Kégl and D. Rousseau. The Higgs boson machine learning challenge. In Workshop on High-energy Physics and Machine Learning, HEPML 2014, held at NIPS 2014, Montreal, Quebec, Canada, December 8-13, 2014, pages 19–55.
[5] K. Kenda, T. Podobnik, A. Gorišek. Uporaba metod strojnega učenja pri analizi podatkov, zajetih z detektorjem ATLAS. Diploma thesis, 2016. Faculty of Mathematics and Physics, University of Ljubljana.
[6] P. W. Higgs. Broken symmetries, massless particles and gauge fields. Physics Letters, 12:132–133, September 1964.
[7] F. Englert, R. Brout. Broken symmetry and the mass of gauge vector mesons. Physical Review Letters, 13:321–323, August 1964.
[8] P. W. Higgs. Broken symmetries and the masses of gauge bosons. Phys. Rev. Lett., 13:508–509, October 1964.
[9] G. Aad et al. Evidence for the Higgs-boson Yukawa coupling to tau leptons with the ATLAS detector. JHEP, 04:117, 2015.
[10] T. Chen, T. He. Higgs boson discovery with boosted trees. In HEPML 2014, held at NIPS 2014, pages 69–80.
[11] T. Chen, C. Guestrin. XGBoost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016.
[12] B. E. Boser, I. M. Guyon, V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, p. 144.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

A methodology to evaluate the evolution of networks using topological data analysis

Joao Pita Costa*,** and Tihana Galinac Grbac*

Abstract—Networks are important representations in computer science to communicate structural aspects of a given system of interacting components. The evolution of a network has several topological properties that can provide us information on the network itself. In this paper, we present a methodology to compare the topological characteristics of the evolution of a network, encoded into a (persistence) diagram that tracks the lifetimes of those features. This enables us to classify the evolution of networks based on the distance between the diagrams that represent such network evolution. In addition, we consider complex vectors, which bring a complementary perspective to the distance-based classification that is closer to the computational methods, aim to enhance the computational efficiency of those comparisons, and are by themselves a source of open research questions.

Fig. 1. The evolving network ScienceAtlas of the collaborations in scientific works by Slovenian researchers, evolving over a 9-year period with 3-year leaps. Each node represents an author and each edge represents a collaboration. The nodes with degrees smaller than 20 are filtered out so that the networks are not too large to be visualized.
I. INTRODUCTION

A. Comparing the topology of the evolution of networks

Networks that change as a function of time, known as evolving networks, are a natural extension of undirected graphs (i.e., standard static networks). Almost all real world networks evolve over time, either by adding or removing nodes or edges. Scientific collaboration analysis, as in the example of Figure 1, shows such a network.

The analysis of the evolution of a network is of interest transversally to many fields of knowledge, from social network analysis and scientific collaboration to computational biology. A standard example is the network dynamics of a social network such as Twitter, which should consider an evolution through time where new nodes come up as new members join, and new edges are created mirroring the new relationships between members [1]. Often all of these processes occur simultaneously in social networks.

A biological network, on the other hand, is an approximate mathematical representation of connections found in ecological, evolutionary, and physiological research, among others. An example of a relevant application of such analysis of biological networks with respect to human diseases is network medicine. It considers networks in biological systems containing many components connected within complicated relationships but organized by simple principles [1].

Fig. 2. An example of an evolving network based on an undirected graph that is growing over time by adding new edges and nodes at each step (panels t = 1 to t = 4).

Collaborative networks are a prime example of evolving networks, where nodes represent authors and edges represent scientific collaborations. This is illustrated in Figure 1, which shows the plot of three phases of an instance of the scientific community in Slovenia [13] using ScienceAtlas, a web portal available at scienceatlas.ijs.si integrating data about 35272 researchers, 5716 projects, 82905 publications and 17190 video lectures. It also allows visualizing the collaboration and competences of the researchers [14].

In this paper, we focus on the comparison of the evolution of two (or more) given networks. Our approach considers topological data analysis (TDA), allowing us to encode the topological features of the corresponding evolving networks into diagrams, and to use standard methods to compute distances between them. In that, we can classify networks according to the distance between the topology of their evolution.

* University of Rijeka, Croatia; ** Quintelligence, Ljubljana, Slovenia
Category: G.2.2: Mathematics of Computing: Discrete Mathematics applications
Keywords: Network, undirected graph, persistent homology, computational topology, persistence diagram.

The TDA approach to the study of networks is not itself new; it has had several widespread applications, from collaboration networks [3] to functional brain networks [15].
Fig. 3. The computation of persistent homology on a simplicial complex changing in time [3]. The colors correspond to the topological features whose lifetime is tracked in the persistence barcode below; the Betti numbers of the six snapshots along the threshold axis are β0 = 2, 3, 2, 1, 1, 1 and β1 = 0, 0, 0, 0, 1, 0. In general, β0 indicates the number of connected components (dimension zero), β1 the number of holes (dimension one), and β2 the number of tunnels and voids (dimension two).

There are several ways of considering a height function in a network, including: (i) considering weights on the edges of the network (a weighted network) and then having the function built by thresholding those weights [17]; (ii) measuring the distance from each node to each other node by counting the minimal number of edges between them and then building the height function based on that distance [11]; among others. This permits us to use persistent homology over such a height function. Another possibility is to consider the maximal cliques as the simplicial complexes (named clique complexes) that feed the persistence algorithm and proceed with the computation directly over them [5]. We used the latter approach to compute the persistence of the networks generated for the purpose of this paper, as sketched below.
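As a minimal sketch of the clique-complex approach, assuming the NetworkX and GUDHI Python libraries are available (the paper itself used Perseus and the R TDA package): each phase's maximal cliques are inserted into a filtration with the phase index as filtration value, so a simplex is born at the first phase in which its clique appears.

```python
import networkx as nx
import gudhi

def clique_complex_persistence(snapshots):
    """snapshots: list of networkx graphs, one per phase of the
    evolving network."""
    st = gudhi.SimplexTree()
    for t, g in enumerate(snapshots, start=1):
        for clique in nx.find_cliques(g):   # maximal cliques of phase t
            # insert() keeps the earlier filtration value of an
            # already-present simplex, so each simplex records the
            # phase of its first appearance
            st.insert(sorted(clique), filtration=t)
    return st.persistence()                 # [(dim, (birth, death)), ...]
```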
B. Basic notions in persistent homology

Topology is the field of mathematics concerned with the qualitative aspects of an object. It focuses on the properties that are preserved through deformations, twistings, and stretchings of the given continuous objects in multidimensional scenarios. Computational topology takes advantage of simplification methods (e.g. the triangulation of a space) to permit the computation of topological invariants. One of those computations is homology, which evaluates the connectedness of, e.g., a network at each dimension separately. Thus, homology is a natural choice when it comes to the study of the topology of a network. Now, if we consider a monotone function describing the time variable in, e.g., an evolving network, we can track its homology changes. This notion is known as persistent homology and is rooted in TDA, allowing for retrieving the essential topological features of an object [2].

Formally, persistent homology computes the topological features of a growing sequence of spaces $\emptyset = X_0 \subseteq X_1 \subseteq \ldots \subseteq X_n = X$, known as a filtration of the space $X$. $H_i(X)$ is the $i$-th homology group of $X$, with an associated $i$-th Betti number $\beta_i$ corresponding to the measure of connectedness in the $i$-th dimension (cf. [5]). Using the inclusion maps $X_j \to X_{j+1}$ we can identify copies of $\mathbb{Z}_2$ in the homology groups $H_i(X_j)$ and $H_i(X_{j+1})$ of a filtration and track where the homology changes. We do that by recording when a new copy appears (i.e. "is born"), and when an existing copy persists or merges into an existing one (i.e. "dies"). The persistence of a topological feature is tracked by a lifetime bar (as shown in Figure 3) that can be equivalently represented by an ordered pair $(x, y)$, where $x$ is the birth time and $y$ is the death time. The multiset of all such points lives in the subset of the plane defined by $0 < x < y$; it encodes the topology of a space and is known as a persistence diagram. Several topological features can have the same lifetimes, and therefore some of the points in the persistence diagram are repeated in the multiset; we refer to their amount as multiplicity. We consider the infinite points on the diagonal as points of the persistence diagram with null lifetime.

The standard method to compare two persistence diagrams, called the bottleneck distance, measures the cost of finding a correspondence between their points. It identifies the closest matching elements of each persistence diagram and determines the global distance as the biggest of those distances. The cost of taking a point $p = (p_1, p_2)$ to a point $q = (q_1, q_2)$ in $\mathbb{R}^2$ is given by the $L_\infty$ norm $\|p - q\|_\infty = \max(|p_1 - q_1|, |p_2 - q_2|)$. Then, the bottleneck distance between persistence diagrams $X$ and $Y$ is computed by taking the infimum over all such matchings, i.e.,

$d_B(X, Y) = \inf_{\eta} \sup_{x \in X} \|x - \eta(x)\|_\infty$,

where the infimum is taken over all bijections $\eta$ from $X$ to $Y$. Each point with multiplicity $k$ in a multiset is interpreted as $k$ individual points, and the bijection is interpreted between the resulting sets [4].

C. The motivation of EVOSOFT

Nowadays, software systems are starting to interconnect to provide new and innovative applications and services, which drives new development opportunities in all domains. These software systems have therefore gradually evolved into large scale complex systems, and we lack models for their further management and evolution. One of the key aspects of such systems is the ability to model and predict their behaviour so as to achieve the required quality of operations and fulfill human expectations in all domains. In that, the project Evolving Software Systems: Analysis and Innovative Approaches for Smart Management (EVOSOFT) aims to understand how abstract software structures can be used to model global system properties (e.g. fault distributions). Understanding how to use software structure to model fault distributions can help us improve system reliability. EVOSOFT observes software structure as networks with nodes representing various software functions that are interconnected by function calls. In particular, a software graph structure considers nodes as program functions (e.g. classes in the object oriented paradigm, functions or modules in functional programming) and edges as function calls or signals transferred in communication among these program functions. EVOSOFT aims to observe how large software systems evolve from version to version, and to understand the relationship between the change in software structure during its evolution and the change in software fault distributions across its structure. Previous empirical studies in [9], [10], [18] show that communication structures among the program functions significantly influence system fault distributions. This is what motivated us to further explore this relationship.

To compare the evolution of networks we consider the distance between the corresponding persistence diagrams, using the bottleneck distance. This permits a fast computation of the distance between the (persistence diagrams representing the) topology of two evolving networks. In the case of the persistence diagrams encoding the topology of the evolving networks A, B and C represented in Figure 5, we get d(A, B) = 0 and d(B, C) = d(A, C) = 1. This discards the points with infinite persistence, which are less relevant when considering dimension-1 diagrams. The computations were done using the TDA package available in R [7]. In this example we can explore the distance between several possible evolutions of a network; it shows how TDA can contribute to a better understanding of the behavior of a certain network.
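The same comparison can be reproduced with the bottleneck distance implementation in the GUDHI Python bindings (the paper used the R package TDA). The three toy diagrams below are assumed from the complex-polynomial running example of Section IV, with infinite-persistence points already discarded:

```python
import gudhi

diag_A = [(1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]
diag_B = [(1.0, 3.0), (2.0, 4.0), (3.0, 4.0)]  # same evolution as A
diag_C = [(1.0, 3.0), (2.0, 4.0), (2.0, 4.0)]  # point (2, 4) with multiplicity 2

print(gudhi.bottleneck_distance(diag_A, diag_B))  # 0.0
print(gudhi.bottleneck_distance(diag_A, diag_C))  # 1.0
```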
Fig. 4. The methodology diagram to encode and compare the topology of the evolution of two (or more) networks using TDA. It considers three phases: (i) the data, where we input the evolving network represented by an adjacency matrix; (ii) the topology, where a persistence diagram encodes the topological invariants of the network evolution; and (iii) the distance, held between persistence diagrams representing the topology of the given evolving networks.

Fig. 5. The presented methodology applied to the comparison of the evolution of three networks sharing the same evolution as the network represented in Figure 2 but with differences in phase 2. Each evolving network is associated with one persistence diagram that encodes its topology. This permits us to visualize the relevant topological features of the evolution of the networks.

II. USE-CASE METHODOLOGY TO ENCODE AND COMPARE EVOLVING NETWORKS

The problem of tracking and comparing the evolution of networks can be very demanding and complex due to the combinatorial properties of networks. In the following we describe the methodology to encode and compare the topology of the evolution of networks (as illustrated in Figure 4). It considers persistent homology to encode the topological features of the evolution of a network using persistence diagrams. In that, we first provide the evolving network given by one Boolean adjacency matrix for each phase of network development. We then compute the persistent homology of the evolving network by feeding the concatenated matrices to a suitable algorithm. This encodes the topology of each evolving network, representing it by one unique persistence diagram each. Finally, we measure the bottleneck distance between persistence diagrams to identify how close the evolving networks are to each other based on their topology.
For the purpose of this paper, we used the software library Perseus [16] to compute the homology of the evolving network represented in Figure 2, given by the graph's Boolean adjacency matrix. The network is provided to Perseus as a list of cliques including the time of appearance. The output of that procedure is a persistence diagram that corresponds to the topological changes within the evolution of that network. The evolving network A has the four stages illustrated in Figure 2; the evolving networks B and C are variations of the evolution of the end network of A, with different phases at time t = 2, as represented in Figure 5.

III. THE EVOSOFT EXPERIMENTS

For the purpose of this research we use the EVOSOFT motivation to generate networks that fit that scenario and allow us to compare the evolution of networks in that context. In these preliminary experiments we consider data representing the evolution of networks based on the empirical analysis of the evolution of complex software systems [8]. We generate networks with labeled nodes (not ordered pairs in $\mathbb{R}^2$) and extract all maximal cliques from them. The maximal cliques serve to construct the clique complexes with which we are later able to compute the topology of those networks. We obtain the EVOSOFT evolving networks through their graph's Boolean adjacency matrix. Those matrices must be consistent with the evolution of the network in the sense that existing maximal cliques in phase i must be maintained or enlarged in phase i + 1 during the updates of a software version. The persistence diagrams computed by Perseus then exhibit the encoded topology of evolving networks corresponding to different pieces of software.

The comparison between the topology of a pair of evolving networks given by their adjacency matrices is given by the bottleneck distance between the corresponding diagrams. That distance can be computed using the R library [7]. When considering other evolving networks we can calculate the pairwise distance between all of them and consider single linkage clustering based on this metric (as in earlier TDA applications to gene expression data, as in [12]) to allow classification based on the topology of network evolution.

IV. COMPARISON THROUGH COMPLEX VECTORS

A possible algebraic representation of persistence diagrams is offered by complex polynomials. The method laid out in [6] can avoid tedious and less meaningful computations of the bottleneck distance, since far polynomials represent far persistence diagrams (the converse is known not to be true). A fast comparison of the coefficient vectors can reduce the size of the database to be classified by the bottleneck distance; we can then focus on close persistence diagrams for which we want to calculate precise measures. This should complement existing methods, raising the efficiency of computations for large evolving networks. Given a persistence diagram D described by its points $p_1 = (u_1, v_1), \ldots, p_s = (u_s, v_s)$ with multiplicities $r_1, \ldots, r_s$, respectively, the method considers the complex numbers $z_1 = u_1 + iv_1, \ldots, z_s = u_s + iv_s$. This allows us to associate to D the complex polynomial $f_D(t) = \prod_{j=1}^{s} (t - z_j)^{r_j}$, where $r_j$ is the multiplicity of the point $p_j$. It was shown in [6] that the first k coefficients are the ones carrying most of the relevant information and, therefore, the choice of a threshold k can reduce the computational complexity.
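Since the two-part algorithm of [6] referenced below is unpublished, the following is only our reconstruction of the encoding from the formula above: take each diagram point as a complex root with its multiplicity, expand the polynomial, and compare truncated coefficient vectors. The toy values match the running example discussed next.

```python
import numpy as np

def complex_vector(points, multiplicities, k=None):
    """Coefficients of f_D(t) = prod_j (t - z_j)^(r_j) with
    z_j = u_j + i*v_j, truncated to the first k coefficients."""
    roots = [complex(u, v)
             for (u, v), r in zip(points, multiplicities)
             for _ in range(r)]
    coeffs = np.poly(roots)            # highest-degree coefficient first
    return coeffs[:k] if k else coeffs

# Running example: p_A = p_B has roots 1+3i, 2+4i, 3+4i;
# p_C has root 1+3i and the double root 2+4i.
vec_A = complex_vector([(1, 3), (2, 4), (3, 4)], [1, 1, 1])
vec_C = complex_vector([(1, 3), (2, 4)], [1, 2])
print(np.linalg.norm(vec_A - vec_C))   # fast coefficient-space comparison
```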
Persistent homology of collaboration A = (t − 1 − 3i)(t − 2 − networks. Mathematical problems in engineering, 2013. 4i)(t − 3 − 4i) = pB and pC = (t − 1 − 3i)(t − 2 − 4i)2, not [4] H. Edelsbrunner and J. Harer. Persistent homology-a survey. Contem- considering points of infinite persistence. We then develop the porary mathematics, 453:257–282, 2008. polynomials to identify their coefficients into a complex vector. [5] H. Edelsbrunner and J. L. Harer. Computational Topology. American The distance between the three complex vectors corresponds Mathematical Society, Providence, RI, 2010. to a basic classification of the given evolving networks. This [6] B. D. Fabio and M. Ferri. Comparing persistence diagrams through is not a dense case where we would need additional tools complex vectors. In International Conference on Image Analysis and like complex vectors. Though, real life examples of evolving Processing, pages 294–305, 2015. networks are appropriate cases of such needs due to their [7] B. T. Fasy, J. Kim, F. Lecci, and C. Maria. Introduction to the R package inherent complexity. TDA. arXiv:1411.1830, 2014. [8] T. Galinac and S. Golubić. Project overlapping and its influence on the V. CONCLUSIONS AND FURTHER WORK product quality. In Proceedings of the 8th International Conference on Telecommunications, 2005. ConTEL 2005., volume 2, pages 655–662, In this paper we have discussed the topological data analysis June 2005. of evolving networks. In that we presented a method to encode [9] T. G. Grbac and D. Huljenic. On the probability distribution of faults the topology of the evolution of a given network through in complex software systems. Information & Software Technology, a persistence diagram, and its potential for a classification 58:250–258, 2015. based on a chosen distance between diagrams. The inherent [10] T. G. Grbac, P. Runeson, and D. Huljenic. A second replicated complexity of an evolving network demands for the data quantitative analysis of fault distributions in complex software systems. simplification methods to be available and appropriate to IEEE Trans. Software Eng., 39(4):462–476, 2013. the nature of the considered object. In that, the TDA-based [11] D. Horak, S. Maletić, and M. Rajković. Persistent homology of complex networks. Journal of Statistical Mechanics: Theory and Experiment, methodology in this paper can contribute to the analysis and 2009(3):P03034, 2005. interpretation of evolving networks and their behaviour. The [12] M. Juda. Topological structures in gene expression data. unpuplished experiments in real data are valuable to improve this method. work presented at the Genetic Analysis Workshop 19, 2014. In that, the collaborations with the earlier mentioned Slovenian [13] M. Karlovčec, B. Lužar, and D. Mladenić. Core-periphery dynamics Science Atlas would be welcome, allowing us to further in collaboration networks: the case study of slovenia. Scientometrics, explore the interpretation of the topology of the evolution of 109.3:1561–1578, 2016. these collaborative networks and the distance between them. [14] M. Karlovčec, D. Mladenić, M. Grobelnik, and M. Jermol. Conceptual- Further work includes the processing of EVOSOFT existing ization of science using collaboration and competences. The Electronic Library, 34.1:2–23, 2016. networks, as well as the interpretation of results in the context [15] H. Lee, M. K. Chung, H. Kang, B.-N. Kim, and D. S. Lee. Dis- of that field of knowledge. 
It can provide new challenges specific to the available data and to its role and usage in the field. In particular, the interpretation of the persistent topological features captured in the EVOSOFT experiments represents a relevant open problem that requires a deeper analysis based on the EVOSOFT expertise and the manipulation of the topological results. Lastly, the mathematical development of the complex vector method, which contributes to the study of evolving networks in general, is a rather computational method that is suitable to the application of compatible algorithms, allowing potential engineering applications. Moreover, it is itself a great source of open mathematical problems that we shall consider in further research (e.g. stability [5]).

ACKNOWLEDGMENT

The authors would like to thank Barbara di Fabio for the useful discussions on complex vectors and advice on further research, and Primož Škraba for his comments and suggestions. The first and second authors would like to thank the Croatian Science Foundation for funding the project EVOSOFT (UIP-2014-09-7945). The second author would also like to thank the University of Rijeka for support through Research Grant 13.09.2.2.16.

REFERENCES
[1] A. L. Barabási. Network medicine - from obesity to the diseasome. New England Journal of Medicine, 357(4):404–407, 2007.
[2] G. Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.
[3] C. J. Carstens and K. J. Horadam. Persistent homology of collaboration networks. Mathematical Problems in Engineering, 2013.
[4] H. Edelsbrunner and J. Harer. Persistent homology - a survey. Contemporary Mathematics, 453:257–282, 2008.
[5] H. Edelsbrunner and J. L. Harer. Computational Topology. American Mathematical Society, Providence, RI, 2010.
[6] B. Di Fabio and M. Ferri. Comparing persistence diagrams through complex vectors. In International Conference on Image Analysis and Processing, pages 294–305, 2015.
[7] B. T. Fasy, J. Kim, F. Lecci, and C. Maria. Introduction to the R package TDA. arXiv:1411.1830, 2014.
[8] T. Galinac and S. Golubić. Project overlapping and its influence on the product quality. In Proceedings of the 8th International Conference on Telecommunications, ConTEL 2005, volume 2, pages 655–662, June 2005.
[9] T. G. Grbac and D. Huljenic. On the probability distribution of faults in complex software systems. Information & Software Technology, 58:250–258, 2015.
[10] T. G. Grbac, P. Runeson, and D. Huljenic. A second replicated quantitative analysis of fault distributions in complex software systems. IEEE Trans. Software Eng., 39(4):462–476, 2013.
[11] D. Horak, S. Maletić, and M. Rajković. Persistent homology of complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2009(3):P03034, 2009.
[12] M. Juda. Topological structures in gene expression data. Unpublished work presented at the Genetic Analysis Workshop 19, 2014.
[13] M. Karlovčec, B. Lužar, and D. Mladenić. Core-periphery dynamics in collaboration networks: the case study of Slovenia. Scientometrics, 109.3:1561–1578, 2016.
[14] M. Karlovčec, D. Mladenić, M. Grobelnik, and M. Jermol. Conceptualization of science using collaboration and competences. The Electronic Library, 34.1:2–23, 2016.
[15] H. Lee, M. K. Chung, H. Kang, B.-N. Kim, and D. S. Lee. Discriminative persistent homology of brain networks. IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pages 841–844, 2011.
[16] V. Nanda. Perseus, the persistent homology software. http://www.sas.upenn.edu/~vnanda/perseus. Accessed: 2017-09-05.
[17] G. Petri, M. Scolamiero, I. Donato, and F. Vaccarino. Topological strata of weighted complex networks. PLoS ONE, 8(6):e66506, 2013.
[18] J. Petric and T. G. Grbac. Software structure evolution and relation to system defectiveness. In 18th International Conference on Evaluation and Assessment in Software Engineering, 2014, pages 34:1–34:10, 2014.

Improving mortality prediction for intensive care unit patients using text mining techniques

Primož Kocbek1, Nino Fijačko1, Milan Zorman2, Simon Kocbek1,3, Gregor Štiglic1,2
1 Univerza v Mariboru, Fakulteta za zdravstvene vede, Maribor, Slovenija, +386 2 300 47 00
2 Univerza v Mariboru, Fakulteta za elektrotehniko, računalništvo in informatiko, Maribor, Slovenija, +386 2 220 7000
3 Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, Australia, +61 (02) 9295 8100
{primoz.kocbek, nino.fijacko, milan.zorman, gregor.stiglic}@um.si, skocbek@gmail.com

ABSTRACT
Numerous severity assessment scores for the estimation of in-hospital mortality in the Intensive Care Unit (ICU) have been developed over the last 40 years. In this study, we predicted 1-month mortality in chronic kidney disease (CKD) patients using the open Medical Information Mart for Intensive Care III (MIMIC III) database. Additionally, we observed the improvement in predictive performance and interpretability from the baseline model used in ICUs to more complex models using simple features such as unigrams or bigrams, as well as advanced features extracted from textual nursing notes. For the latter, the MetaMap extraction tool was used to extract medical concepts based on the Unified Medical Language System (UMLS) terminology. We used a logistic regression based classifier, built using the Simplified Acute Physiology Score II (SAPS II), age and gender, as a baseline model. The baseline model was then compared to regularized logistic regression based classifiers built using simple and more complex additional features. The Area Under the ROC Curve (AUC) for the baseline improved from 0.761 to 0.782 when frequencies of unigrams and bigrams were used to build the model. In a similar scenario, where unigram and bigram frequency was replaced with Term Frequency-Inverse Document Frequency (TF-IDF) based feature values, the AUC further increased to 0.786.

This paper represents an opportunity for extracting new knowledge in the form of unigrams, bigrams or concepts extracted from textual notes, accompanied by regression coefficient values that can be interpreted as relations between the features and the outcome. The combination of both can provide added value in decision support systems in ICU departments, where data is collected in electronic medical records (EMRs) in real time.
Categories and subject descriptors
H.2.6 [Information Systems]: Database Machines
H.2.8 [Information Systems]: Database Applications

General Terms
Algorithms, Measurement, Documentation, Performance, Reliability, Experimentation.

Keywords
Text mining, ICU, database, machine learning, mortality prediction.

1. INTRODUCTION
Predicting the mortality of ICU patients is a complex and dynamic process. Critical illness severity assessment scores, such as the Acute Physiology and Chronic Health Evaluation I-IV (APACHE), the Sequential Organ Failure Assessment score (SOFA), the Mortality Probability Model I-III (MPM), or the Simplified Acute Physiology Score I-III (SAPS), help clinicians detect patient problems earlier, thus providing a better holistic treatment for patients and making patient care more cost-effective. The sheer number of different severity scores in use is partly due to the quality of recorded data needed to calculate them. An example of such a severity score is APACHE IV, which tends to have the best discriminative performance, but the data needed to compute the score is complex and hospitals would need to develop a good enough high-quality database for analysis of risk stratification [1, 2].

The MIMIC III database [3], a free public-access intensive care unit repository, is widely used for predicting the mortality of ICU patients. Its developers provided several severity scores for the database (e.g., OASIS, SAPS, SAPS II, SOFA), but also noted that for APACHE IV the coding of the diagnostic component is difficult and might lack accuracy [4]. Part of MIMIC III are free-text nursing notes, which represent a good candidate source of information for mortality risk prediction, as they contain a detailed and regularly updated record of the interventions performed, medications administered, vital signs, and physical examination findings, all of which carry highly specific information about the patient's dynamic physiological state and eventual outcome [5]. Because such data is unstructured, our purpose is knowledge discovery, where we observe the improvements in predictive performance and interpretability of predictive models based on additional features extracted from nursing notes collected in EMRs. More precisely, we aim to predict one-month mortality in CKD (ICD-9 code 585.x) patients and compare the improvement over the baseline model performance obtained by including simple features such as unigrams or bigrams, as well as more advanced features extracted from text in the form of medical concepts, using MetaMap [6] to define the mapping between textual notes and UMLS terminology.

2. METHODS
The data were obtained from the MIMIC III database, version 1.3, selecting 58,976 hospitalizations of 46,520 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center in Boston between 2001 and 2012. The database included 26 linked tables, which can be merged mostly by patient or hospitalization identification numbers. Our focus were the nursing notes (i.e., free-text notes from patients with a CKD diagnosis), where we excluded hospitalizations of patients that died within 24 hours of admission and nursing notes that were not fully updated, where duplication of data was likely. That left us with 10,867 nursing notes from 4,381 hospitalizations.
The first nursing note was taken on average 7.8 hours after admission (85.2% of hospitalizations have at least two notes), the second on average 14.7 hours after admission (38.1% had at least 3 taken), and the third on average 17.5 hours after admission. A slight majority of the hospitalized patients were male (59.4%), with an average age of 65.6 (standard deviation (SD) 15.2) and a 13.4% mortality rate (death during or up to one month after the hospitalization was recorded). The developers of the database also included source code for the calculation of several severity scores (i.e., OASIS, SAPS, SAPS II and SOFA), and we selected SAPS II as the main feature in the baseline model, since it is used on a daily basis in hospitals. The input features of our baseline model consisted of the SAPS II score, age and gender.

Nursing notes were initially processed using traditional text extraction algorithms with stemming and removal of stop words, which produced 51,680 unique unigrams and 363,055 unique bigrams. Both frequency and TF-IDF tables were prepared. The text from the nursing notes was also processed using the MetaMap tool from the US National Library of Medicine. MetaMap identifies and normalizes biomedical terminology from the Unified Medical Language System (UMLS). Binary representations of Bags of Phrases (BOP) identified by MetaMap, and their UMLS Concept Unique Identifiers (CUIs), were used as features for the classifier. Space characters in phrases were replaced with the underline character. Word sense disambiguation was used to distinguish similar words with the same structure.
In addition, phrases were marked according to whether the associated concepts were found in a positive or negative context. To identify the polarity of phrases (negative or positive), the NegEx module [7] of MetaMap was enabled. NegEx implements a simple algorithm that contains several regular expressions indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases.

For better understanding, we provide a short example of MetaMap-annotated phrases from part of the sentence "URINE MICROSCOPY" (Table 1). The meta candidates are all discovered mappings, ordered according to an evaluation metric described in [7], while the meta mappings represent the selected phrases which finally represent features in our model. Please note that several meta mappings may be found in a sentence. Parentheses contain the concept's preferred name, while square brackets contain the concept's semantic type.

Table 1. Example of MetaMap-annotated phrases from part of the sentence "URINE MICROSCOPY"
Meta Candidates (Score - Matched concept):
1000 - C0430397: Urine microscopy (Microscopic urinalysis) [Laboratory Procedure]
861 - C0026018: Microscopy [Laboratory Procedure]
789 - C0205288: Microscopic [Qualitative Concept]
694 - C0042036: Urine [Body Substance]
694 - C0042037: Urine (In Urine) [Functional Concept]
694 - C2963137: Urine (Portion of urine) [Body Substance]
Meta Mapping (Score - Matched concept):
1000 - C0430397: Urine microscopy (Microscopic urinalysis) [Laboratory Procedure]

One of our goals was interpretability and avoidance of over-fitting; we therefore restricted model building to regularized linear models, and further narrowed the selection to L1-regularized models, i.e., the least absolute shrinkage and selection operator (lasso), which includes the feature selection functionality needed due to the high number of features in our datasets [8]. We expanded on the work of Marafino et al. [4], who used the MIMIC II dataset and predicted mortality via stochastic gradient descent-based classifiers with TF-IDF on the extracted unigrams and bigrams for patients that died during the given ICU stay. All experiments were implemented in the R language and environment for statistical computing [9], using the glmnet package [10] to build and validate the predictive models.
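The models were fit with R's glmnet; as an illustrative equivalent, the sketch below builds unigram/bigram TF-IDF features and an L1-penalized (lasso) logistic regression in Python with scikit-learn. The regularization strength, stop-word handling and the absence of stemming are our simplifications, not the authors' settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # CountVectorizer: frequency variant
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# notes: list of nursing-note strings; y: one-month mortality labels (assumed inputs)
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(notes, y)

# The nonzero lasso coefficients are the selected unigrams/bigrams and can
# be read as relations between the features and the outcome.
coef = model.named_steps["logisticregression"].coef_.ravel()
terms = model.named_steps["tfidfvectorizer"].get_feature_names_out()
selected = [(t, c) for t, c in zip(terms, coef) if c != 0]
```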
3. RESULTS
The results presented in this section were obtained from four scenarios in which we compared different combinations of two types of extracted features (n-grams versus concepts) and two types of extracted feature values (frequency vs. TF-IDF). The baseline classifier with basic features (SAPS II score, age and gender) was used to evaluate the performance gain when more complex classifiers were built. Initially, we were interested in measuring the improvement of the baseline predictive performance when adding unigram and bigram features to SAPS II, age and gender of the patients. At the same time, we were observing the complexity of the predictive models through the number of features included in the models.

The second row in Table 2 presents the results when frequencies of unigrams and bigrams were used to build the model. It can be seen that the baseline predictive performance improved from an AUC of 0.761 to 0.782. Alongside the improvement of more than 2% in AUC, it is also important to note that in this scenario we obtained the simplest models in terms of interpretation, with only 9 features on average. A similar scenario, where unigram and bigram frequency was replaced with TF-IDF based feature values, resulted in a further improvement, with an AUC of 0.786. In the next two experiments, we replaced unigrams and bigrams with UMLS concepts extracted from the free text (nursing notes). Table 2 demonstrates a further improvement of the classification performance, as the AUC in the case of TF-IDF increased to 0.789, while an even further improvement, with a mean AUC of 0.791, was measured in the frequency-based features experiment. Tables 3 and 4 provide a more detailed overview of the selected features, along with the number of times a feature was included in a predictive model during 100 cross-validation runs. It can be seen that TF-IDF produced predictive models with a larger number of features and therefore represents a richer set of concepts that can be used to warn a medical expert of a potential threat to a patient. On the other hand, a complex model (in the case of the TF-IDF experiment, more than 35 features were used in a model on average) might represent a challenge for medical experts when interpretation of the models is needed.

Table 2. Summary of predictive performance measures for different experiments using features extracted from nursing notes (intervals in brackets)
Experiment | AUC | Sensitivity | Specificity | PPV | NPV | Selected features
Baseline (SAPS II) | 0.761 [0.757-0.766] | 0.712 [0.704-0.720] | 0.687 [0.680-0.695] | 0.283 [0.277-0.288] | 0.933 [0.931-0.935] | 1.0 [1.0-1.0]
Unigrams and bigrams (frequency) | 0.782 [0.778-0.786] | 0.727 [0.721-0.734] | 0.714 [0.707-0.722] | 0.306 [0.300-0.313] | 0.939 [0.937-0.940] | 9.1 [6.6-11.6]
UMLS concept mapping (frequency) | 0.791 [0.787-0.795] | 0.736 [0.728-0.745] | 0.716 [0.708-0.724] | 0.310 [0.304-0.316] | 0.941 [0.939-0.943] | 17.6 [14.8-20.5]
Unigrams and bigrams (TF-IDF) | 0.786 [0.782-0.790] | 0.733 [0.725-0.740] | 0.712 [0.704-0.720] | 0.306 [0.300-0.313] | 0.940 [0.938-0.941] | 25.1 [21.7-28.6]
UMLS concept mapping (TF-IDF) | 0.789 [0.785-0.793] | 0.747 [0.741-0.754] | 0.700 [0.692-0.709] | 0.302 [0.296-0.308] | 0.942 [0.940-0.943] | 35.5 [31.0-40.0]
4. DISCUSSION AND CONCLUSIONS
In this paper, we observed the improvements in predictive performance and interpretability of predictive models based on new features extracted from nursing notes collected in EMRs. More precisely, we predicted one-month mortality, at the end of 24 hours spent in the ICU, for CKD patients. The improvement of the baseline model (SAPS II, gender and age) in comparison to predictive models that included unigrams and bigrams, as well as more advanced features extracted from text in the form of medical concepts using the MetaMap extraction tool, was also observed. The results show a high level of predictive performance that can be compared to a similar study by Brabrand et al. [11], where it was shown that using the clinical intuition of the admission staff produced comparable predictions in terms of AUC when identifying patients at risk of dying. However, it has to be noted that Brabrand and colleagues did not focus on a specific group of patients.

When providing predictive models for healthcare experts to support their work in clinical practice, we should also pay attention to the interpretability of such models. As already noted in [12], in case of similar predictive performance on the training set, the simplest models often also perform best on the test set. Therefore, we should take the complexity of models with similar performance into account. In the case of our study, the complexity of the four proposed models ranges from 9 up to approximately 35 selected features. In the case of the unigram and bigram based models, it would perhaps make sense to use the simpler, frequency-based model, as it does not significantly differ in predictive performance while having significantly lower complexity.

Table 3. Frequency of specific features selected in the UMLS concept mapping (frequency) experiment
Single_count_all_CKD_all_feat - N
DNR_(DNR_-_Do_not_resuscitate)_[Finding] - 100
Map_(Functional_Map)_[Conceptual_Entity] - 100
SAPSII - 100
Meeting_(Meetings)_[Health_Care_Activity] - 100
Coccyx_(Entire_coccyx)_[Body_Part_Organ_or_Organ_Component] - 93
PICC_line_(Peripherally_inserted_central_catheter_(physical_object))_[Medical_Device] - 86
Anuria_[Disease_or_Syndrome] - 85
CMO_(Chronic_multifocal_osteomyelitis)_[Disease_or_Syndrome] - 82
Family_[Family_Group] - 70
vascular_(Blood_Vessel)_[Body_Part_Organ_or_Organ_Component] - 51
Bilirubin_[Biologically_Active_Substance,Organic_Chemical] - 45
Prognosis_(Forecast_of_outcome)_[Health_Care_Activity] - 45
error_[Qualitative_Concept] - 35
Necrotic_(Necrosis)_[Organ_or_Tissue_Function] - 34
Pleural_effusion_(Pleural_effusion_fluid)_[Body_Substance] - 28
Poor_prognosis_(Prognosis_bad)_[Finding] - 28
Thick_[Qualitative_Concept] - 28
Hypotensive_[Pathologic_Function] - 27
dysfunction_(physiopathological)_[Functional_Concept] - 25
Brain_[Body_Part_Organ_or_Organ_Component] - 23

Table 4. Frequency of specific features selected in the UMLS concept mapping (TF-IDF) experiment
Single_tfidf_all_CKD_all_feat - N
DNR_(DNR_-_Do_not_resuscitate)_[Finding] - 100
SAPSII - 100
Meeting_(Meetings)_[Health_Care_Activity] - 98
CMO_(Chronic_multifocal_osteomyelitis)_[Disease_or_Syndrome] - 98
PICC_line_(Peripherally_inserted_central_catheter_(physical_object))_[Medical_Device] - 97
Family_[Family_Group] - 96
Coccyx_(Entire_coccyx)_[Body_Part_Organ_or_Organ_Component] - 88
Heels_(Heel)_[Body_Location_or_Region] - 88
Pressors_[Pharmacologic_Substance] - 83
Map_(Functional_Map)_[Conceptual_Entity] - 74
Levophed_[Organic_Chemical,Pharmacologic_Substance] - 73
loosen_(Loosening)_[Functional_Concept] - 70
neg_DNR_(DNR_-_Do_not_resuscitate)_[Finding] - 68
Worsening_(Worse)_[Qualitative_Concept] - 68
error_[Qualitative_Concept] - 67
dysfunction_(physiopathological)_[Functional_Concept] - 64
Bilirubin_[Biologically_Active_Substance,Organic_Chemical] - 62
Anasarca_[Pathologic_Function] - 59
Coccyx_(Bone_structure_of_coccyx)_[Body_Part_Organ_or_Organ_Component] - 51
Poor_prognosis_(Prognosis_bad)_[Finding] - 50

From the most frequently selected features (Tables 3 and 4) we can observe some very general concepts, like "do not resuscitate (DNR)" and a high SAPS II score, indicating higher mortality. Also, family related concepts can be easily interpreted by the fact that physicians usually call family members to discuss the severity of the situation, especially when the situation is critical or life threatening. Additional features indicating higher mortality are concepts related to bones like "coccyx" and "heel", which could indicate specific problems related to calcification, frequently observed in CKD patients. Medical terms such as "peripherally inserted central catheter", "chronic multifocal osteomyelitis" and the use of pressors (pharmacologic substance) could be interpreted as signs of a worsening health situation.
We plan to investigate these conclusions in more detail in future work. It is also interesting to note that the UMLS frequency model selected the medical term "Anuria" (non-passage, or passage of less than 100 milliliters, of urine a day), which was not selected in the UMLS TF-IDF model.

Further development of our model will include extensions where a shorter (e.g., 6 or 12 hours) period would be used to provide an "early warning" signal to healthcare experts working in the ICU. Additional features can be extracted from MIMIC-III that would further improve the predictive performance and possibly also the interpretability of the models.

5. ACKNOWLEDGEMENT
The authors would like to acknowledge financial support from the Slovenian Research Agency (research core funding No. P2-0057 and bilateral grant ARRS-BI-US/16-17-064).

6. REFERENCES
[1] Keegan, M. T., Gajic, O., and Afessa, B. 2012. Comparison of APACHE III, APACHE IV, SAPS 3, and MPM0III and influence of resuscitation status on model performance. Chest, 142, 4 (April 2012), 851–858. DOI=10.1378/chest.11-2164.
[2] Wu, V. C., Tsai, H. B., Yeh, Y. C., Huang, T. M., Lin, Y. F., Chou, N. K., ... and Wu, M. S. 2010. Patients supported by extracorporeal membrane oxygenation and acute dialysis: acute physiology and chronic health evaluation score in predicting hospital mortality. Artificial Organs, 34, 10 (May 2010), 828–835. DOI=10.1111/j.1525-1594.2009.00920.x.
[3] Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A. and Mark, R. G., 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3.
[4] Marafino, B. J., Boscardin, W. J., and Dudley, R. A. 2015. Efficient and sparse feature selection for biomedical text classification via the elastic net: Application to ICU risk stratification from nursing notes. Journal of Biomedical Informatics, 54 (April 2015), 114–120. DOI=10.1016/j.jbi.2015.02.003.
[5] MIT Laboratory for Computational Physiology - mimic-code: 2017. https://github.com/MIT-LCP/mimic-code/tree/master/concepts/severityscores. Accessed: 2017-09-04.
[6] MetaMap: Mapping text to the UMLS Metathesaurus. 2006. https://pdfs.semanticscholar.org/e262/a22134cca0e484be1095160cc4ec8d9e7624.pdf. Accessed: 2017-09-04.
[7] Chapman, W. W., Bridewell, W., Hanbury, P., Cooper, G. F., and Buchanan, B. G. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34, 5 (October 2001), 301–310. DOI=10.1006/jbin.2001.1029.
[8] Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288.
[9] R Core Team, 2013. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
[10] Friedman, J., Hastie, T. and Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), p. 1.
[11] Brabrand, M., Hallas, J., and Knudsen, T. 2014. Nurses and physicians in a medical admission unit can accurately predict mortality of acutely admitted patients: a prospective cohort study. PLoS ONE, 9, 7 (July 14), e101739. DOI=10.1371/journal.pone.0101739.
[12] Stiglic, G., Kocbek, S., Pernek, I., and Kokol, P. 2012. Comprehensive decision tree models in bioinformatics. PLoS ONE, 7, 3 (March 30), e33812. DOI=10.1371/journal.pone.0033812.

Indeks avtorjev / Author index

Belyaeva Evgenia 15
Brank Janez 7
Doyle Casey .......... 27
Fijacko Nino .......... 47
Fortuna Blaz .......... 35
Fuart Flavio .......... 15
Galinac Grbac Tihana .......... 43
Grobelnik Marko .......... 7, 15
Herga Zala .......... 27
Jovanoski Viktor .......... 35
Karlovcec Mario .......... 35
Kenda Klemen .......... 39
Kladnik Matic .......... 23
Kocbek Primoz .......... 47
Leban Gregor .......... 7, 15
Mladenic Dunja .......... 11, 23, 39
Moore Pat .......... 27
Novak Erik .......... 31
Novalija Inna .......... 31
Pita Costa Joao .......... 15, 43
Pollak Senja .......... 19
Repar Andraz .......... 19
Rupnik Jan .......... 35
Torkar Miha .......... 11

Konferenca / Conference
Odkrivanje znanja in podatkovna skladišča - SiKDD /
Data Mining and Data Warehouses - SiKDD

Uredila / Edited by
Dunja Mladenić, Marko Grobelnik