Zbornik 21. mednarodne multikonference INFORMACIJSKA DRUŽBA - IS 2018 Zvezek C Proceedings of the 21st International Multiconference INFORMATION SOCIETY - IS 2018 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 8.–12. oktober 2018 / 8–12 October 2018 Ljubljana, Slovenia Zbornik 21. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2018 Zvezek C Proceedings of the 21st International Multiconference INFORMATION SOCIETY – IS 2018 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 8.–12. oktober 2018 / 8–12 October 2018 Ljubljana, Slovenia Urednika: Dunja Mladenić Laboratorij za umetno inteligenco Institut »Jožef Stefan«, Ljubljana Marko Grobelnik Laboratorij za umetno inteligenco Institut »Jožef Stefan«, Ljubljana Založnik: Institut »Jožef Stefan«, Ljubljana Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak Oblikovanje naslovnice: Vesna Lasič Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety Ljubljana, oktober 2018 Informacijska družba ISSN 2630-371X Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani COBISS.SI-ID=31884839 ISBN 978-961-264-137-5 (pdf) PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2018 Multikonferenca Informacijska družba (http://is.ijs.si) je z enaindvajseto zaporedno prireditvijo osrednji srednjeevropski dogodek na področju informacijske družbe, računalništva in informatike. Letošnja prireditev se ponovno odvija na več lokacijah, osrednji dogodki pa so na Institutu »Jožef Stefan«. Informacijska družba, znanje in umetna inteligenca so še naprej nosilni koncepti človeške civilizacije. Se bo neverjetna rast nadaljevala in nas ponesla v novo civilizacijsko obdobje ali pa se bo rast upočasnila in začela stagnirati? Bosta IKT in zlasti umetna inteligenca omogočila nadaljnji razcvet civilizacije ali pa bodo demografske, družbene, medčloveške in okoljske težave povzročile zadušitev rasti? Čedalje več pokazateljev kaže v oba ekstrema – da prehajamo v naslednje civilizacijsko obdobje, hkrati pa so notranji in zunanji konflikti sodobne družbe čedalje težje obvladljivi. Letos smo v multikonferenco povezali 11 odličnih neodvisnih konferenc. Predstavljenih bo 215 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica, ki se ponaša z 42-letno tradicijo odlične znanstvene revije. Multikonferenco Informacijska družba 2018 sestavljajo naslednje samostojne konference:  Slovenska konferenca o umetni inteligenci  Kognitivna znanost  Odkrivanje znanja in podatkovna skladišča – SiKDD  Mednarodna konferenca o visokozmogljivi optimizaciji v industriji, HPOI  Delavnica AS-IT-IC  Soočanje z demografskimi izzivi  Sodelovanje, programska oprema in storitve v informacijski družbi  Delavnica za elektronsko in mobilno zdravje ter pametna mesta  Vzgoja in izobraževanje v informacijski družbi  5. 
študentska računalniška konferenca  Mednarodna konferenca o prenosu tehnologij (ITTC) Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, Slovensko društvo za umetno inteligenco (SLAIS), Slovensko društvo za kognitivne znanosti (DKZ) in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju. V letu 2018 bomo šestič podelili nagrado za življenjske dosežke v čast Donalda Michieja in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe bo prejel prof. dr. Saša Divjak. Priznanje za dosežek leta bo pripadlo doc. dr. Marinki Žitnik. Že sedmič podeljujemo nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono letos prejme padanje državnih sredstev za raziskovalno dejavnost, jagodo pa Yaskawina tovarna robotov v Kočevju. Čestitke nagrajencem! Mojca Ciglarič, predsednik programskega odbora Matjaž Gams, predsednik organizacijskega odbora i FOREWORD - INFORMATION SOCIETY 2018 In its 21st year, the Information Society Multiconference (http://is.ijs.si) remains one of the leading conferences in Central Europe devoted to information society, computer science and informatics. In 2018, it is organized at various locations, with the main events taking place at the Jožef Stefan Institute. Information society, knowledge and artificial intelligence continue to represent the central pillars of human civilization. Will the pace of progress of information society, knowledge and artificial intelligence continue, thus enabling unseen progress of human civilization, or will the progress stall and even stagnate? Will ICT and AI continue to foster human progress, or will the growth of human, demographic, social and environmental problems stall global progress? Both extremes seem to be playing out to a certain degree – we seem to be transitioning into the next civilization period, while the internal and external conflicts of the contemporary society seem to be on the rise. The Multiconference runs in parallel sessions with 215 presentations of scientific papers at eleven conferences, many round tables, workshops and award ceremonies. Selected papers will be published in the Informatica journal, which boasts of its 42-year tradition of excellent research publishing. The Information Society 2018 Multiconference consists of the following conferences:  Slovenian Conference on Artificial Intelligence  Cognitive Science  Data Mining and Data Warehouses - SiKDD  International Conference on High-Performance Optimization in Industry, HPOI  AS-IT-IC Workshop  Facing demographic challenges  Collaboration, Software and Services in Information Society  Workshop Electronic and Mobile Health and Smart Cities  Education in Information Society  5th Student Computer Science Research Conference  International Technology Transfer Conference (ITTC) The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e. 
the Slovenian chapter of the ACM, Slovenian Artificial Intelligence Society (SLAIS), Slovenian Society for Cognitive Sciences (DKZ) and the second national engineering academy, the Slovenian Engineering Academy (IAS). On behalf of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contribution and their interest in this event, and the reviewers for their thorough reviews. For the sixth year, the award for life-long outstanding contributions will be presented in memory of Donald Michie and Alan Turing. The Michie-Turing award will be given to Prof. Saša Divjak for his life-long outstanding contribution to the development and promotion of information society in our country. In addition, an award for current achievements will be given to Assist. Prof. Marinka Žitnik. The information lemon goes to decreased national funding of research. The information strawberry is awarded to the Yaskawa robot factory in Kočevje. Congratulations! Mojca Ciglarič, Programme Committee Chair Matjaž Gams, Organizing Committee Chair ii KONFERENČNI ODBORI CONFERENCE COMMITTEES International Programme Committee Organizing Committee Vladimir Bajic, South Africa Matjaž Gams, chair Heiner Benking, Germany Mitja Luštrek Se Woo Cheon, South Korea Lana Zemljak Howie Firth, UK Vesna Koricki Olga Fomichova, Russia Mitja Lasič Vladimir Fomichov, Russia Blaž Mahnič Vesna Hljuz Dobric, Croatia Jani Bizjak Alfred Inselberg, Israel Tine Kolenik Jay Liebowitz, USA Huan Liu, Singapore Henz Martin, Germany Marcin Paprzycki, USA Karl Pribram, USA Claude Sammut, Australia Jiri Wiedermann, Czech Republic Xindong Wu, USA Yiming Ye, USA Ning Zhong, USA Wray Buntine, Australia Bezalel Gavish, USA Gal A. Kaminka, Israel Mike Bain, Australia Michela Milano, Italy Derong Liu, USA Toby Walsh, Australia Programme Committee Franc Solina, co-chair Matjaž Gams Vladislav Rajkovič Viljan Mahnič, co-chair Marko Grobelnik Grega Repovš Cene Bavec, co-chair Nikola Guid Ivan Rozman Tomaž Kalin, co-chair Marjan Heričko Niko Schlamberger Jozsef Györkös, co-chair Borka Jerman Blažič Džonova Stanko Strmčnik Tadej Bajd Gorazd Kandus Jurij Šilc Jaroslav Berce Urban Kordeš Jurij Tasič Mojca Bernik Marjan Krisper Denis Trček Marko Bohanec Andrej Kuščer Andrej Ule Ivan Bratko Jadran Lenarčič Tanja Urbančič Andrej Brodnik Borut Likar Boštjan Vilfan Dušan Caf Mitja Luštrek Baldomir Zajc Saša Divjak Janez Malačič Blaž Zupan Tomaž Erjavec Olga Markič Boris Žemva Bogdan Filipič Dunja Mladenič Leon Žlajpah Andrej Gams Franc Novak iii iv KAZALO / TABLE OF CONTENTS Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD ....... 1 PREDGOVOR / FOREWORD ....................................................................................................................... 3 PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ........................................................................... 4 Preparing Multi-Modal Data for Natural Language Processing / Novak Erik, Urbančič Jasna, Jenko Miha 5 Towards Smart Statistics in Labour Market Domain / Novalija Inna, Grobelnik Marko ................................ 9 Relation Tracker - Tracking the Main Entities and Their Relations Through Time / Massri M. 
Besher, Novalija Inna, Grobelnik Marko ..............................................................................................................13 Cross-Lingual Categorization of News Articles / Novak Blaž .....................................................................17 Transporation Mode Detection Using Random Forest / Urbančič Jasna, Pejović Veljko, Mladenić Dunja ......................................................................................................................................................21 FSADA, an Anomaly Detection Approach / Jovanoski Viktor, Rupnik Jan ................................................25 Predicting Customers at Risk With Machine Learning / Gojo David, Dujič Darko .....................................29 Text Mining Medline to Support Public Health / Pita Costa Joao, Stopar Luka, Fuart Flavio, Grobelnik Marko, Santanam Raghu, Sun Chenlu, Carlin Paul, Black Michaela, Wallace Jonathan ......................33 Crop Classification Using PerceptiveSentinel / Koprivec Filip, Čerin Matej, Kenda Klemen .....................37 Towards a Semantic Repository of Data Mining and Machine Learning Datasets / Kostovska Ana, Džeroski Sašo, Panov Panče .................................................................................................................41 Towards a Semantic Store of Data Mining Models and Experiments / Tolovski Ilin, Džeroski Sašo, Panov Panče......................................................................................................................................................45 A Graph-Based Prediction Model With Applications / London András, Németh József, Krész Miklós ......49 Indeks avtorjev / Author index ......................................................................................................................55 v vi Zbornik 21. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2018 Zvezek C Proceedings of the 21st International Multiconference INFORMATION SOCIETY – IS 2018 Volume C Odkrivanje znanja in podatkovna skladišča - SiKDD Data Mining and Data Warehouses - SiKDD Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 11. oktober 2018 / 11 October 2018 Ljubljana, Slovenia 1 2 PREDGOVOR Tehnologije, ki se ukvarjajo s podatki so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskem procesiranju ampak tudi analitskim vpogledom v podatke – pojavilo se je t.i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line-Analytical-Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore in na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz. z drugimi besedami to, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj zajetih v podatkih. 
Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve. INTRODUCTION Data driven technologies have significantly progressed after mid 90’s. The first phases were mainly focused on storing and efficiently accessing the data, resulted in the development of industry tools for managing large databases, related standards, supporting querying languages, etc. After the initial period, when the data storage was not a primary problem anymore, the development progressed towards analytical functionalities on how to extract added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line-Analytical-Processing entered as a usual part of a company’s information system portfolio, requiring from the user to set well defined questions about the aggregated views to the data. Data Mining is a technology developed after year 2000, offering automatic data analysis trying to obtain new discoveries from the existing data and enabling a user new insights in the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis, Data, Text and Multimedia Mining, Semantic Technologies, Link Detection and Link Analysis, Social Network Analysis, Data Warehouses. 3 PROGRAMSKI ODBOR / PROGRAMME COMMITTEE Dunja Mladenić, Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana Marko Grobelnik, Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana 4 Preparing multi-modal data for natural language processing Erik Novak Jasna Urbančič Miha Jenko Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan International Ljubljana, Slovenia Ljubljana, Slovenia Postgraduate School jasna.urbancic@ijs.si miha.jenko@ijs.si Ljubljana, Slovenia erik.novak@ijs.si ABSTRACT to find similar items based on the model input. Throughout the In education we can find millions of video, audio and text educa- paper we focus on educational material but the approach can be tional materials in different formats and languages. This variety and generalized to other multi-modal data sets. multimodality can impose difficulty on both students and teachers The reminder of the paper is structured as follows. In section 2 since it is hard to find the right materials that match their learning we go over related work. Next, we present the data preprocessing preferences. This paper presents an approach for retrieving and pipeline which is able to process different types of data – text, video recommending items of different modalities. The main focus is on and audio – and describe each component of the pipeline in section the retrieving and preprocessing pipeline, while the recommenda- 3. A content based recommendation model that uses Wikipedia tion engine is based on the k-nearest neighbor method. We focus concepts to compare materials is presented in section 4. Finally, we on educational materials, which can be text, audio or video, but the present future work and conclude the paper in section 5. proposed procedure can be generalized on any type of multi-modal data. 2 RELATED WORK KEYWORDS In this section we present the related work which the rest of the paper is based on. We split this section into subsections – multi- Multi-modal data preprocessing, machine learning, feature extrac- modal data preprocessing and recommendation models. 
tion, recommender system, open educational resources Multi-modal Data Preprocessing. Multi-modal data can be seen ACM Reference Format: as classes of different data types from which we can extract similar Erik Novak, Jasna Urbančič, and Miha Jenko. 2018. Preparing multi-modal features. In the case of educational material the classes are video, data for natural language processing. In Proceedings of Slovenian KDD Con- audio and text. One of the approaches is to extract text from all ference (SiKDD’18). ACM, New York, NY, USA, Article 4, 4 pages. https: class types. In [6] the authors describe a Machine Learning and //doi.org/10.475/123_4 Language Processing automatic speech recognition system that can convert audio to text in the form of transcripts. The system can 1 INTRODUCTION also process video files as they are also able to extract audio from There are millions of educational materials that are found in dif- it. Their model was able to achieve a 13.3% word error rate on an ferent formats – courses, video lectures, podcasts, simple text doc- English test set. These kind of systems are useful for extracting uments, etc. Because of its vast variety and multimodality it is text from audio and video but would need to have a model for each difficult for both students and teachers to find the right materi- language. als that will match their learning preferences. Some like to read a Recommendation models. These models are broadly used in short scientific papers while others just like to sit back and watch many fields – from recommending videos based on what the user a lecture that can last for hours. Additionally, materials are written viewed in the past, to providing news articles that the user might in different languages, which is a barrier for people who are not be interested in. One of the most used approaches is based on fluent in the language the material is written in. Finding a good collaborative filtering [16], which finds users that have similar approach of providing educational material would help improving preferences with the target user and recommends items based on their learning experience. their ratings. Recommender systems now do not contain only one In this paper we present a preprocessing pipeline which is able algorithm but multiple which return different recommendations. to process multi-modal data and input it in a common semantic Authors of [10] discuss about the various algorithms that are used space. The semantic space is based on Wikipedia concepts extracted in the Netflix recommender system (top-n video ranker, trending from the content of the materials. Additionally, we developed a con- now, continue watching, and video-video similarity), as well as the tent based recommendation model which uses Wikipedia concepts methods they use to evaluate their system. A high level description of the Youtube recommender system is found in [3]. They developed Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed a candidate generation model and a ranking model using deep for profit or commercial advantage and that copies bear this notice and the full citation learning. Both Netflix and Youtube recommend videos based on on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). users’ interaction with them and the users history. 
To some extent SiKDD’18, October 2018, Ljubljana, Slovenia this can be used for educational resources but cannot be generalized © 2018 Copyright held by the owner/author(s). on the whole multi-modal data set since we cannot acquire data ACM ISBN 123-4567-24-567/08/06. about users’ interaction with, for instance, text. https://doi.org/10.475/123_4 5 SiKDD’18, October 2018, Ljubljana, Slovenia Erik Novak, Jasna Urbančič, and Miha Jenko A collaborative filtering based recommendation system for the Crawling. The first step is to acquire the educational materials. We educational sector is presented in [8]. They evaluated educational have targeted four different OER repositories (MIT OpenCourse- content using big data analysis techniques and recommended courses Ware, Università di Bologna, Université de Nantes and Videolec- to students by using their grades obtained in other subjects. This tures.NET), for which we used their designated APIs or developed gives us insight into how recommendations can be used in educa- custom crawlers to acquire their resources. For each material we tion but our focus is to recommend educational materials rather acquired its metadata, such as the materials title, url, type, language than courses. In a sense courses can be viewed as bundles of ed- in which it is written and its provider. These values are used in the ucational material; thus, our interest is recommending “parts of following steps of the pipeline as well as to represent the material courses” to the user. in the recommendations. Formatting. Next, we format the acquired material metadata. We designate which attributes every material needs to have as well as 3 DATA PREPROCESSING set placeholders for the features extracted in the following steps In this paper we focus on open educational resources (OER), which of the pipeline. By formatting the data we set a schema which are freely accessible, openly licensed text, media, and other digi- makes checking which attributes are missing easy. We do not have tal assets that are useful for teaching, learning and assessing [21]. a mechanism for handling missing attributes in the current pipeline These are found in different OER repositories maintained by univer- iteration but we will dedicate time to solve this problem in the sities, such as MIT OpenCourseWare [12], Università di Bologna [7], future. Université de Nantes [4] and Universitat Politècnica de València [5], Text Extraction. The third step, we extract the content of each as well as independent repositories such as Videolectures.NET [20], material in text form. Since the material can be a text, video or a United Nations award-winning free and open access educational audio file to handled each file type separately. video lectures repository. For text we employed textract [1] to extract raw text from the For processing the different OER we developed a preprocessing given text documents. The module omits figures and returns the pipeline that can handle each resource type and output metadata content as text. The extracted text is not perfect - in the case of used for comparing text, audio and video materials. The pipeline is materials for mathematics it does not know how to represent mathe- an extension of the one described in [11]; its architecture is shown matical equations and symbols. In that case, it replaces the equations in figure 1. What follows are the descriptions of each component with textual noise. Currently we do nothing to handle this problem in the preprocessing pipeline. 
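The text-extraction step just described can be summarised with a small dispatcher that routes each material to the appropriate extractor by file type. This is only an illustrative sketch, not the authors' implementation: the paper uses the Node.js textract module and the transLectures service, so the two helper functions below (extract_with_textract, transcribe_with_translectures) are hypothetical stand-ins for those external tools.

```python
TEXT_TYPES = {"pdf", "docx", "pptx", "txt"}
AV_TYPES = {"mp4", "wmv", "mp3"}

def extract_with_textract(url):
    # Placeholder for the Node.js textract module used in the paper;
    # here it only simulates returning the document's raw text.
    return f"<raw text of {url}>"

def transcribe_with_translectures(url):
    # Placeholder for the transLectures transcription/translation service.
    return f"<transcript of {url}>"

def extract_content(material):
    """Return the raw-text content of a material, choosing the extractor
    by file type, as in the pipeline's text-extraction step."""
    ext = material["url"].rsplit(".", 1)[-1].lower()
    if ext in TEXT_TYPES:
        # text documents: extract raw text (figures are omitted)
        return extract_with_textract(material["url"])
    if ext in AV_TYPES:
        # video/audio: use subtitles or an automatically generated transcript
        return transcribe_with_translectures(material["url"])
    raise ValueError(f"unsupported file type: {ext}")
```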
For now we simply use the extracted output as is. For video and audio we use the subtitles and/or transcriptions to represent the material's content. To do this, we use transLectures [18], which generates transcriptions and translations of a given video or audio file. The languages it supports are English, Spanish, German and Slovene. The output of the service is in the dfxp format [17], a standard for xml captions and subtitles based on the timed text markup language, from which we extract the raw text.

[Figure 1: The preprocessing pipeline architecture (crawling, formatting, text extraction with textract and transLectures, wikification, storing). It is designed to handle each data type as well as extract features to support multi- and cross-linguality.]

Wikification. Next, we send the material through wikification - a process which identifies and links the material's textual components to the corresponding Wikipedia pages [15]. This is done using Wikifier [2], which returns a list of Wikipedia concepts that are most likely related to the textual input. The web service also supports cross- and multi-linguality, which enables extracting and annotating materials in different languages.

Wikifier's input text is limited to 20k characters, because of which longer text cannot be processed as a whole. We split longer text into chunks of at most 10k characters and pass them to Wikifier. Here we are careful not to split the text in the middle of a sentence and, if that is not possible, to at least not split any words. We split the text as follows. First we take a 10k-character substring of the text. Next, we identify the last character in the substring that signifies the end of a sentence (a period, a question mark, or an exclamation point) and split at that character. If there is no such character we find the last whitespace in the substring and split there. In the extreme case where no whitespace is found we take the substring as is. The substring becomes one chunk of the original text. We repeat the process on the remaining text until it is fully split into chunks; a sketch of this splitting procedure is given below.
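The following is a minimal sketch of the splitting rule described above. The 10k-character limit and the sentence-ending characters come from the text; the function and variable names are hypothetical and do not reflect the authors' actual implementation.

```python
def split_into_chunks(text, max_len=10_000):
    """Split text into chunks of at most max_len characters,
    preferring sentence boundaries, then spaces, then a hard cut."""
    sentence_ends = ".?!"
    chunks = []
    while text:
        if len(text) <= max_len:
            chunks.append(text)
            break
        window = text[:max_len]
        # prefer the last sentence-ending character in the window
        cut = max(window.rfind(c) for c in sentence_ends)
        if cut == -1:
            # otherwise split at the last space
            cut = window.rfind(" ")
        if cut == -1:
            # extreme case: no space found, take the window as is
            cut = max_len - 1
        chunks.append(text[:cut + 1])
        text = text[cut + 1:]
    return chunks
```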
When we pass these chunks into Wikifier, it returns the Wikipedia concepts related to the given chunk. These concepts also contain the Cosine similarity between the Wikipedia concept page and the given input text. To calculate the similarity between the concept and the whole material we aggregate the concepts by calculating the weighted sum

S_k = \sum_{i=1}^{n} \frac{L_i}{L} \, s_{k,i},

where S_k is the aggregated Cosine similarity of concept k, n is the number of chunks for which Wikifier returned concept k, L_i is the length of chunk i, L is the length of the material's raw text, and s_{k,i} is the Cosine similarity of concept k to chunk i. The weight L_i / L represents the presence of concept k, found in chunk i, in the whole material. The aggregated Wikipedia concepts are stored in the material's metadata attribute.

Data Set Statistics. In the final step, we validate the material attributes and store them in a database. The OER material data set consists of approximately 90k items. The distribution of materials over the four repositories is shown in figure 2.

[Figure 2: Number of materials per repository crawled, in logarithmic scale. Most materials come from MIT OpenCourseWare, followed by Videolectures.NET.]

Some of the repositories offer material in different languages. All repositories together cover 103 languages; however, for only 8 languages the count of available materials is larger than 100. The distribution of items over languages is shown in figure 3, where we only show languages with more than 100 items available. Most of the material is in English, followed by Italian and Slovene. The "Unknown" column shows that for about 6k materials we were not able to extract the language. To acquire this information, we will improve the language extraction method in our preprocessing pipeline.

[Figure 3: Number of materials per language, in logarithmic scale. Most of the material is in English, followed by Italian and Slovenian.]

As shown before, the preprocessing pipeline is designed to handle different types of material - text, video and audio. Each type can be represented in various file formats, such as pdf and docx for text, wmv and mp4 for video, and mp3 for audio. We visualized the distribution of materials over file types in figure 4, but we only show types with more than 100 items available.

[Figure 4: Number of items per file type, in logarithmic scale. The dominant file type is text (pdf, pptx and docx), followed by video (mp4).]

As seen from the figure, the dominant file type is text (pdf, pptx and docx), followed by video (mp4). The msi file type is an installer package file format used by Windows, but it can also be a textual document or a presentation. If we generalize the file type distribution over all OER repositories we can conclude that the dominant file type is text. This will be taken into account when improving the preprocessing pipeline and recommendation engine.

4 RECOMMENDER ENGINE

There are different ways of creating recommendations. Some employ users' interests while others are based on collaborative filtering. In this section we present our content based recommendation engine, which uses the k-nearest neighbor algorithm [13]. What follows are descriptions of how the model generates recommendations based on the user's input, which can be either the identifier of the OER in the database or a query text.

Material identifier. When the engine receives the material identifier (in our case the url of the material) we first check if the material is in our database. If present, we search for the k most similar materials to the one with the given identifier based on the Wikipedia concepts. Each material is represented by a vector of its Wikipedia concepts, where each value is the aggregated Cosine similarity of the corresponding Wikipedia concept page to the material. By calculating the Cosine similarity between the materials the engine then selects the k materials with the highest similarity score and returns them to the user. Because of the nature of Wikipedia concepts this approach returns materials written in different languages, which helps overcome the language barrier.

Query text. When the engine receives a query text we search for materials with the most similar raw text using the bag-of-words model. Each material is represented as a bag-of-words vector where each value of the vector is the tf-idf of the corresponding word. The materials are then compared using the Cosine similarity and the engine again returns the k materials that have the highest similarity score. This approach is simple but it is unable to handle multilingual documents. This might be overcome by first sending the query text to Wikifier to get its associated Wikipedia concepts and using them in a similar way as described in the Material identifier approach. A sketch of the concept-based nearest-neighbour search is given below.
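As a concrete illustration of the concept-based nearest-neighbour search described above, the sketch below compares materials through the cosine similarity of their Wikipedia-concept vectors. It is a minimal, self-contained approximation under the assumption that each material is stored as a dict mapping concept names to aggregated scores; the actual engine is built on the QMiner platform, and all names here (recommend, concept_vectors, etc.) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(material_id, concept_vectors, k=10):
    """Return the k materials whose Wikipedia-concept vectors are most
    similar to the vector of the given material (excluding itself)."""
    query = concept_vectors[material_id]
    scored = [
        (cosine(query, vec), other)
        for other, vec in concept_vectors.items()
        if other != material_id
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Toy collection of materials annotated with aggregated concept scores.
concept_vectors = {
    "lecture/deep-learning": {"Deep learning": 0.9, "Neural network": 0.7},
    "pdf/backprop-notes":    {"Neural network": 0.8, "Gradient descent": 0.6},
    "video/intro-biology":   {"Cell (biology)": 0.9},
}
print(recommend("lecture/deep-learning", concept_vectors, k=2))
```

Because the concept vectors are language-independent, the same comparison works across materials written in different languages, which is the property the engine relies on.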
7 SiKDD’18, October 2018, Ljubljana, Slovenia Erik Novak, Jasna Urbančič, and Miha Jenko 4.1 Recommendation Results In the future we will evaluate the current recommendation en- The described recommender engine is developed using the QMiner gine and use it to compare it with other state-of-the-art. We intend platform [9] and is available at [14]. When the user inputs a text to use A/B testing to optimize the models based on the user’s inter-query the system returns recommendations similar to the given action with them. We wish to improve the engine by collecting user text. These are shown as a list where each item contains the title, url, activity data to determine what materials are liked by the users, description, provider, language and type of the material. Clicking explore different deep learning methods to improve results, and on an item redirects the user to the selected OER. develop new representations and embeddings of the materials. We have also discussed with different OER repository owners We also aim to improve the preprocessing pipeline by improving and found that they would be interested in having the recommen- text extraction methods, handle missing material attributes, and dations in their portal. To this end, we have developed a compact adding new feature extraction methods to determine the topic and recommendation list which can be embedded in a website. The rec- scientific field of the educational material as well as their quality. ommendations are generated by providing the material identifier or raw text as query parameters in the embedding url. Figure 5 shows ACKNOWLEDGMENTS the embed-ready recommendation list. This work was supported by the Slovenian Research Agency and X5GON European Unions Horizon 2020 project under grant agree- ment No 761758. REFERENCES [1] David Bashford. 2018. GitHub - dbashford/textract: node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more! https://github.com/dbashford/textract. Accessed: 2018-09-03. [2] Janez Brank, Gregor Leban, and Marko Grobelnik. 2017. Annotating documents with relevant Wikipedia concepts. Proceedings of SiKDD. [3] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198. [4] Université de Nantes. 2018. Plate-forme d’Enseignement de l’Université de Nantes. http://madoc.univ-nantes.fr/. Accessed: 2018-09-03. [5] Universitat Politècnica de València. 2016. media UPV. https://media.upv.es/#/ portal. Accessed: 2018-09-03. [6] Miguel Ángel del Agua, Adrià Martínez-Villaronga, Santiago Piqueras, Adrià Giménez, Alberto Sanchis, Jorge Civera, and Alfons Juan. 2015. The MLLP ASR Systems for IWSLT 2015. In Proc. of 12th Intl. Workshop on Spoken Language Translation (IWSLT 2015). Da Nang (Vietnam), 39–44. http://workshop2015.iwslt. org/64.php [7] Università di Bologna. 2018. Universita di Bologna. https://www.unibo.it/it. Accessed: 2018-09-03. [8] Surabhi Dwivedi and VS Kumari Roshni. 2017. Recommender system for big data in education. In E-Learning & E-Learning Technologies (ELELTECH), 2017 5th National Conference on. IEEE, 1–4. [9] Blaz Fortuna, J Rupnik, J Brank, C Fortuna, V Jovanoski, M Karlovcec, B Kazic, K Kenda, G Leban, A Muhic, et al. 2014. » QMiner: Data Analytics Platform for Processing Streams of Structured and Unstructured Data «, Software Engineering for Machine Learning Workshop. In Neural Information Processing Systems. 
[10] Carlos A Gomez-Uribe and Neil Hunt. 2016. The netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Figure 5: An example of recommended materials for the lec- Information Systems (TMIS) 6, 4 (2016), 13. ture with the title “Is Deep Learning the New 42?” published [11] Erik Novak and Inna Novalija. 2017. Connecting Professional Skill Demand with Supply. Proceedings of SiKDD. on Videolectures.NET [19]. The figure shows cross-lingual, [12] Massachusetts Institute of Technology. 2018. MIT OpenCourseWare | Free Online cross-modal, and cross-site recommendations. Course Materials. https://ocw.mit.edu/index.htm. Accessed: 2018-09-03. [13] Leif E Peterson. 2009. K-nearest neighbor. Scholarpedia 4, 2 (2009), 1883. [14] X5GON Project. 2018. X5GON Platform. https://platform.x5gon.org/search. The recommendation list consists of the top 100 materials based Accessed: 2018-09-04. [15] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and on the query input. As shown in the figure the recommendation global algorithms for disambiguation to wikipedia. In Proceedings of the 49th contain materials of different types, are provided by different reposi- Annual Meeting of the Association for Computational Linguistics: Human Language tories and written in different languages. We have not yet evaluated Technologies-Volume 1. Association for Computational Linguistics, 1375–1384. [16] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based the recommendation engine but we intend to do it in the future. collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. ACM, 285–295. 5 FUTURE WORK AND CONCLUSION [17] Speechpad. 2018. DFXP (Distribution Format Exchange Profile) | Speechpad. https://www.speechpad.com/captions/dfxp. Accessed: 2018-09-04. In this paper we present the methodology for processing multi- [18] transLectures. 2018. transLectures | transcription and translation of video lectures. modal items and creating a semantic space in which we can compare http://www.translectures.eu/. Accessed: 2018-09-03. [19] VideoLectures.NET. 2018. Is Deep Learning the New 42? - Videolectures.NET. these items. We acquired a moderately large open educational re- http://videolectures.net/kdd2016_broder_deep_learning/. Accessed: 2018-09-03. sources data set, created a semantic space with the use of Wikipedia [20] VideoLectures.NET. 2018. VideoLectures.NET - VideoLectures.NET. http:// videolectures.net/. Accessed: 2018-09-03. concepts and developed a basic content based recommendation en- [21] Wikipedia. 2018. Open educational resources - Wikipedia. https://en.wikipedia. gine. org/wiki/Open_educational_resources. Accessed: 2018-09-03. 8 TOWARDS SMART STATISTICS IN LABOUR MARKET DOMAIN Inna Novalija Marko Grobelnik Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39, Ljubljana, Slovenia Jamova cesta 39, Ljubljana, Slovenia inna.koval@ijs.si marko.grobelnik@ijs.si ABSTRACT respect to defined scenarios – demand analysis, skills ontology development and skills ontology evolution. In this paper, we present a proposal for developing smart labour market statistics based on streams of enriched textual data and illustrate its application on job vacancies from European 2. BACKGROUND countries. 
We define smart statistics scenarios including demand The development of smart labour market statistics touches a analysis scenario, skills ontology development scenario and skills number of issues from labour market policies area and would ontology evolution scenario. We identify stakeholders – provide contributions to questions related to: consumers for smart statistics and define the initial set of smart - job creation, labour market statistical indicators. - education and training systems, - labour market segmentation, KEYWORDS - improving skill supply and productivity. Smart statistics, labour market, demand analysis. For instance, the analysis of the available job vacancies could offer an insight into what skills are required in the particular area. Effective trainings based on skills demand could be organized and 1. INTRODUCTION that would lead into better labour market integration. A number of stakeholder types will benefit from the development An essential feature of modern economy is the appearance of new of smart labour market statistics. In particular, the targeted skills, such as digital skills. For instance, e-skills lead to the stakeholders are: exponential increases in production and consumption of data. - Statisticians from National and European statistical offices While job profiles vary and are still in the process of being who are interested in the application of new technologies for defined, organizations agree that they need the new breed of production of the official statistics. workers. - Individual persons who are searching for new employment Accordingly, the European institutions take major initiatives opportunities. In particular, individuals are interested in the related to digitalization of labor market, training of new skills and job vacancies that are compatible with their current skills and meeting the labour demand. in the methods (like trainings) providing the possibilities to Historically, the labour market statisticians use standard measures obtain new skills in demand. of the labour demand and labour supply based on traditional - Public and private employment agencies interested in up-to- surveys – job vacancy surveys, wage survey, labour force surveys. date employees profiles. The unemployment rate provides information on the supply of - persons looking for work in excess of those who are currently Education and training institutions from different levels and employed. Data on employment provide information on the forms of education - general/vocational education, higher demand for workers that is already met by employers. education, public/private, initial/ adult education. Educational institutions are interested in relevant skills and The data-driven smart labour market statistics intends to: topics that should be part of the curriculum programs. - use the available historical job vacancies data, - Ministries of labour/manpower, economy/industry/trade, - use the available real-time job vacancies data, education, finance, etc. The policy makers, such as - use the available real-time and historical dataset of additional ministries, are interested in the overall labour market data (described below), situation, with respect to location and time, in the labour - align data sources, market segmentation and in the processes of improving - construct models and obtain novel smart labour market supply and productivity. indicators that will complement existing labour market - Standards development organizations. 
National or statistics, International organizations whose primary activities are - provide a system for delivering results to the users. developing, coordinating, promulgating, revising, amending, reissuing, interpreting, or otherwise producing technical The smart labour market statistics approach will combine standards that are intended to address the needs of some advanced data processing, modelling and visualization methods in order to develop trusted techniques for job vacancies analysis with 9 relatively wide base of affected adopters. Interested in new - Social media data, such as news, Twitter data that might be technologies developed in relation to labour market. relevant for labour market. - Academic and research institutes. Public and private entities - Labour supply data (based on user profile analysis). who conduct research in relevant areas. Research institutions Open job vacancies can be found using job search services. These are interested in the development of novel methodologies services aggregate job vacancies by location, sector, applicant and usage of appearing new data sources. qualifications and skill set or type. One such service is Adzuna [4], a search engine for job ads, which mostly covers English- 3. RELATED WORK speaking countries. The European Data Science Academy (EDSA) [1] was an H2020 For data acquisition and enrichment, dedicated APIs, including EU project that ran between February 2015 and January 2018. Adzuna API, are used, as well as custom web crawlers are The objective of the EDSA project was to deliver the learning developed. The data is formatted to JSON to aid further tools that are crucially needed to close the skill gap in Data processing and enrichment. The job vacancy dataset is obtained Science in the EU. The EDSA project has developed a virtuous with respect to trust and privacy regulations, the personal data is learning production cycle for Data Science, and has: not collected. - Analyzed the sector specific skillsets for data analysts across Job vacancies usually contain the information, such as job Europe with results reflected at EDSA demand and supply position title, job description, company and job location. In such dashboard; way, job vacations that are constantly crawled/web-scraped - Developed modular and adaptable curricula to meet these present a data stream. The job title and job description are textual data science needs; and data that contain information about skills that employee should - have. Delivered training supported by multiplatform resources, introducing Learning pathway mechanism that enables On the obtained data wikification - identifying and linking textual effective online training. components (including skills) to the corresponding Wikipedia EDSA project established a pipeline for job vacancy collecting pages [5] is performed. This is done using Wikifier [6], which and analysis that will be reused for the purpose of smart statistics. also supports cross and multi-linguality enabling extraction and annotation of relevant information from job vacancies in different An ontology called SARO (Skills and Recruitment Ontology) [2] languages. The data is tagged with concepts from GeoNames has been developed to capture important terms and relationships ontology [7]. To job postings where latitude and longitude have to facilitate the skills analysis. SARO ontology concepts included been available, GeoNames location uri and location name are relevant classes to job vacancy datasets, such as Skill and added. 
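To make the enrichment step above concrete, the sketch below annotates a single job-posting description with the JSI Wikifier web service mentioned in the text. It assumes the publicly documented annotate-article endpoint, a valid userKey, and that the JSON response contains an annotations list with title and url fields; treat these details as assumptions rather than a definitive client.

```python
import requests

WIKIFIER_URL = "http://www.wikifier.org/annotate-article"  # assumed public endpoint

def wikify(text, lang="en", user_key="YOUR_USER_KEY"):
    """Send text to JSI Wikifier and return the linked Wikipedia concepts."""
    response = requests.post(WIKIFIER_URL, data={
        "userKey": user_key,  # obtained by registering at wikifier.org
        "text": text,
        "lang": lang,
    })
    response.raise_for_status()
    annotations = response.json().get("annotations", [])
    # keep only the concept titles and their Wikipedia URLs
    return [(a.get("title"), a.get("url")) for a in annotations]

vacancy = "We are looking for a data scientist with Python and TensorFlow experience."
print(wikify(vacancy))
```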
To the postings where only location name has been JobPosting. Examples of instances of class Skill would be skills, available, the coordinates and location uri are added. such as “Data analysis”, “Java programming language” et al. The job vacancy data representation level depends on the specific ESCO [3] is the multilingual classification of European Skills, country. For the United Kingdom, France, Germany and the Competences, Qualifications and Occupations. It identifies and Netherlands there is a substantial collection of job vacancies in categorizes skills/competences, qualifications and occupations the area of digital technologies. relevant for the EU labour market and education and training, in 25 European languages. The system provides occupational 4.2 CONCEPTUAL ARCHITECTURE profiles showing the relationships between occupations, The labour market statistics conceptual structure is built upon the skills/competences and qualifications. For instance, one example following major blocks: of existing ESCO skill is “JavaScript” (with alternative labels “Client-side JavaScript”, "JavaScript 1.7" et al.). 1. Data sources related to different aspects of smart labour market. The main data source aggregates historical and current job Both SARO and ESCO ontologies are useful for the aim of smart vacancies in the area of digital technologies and data science statistics, in particular for skills ontology development and skills around Europe. ontology evolution scenarios. However, the ontologies usually are manually manipulated, and the methods developed for smart 2. Modelling smart labour market statistics takes central part of labour market statistics should overcome the difficulties related to the smart labour market statistics approach, where the goal is to this issue. The ontology evolution scenario of smart labour market construct models based on different data sources, updated in statistics envisions automatic identification of emerging and business-real-time (as needed or as data sources allow). Models decreasing skills from the data perspective. shall bring understanding of the smart labour market statistics domain and shall be used for aggregation, ontology development and ontology evolution. 4. PROBLEM DEFINITION 3. Targeted users are smart statistics consumers. There are several 4.1 DATA SOURCES major groups of users (described above). The example users might include statisticians, policy makers, individual users (residents The main data sources available for the development of smart and non-residents), training and educational organizations and labour market statistics are historical and current data about job other. vacancies in the area of digital technologies and data science around Europe (~5.000.000 job vacancies 2015-2018). 4. Finally, applications of smart labour market statistics are multiscale - they can be presented at cross-country level (around Additional data sources may include: 10 Europe) country level (UK, France, the Netherlands etc.), relationships, and other distinctions that are relevant for modeling city/area level and conceptual level (ontology). a domain. The specification takes the form of the definitions of Figure 1 illustrates the conceptual architecture diagram for smart representational vocabulary (classes, relations, and so forth), labour market statistics. which provide meanings for the vocabulary and formal constraints on its coherent use. 
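As a concrete illustration of the demand analysis scenario described above, the following sketch aggregates wikified and geotagged job vacancies into a simple skill-demand indicator per country and month. It is a toy example over an assumed record layout (skills, country and date fields), not the project's implementation.

```python
from collections import Counter
from datetime import date

# Toy records in the shape produced by crawling + wikification + geotagging.
vacancies = [
    {"skills": ["Python", "TensorFlow"], "country": "UK", "date": date(2018, 6, 3)},
    {"skills": ["Java"],                 "country": "FR", "date": date(2018, 6, 12)},
    {"skills": ["Python"],               "country": "UK", "date": date(2018, 7, 1)},
]

def skill_demand(vacancies):
    """Count skill mentions per (skill, country, year-month) bucket."""
    counts = Counter()
    for v in vacancies:
        month = v["date"].strftime("%Y-%m")
        for skill in v["skills"]:
            counts[(skill, v["country"], month)] += 1
    return counts

for (skill, country, month), n in sorted(skill_demand(vacancies).items()):
    print(f"{month} {country}: {skill} appears in {n} vacancies")
```

The same aggregation generalises to the multiscale indicators listed later (cross-country, country and city level) simply by changing the grouping key.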
Figure 1: Conceptual Architecture The key characteristics of the development techniques will include: - Interpretability and transparency of the models – the aim is, for a model to be able to explain its decision in a human readable manner (vs. black box models, which provide results without explanation). - Non-stationary modelling techniques are required due to changing data and its statistical properties in time. For instance, the ontology evolution process will be modeled taking to the account the incremental data arriving to the system. - Multi-resolution nature of the models, having the property to observe the structure of a model on multiple levels of granularity, depending on the application needs. - Scalability for building models is required due to the nature of incoming data streams. 4.3 SCENARIOS The smart labour market statistics proposal includes three scenarios - demand analysis scenario, ontology development Figure 2: Example of Job Vacancies Crawled and scenario and ontology evolution scenario described below. Processed 4.3.1 DEMAND ANALYSIS Ontologies are often manually developed and maintained, what Demand analysis scenario suggests production of statistical requires a sufficient user efforts. indicators based on the available job vacancies using techniques In the ontology development scenario an automatic (or semi- for data preprocessing, semantic annotation, cross-linguality, automatic) bottom-up process of creating ontology from available location identification and aggregation. job vacancies will be suggested. Job vacancies in structural and semi-structural form are the input The relevant skills (extracted from the job vacancies) will be to into the system, while statistics related to overall job demand, defined and formalized. Using semantic annotation and cross- job demand with respect to particular location, job demand with linguality techniques for skills extraction based on JSI Wikifier respect to particular skill (skill demand) and time frame are the tool [6] will enable the possibility of including the newest outputs of the system. available skills “on the market” that are not yet captured in the Figure 2 presents an example of crawled and processed job ontologies, taxonomies and classifications that are manually vacancies. developed. The input to the ontology development scenario is a set of job vacancies and the output is ontology of skills presenting 4.3.2 SKILLS ONTOLOGY DEVELOPMENT the domain structure that can be compared to or used for official Ontologies reduce the amount of information overload in the classifications. working process by encoding the structure of a specific domain and offering easier access to the information for the users. Gruber 4.3.3 SKILLS ONTOLOGY EVOLUTION [8] states that an ontology defines (specifies) the concepts, Ontology Evolution is the timely adaptation of an ontology to the arisen changes and the consistent propagation of these 11 changes to dependent artefacts [9]. Ontology evolution is a - Ontology evolution statistics. Example: emerging skills in process that combines a set of technical and managerial activities the ontology in the last 3 months and ensures that the ontology continues to meet organizational Since the data has a streaming nature, different kinds of multiscale objectives and users’ needs in an efficient and effective way. and aggregation options can be handled with respect to time Ontology management is the whole set of methods and techniques parameters. 
that is necessary to efficiently use multiple variants of ontologies from possibly different sources for different tasks [10]. Scenario 3 will suggest an automatic (or semi-automatic) ontology 6. CONCLUSION AND FUTURE WORK evolution process based on the real-time job vacancy stream. With In this paper, we presented a proposal for developing smart labour respect to the nature of job vacancy data stream and skills market statistics based on streams of enriched textual data, such as extracted from job it will be possible to see the dynamics of job vacancies from European countries. We define smart statistics evolving skills – when the new skills (not included into the scenarios, such as demand analysis scenario, skills ontology current ontology versions appear) and how the skills ontology is development scenario and skills ontology evolution scenario. The changing with time. future work would include the implementation of the smart labour In particular, it could be possible to observe appearing new skills market scenarios, quality assessment and evaluation of the and suggest them for inclusion into official skills classifications. produced statistical outcomes. In addition, it could be visible how fast the ontology changes, which could be the indicator of the technological progress on the 7. ACKNOWLEDGMENTS relevant market. This work was supported by the Slovenian Research For instance, the current version of ESCO classification does not Agency and EDSA European Union Horizon 2020 project contain “TensorFlow” skill (TensorFlow [11] is an open-source under grant agreement No 64393. software library for dataflow programming across a range of tasks, appeared in 2015). TensorFlow, which is already present in job vacancies, could be captured during ontology evolution process 8. REFERENCES and suggested as a new concept for official classifications. [1] EDSA, http://edsa-project.eu (accessed in August, 2018). [2] Sibarani, Elisa & Scerri, Simon & Mousavi, Najmeh & Auer, Sören. (2016). Ontology-based Skills Demand and Trend 5. STATISTICAL INDICATORS Analysis. 10.13140/RG.2.1.3452.8249. Traditionally the indicators related to labour market have been [3] ESCO taxonomy, https://ec.europa.eu/esco/portal (accessed in based on survey responses. The smart labour market statistics August 2018). proposal introduces a possibility to complement standard statistical indicators, such as job vacancy rate with novel “data [4] Adzuna developer page, inspired” knowledge. https://developer.adzuna.com/overview (accessed in August, 2018). The smart labour market statistics indicators use data sources, previously not covered by official statistics, and in such way [5] Ratinov, L., Roth, D., Downey, D. and Anderson, M. Local complementary to traditional data sources. The smart labour and global algorithms for disambiguation to wikipedia. In market statistics indicators are based on real-time data streams, Proceedings of the 49th Annual Meeting of the Association for which makes possible to obtain not only historical, but also Computational Linguistics: Human Language Technologies- current values for job vacancies that could be used for different Volume 1, pages 1375–1384. Association for Computational purposes, such as nowcasting. In addition, the smart labour Linguistics, 2011. market statistics indicators take into the account data cross-lingual [6] JSI Wikifier, http://wikifier.org (accessed in May, 2018). 
and multi-lingual nature of streaming data and can be produced at the multiscale levels – cross-country, country, city (area) levels. [7] GeoNames ontology, http://www.geonames.org/ontology/documentation.html (accessed The scenarios described above would result into a number of in August, 2018). smart labour market indicators with multiscale options. In particular: [8] Ontology (by Tom Gruber), - http://tomgruber.org/writing/ontology-definition-2007.htm Up-to date job vacancies statistics on a cross- (accessed in August, 2018). country/country/city(area) level. Example: job vacancies in UK and France in the last month [9] M. Klein and D. Fensel, Ontology versioning for the Semantic - Web, Proc. International Semantic Web Working Symposium Up-to date skills statistics on a cross- (SWWS), USA, 2001 country/country/city(area) level. Example: top 10 skills in UK in the last month [10] L. Stojanovic, B. Motik, Ontology evolution with ontology, - in: EKAW02 Workshop on Evaluation of Ontology-based Tools Up-to date location statistics. Example: top locations for (EON2002), CEUR Workshop Proceedings, Sigüenza, vol. 62, specific skill 2002, pp. 53–62 - Ontology development statistics. Example: number of [11] TensorFlow, https://en.wikipedia.org/wiki/TensorFlow concepts in the ontology (accessed in August, 2018). 12 Relation Tracker - tracking the main entities and their relations through time M. Besher Massri Inna Novalija Marko Grobelnik Jožef Stefan Institute Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39, Ljubljana, Slovenia Jamova cesta 39, Ljubljana, Slovenia Jamova cesta 39, Ljubljana, Slovenia besher.massri@ijs.si inna.koval@ijs.si marko.grobelnik@ijs.si ABSTRACT contextual information provided as characteristic keywords, for a In this paper, we present Relation Tracker, a tool that tracks main quick detection of information from the original articles. entities [people and organizations] within each topic through time. The main types of relations between the entities are detected Regarding classifying news, we observe in [3] a new technique and observed in time. The tool provides multiple ways of that uses Deep Learning to increase the accuracy of prediction of visualizing this information with different scales and durations. online news popularity. The tool uses events data from Event Registry as a source of In the paper explaining Event Registry [1], we see how articles information, with the aim of getting holistic insights about the from different languages are grouped into events and the main searched topic. information and characteristics about them are extracted. Additionally, a graphical interface is implemented which allows search for events and visualize the results in multiple ways that KEYWORDS together give a holistic view about events. Information Retrieval, Visualization, Event Registry, Wikifier, Dmoz Taxonomy This work begins with the events as a starting point, and it is one more step on the same path; it groups events further into topics 1. INTRODUCTION and trends, then it focuses on tracking how some entities are Every day, tremendous amounts of news and information are appearing as main entities regarding the selected topic, and how being streamed throughout the Internet, which is requiring the the relationship between them is changing through time. implementation of more tools to aggregate this information. With technology advancement, those tools have been increasing in complexity and options provided. However, there has been a 3. 
DESCRIPTION OF DATA demand for tools that give simple yet holistic summary of the We used part of the events from Event Registry as our main searched topic in order to acquire general insights about it. source of data. We obtained a dataset of ~ 1.8 million events as a list of JSON files, with event’s dates between Jan 2015 and July Hence, we provide the Relation Tracker tool that tries to achieve 2016. Each event consists of general information like title, event this goal; it is based on the data from Event Registry [1], which is date, total article count, etc., and a list of concepts that a system for real-time collection, annotation and analysis of characterize the event, which is split into entity concepts and non- content published by global news outlets. The tool presented in entity concepts. Entity concepts are people, organizations, and this paper takes the events and groups them into topics, and locations related to the event. Whereas non-entity concepts within each topic, it provides an interactive graph that shows the represent abstract terms that define the topic of the event, like main entities of each topic at each time and the main topic of technology, education, and investment. Those concepts were relations between those entities. In addition, a summary extracted using JSI Wikifier [4] which is a service that enables information about entities and their relationship is visualized semantic annotation of the textual data in different languages. In through different graphs to help understand more about the topic. addition, each concept has a score that represents the relevancy of that concept to the event. The remainder of this paper is structured as follows. In section 2, we show the related work done in this area. In section 3, we provide a description of the used data. Section 4 explains the 4. METHODOLOGY methodology and main challenges that were involved in this work. Next, we explain the visualization features of the tool in section 5. Finally, we conclude the paper and discuss potential future work. 4.1 Clustering and Formatting Data To group the events into topics, we used K-Means clustering 2. RELATED WORK algorithm, where each event is represented as a sparse vector of the non-entity concepts it has, with the weights equal to their Similar works have been done in the area of visualizing scores in that event. The constant number of topics is set information extracted from news. We see in [2] a tool for efficient experimentally to be 100 clusters, in a balance between mixed visualization of large amount of articles as a graph of connected clusters and repeated clusters. Each cluster describes a set of entities extracted from articles, enriched with additional events that fall under the same topic, whereas the centroid vector of each cluster represents the main characteristics of it. To name 13 the clusters, we used category classifier service from Event 4.3 Detecting the Characteristics of Registry, which uses Dmoz Taxonomy [5], a multilingual open- content directory of World Wide Web links, that is used to Relationship classify texts and webpages into different categories; for each The main goal was to model the relationship between any two cluster, we formed a text consisting of the components of its entities through a vector of words where two entities are centroid vector, taking into account their weights within the collocated. Since the relationship between two entities at any vector. 
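A minimal sketch of this clustering and naming step is given below. It assumes each event is available as a dictionary mapping its non-entity concepts to their relevance scores; the field layout and the hand-off to the Event Registry category classifier are assumptions for illustration, not the authors' implementation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

def cluster_events(events, n_topics=100, top_k=10):
    """Group events into topics and build a naming text per cluster.

    `events` is assumed to be a list of dicts mapping non-entity concept
    labels to their scores, e.g. {"Technology": 0.8, "Investment": 0.3}.
    """
    vectorizer = DictVectorizer(sparse=True)
    X = vectorizer.fit_transform(events)          # sparse event-by-concept matrix

    km = KMeans(n_clusters=n_topics, random_state=0)
    labels = km.fit_predict(X)

    concept_names = vectorizer.get_feature_names_out()
    naming_texts = []
    for centroid in km.cluster_centers_:
        # take the highest-weighted centroid components as the cluster description
        top = centroid.argsort()[::-1][:top_k]
        words = []
        for idx in top:
            # repeat a concept proportionally to its weight so the downstream
            # classifier sees it more often (one way to "take weights into account")
            words.extend([concept_names[idx]] * max(1, int(round(10 * centroid[idx]))))
        naming_texts.append(" ".join(words))
    return labels, naming_texts

# Each naming text would then be sent to the Event Registry category classifier
# (Dmoz-based) to obtain a human-readable topic name for the cluster.
```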
The resulted cluster names were ranged from technology given time is based on the shared events between them, and each and business to refugees and society, and clusters were exported event is characterized by a set of concepts, we decided on using as a JSON file for processing them in the visualization part. those concepts - specifically the abstract or the non-entity concepts - to characterize such relationships. For each pair, we 4.2 Choosing the Main Entities aggregated all the non-entity concepts from the shared events between them, and each one of them was assigned a value based Under any topic, the top entities at each duration of time has to be on the number of events it is mentioned in and its score in those chosen. At first, the concepts were filtered from outliers like events. Those concepts were sorted and ranked depending on their publishers and news agencies. Then, an initial importance value values, and the top ones were chosen as the main features of the has been set for each concept based on two parameters: the TF- relationship. In addition, these values of the concepts were used to IDF score of concept with respect to each event, and the number rank the shared events and extract the top ones; by giving each of articles each event contains. If we denote the set of events that event a value equal to the aggregated values (the ones calculated occur in the interval of time D by ED, the number of articles that in previous step) of all non-entity concepts it has. To summarize event e contains is Ae, the TF-IDF score of concept c at event e by the set of characteristics, we classified them using Dmoz category Sc,e, then the importance value of each item with respect to the classifier in a similar way to what we have done in determining interval D is calculated by the formula: the names of the clusters. These categories were used to label the relationship between the entities, indicating the main topic of the shared events between them. 5. VISUALIZING THE RESULTS To access a topic, a search bar is provided to select among the list The TF-IDF function is used to give importance to the concept of extracted topics from clustering step. Once the user selects a based on its relevance to the events, and the number of articles is topic, a default date is chosen and a network graph is shown used to give more importance to the events that have more articles explaining the topic. talking about it, and hence, more importance to the concepts that it has. We decided on using the product of summation rather than 5.1 Characteristics of the Main Graph summation of product because of its computation efficiency while Since the tool’s main goal is to show the top entities and their still producing good results. However, to prevent the case where relations, the network graph is the best choice for this matter. all the chosen entities get nominated because of one or two big Following that, we have built an interactive network graph that events (which results in a bias towards those few events), a has the following features: modification to the importance value formula has been made by - The main entities within that topic at the selected interval introducing another parameter, which is the links between of time are represented by the vertices of the graph. concepts (whenever two concepts occur in the same event, there is - The size of the vertices reflects the importance value of a link between them). 
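The relationship characterization described in Section 4.3 above could be sketched roughly as follows; the exact way a concept's value combines the number of mentioning events with its scores is only described qualitatively, so the weighting below is an assumption.

```python
from collections import defaultdict

def relationship_features(shared_events, top_k=10):
    """Characterize the relation between two entities from their shared events.

    `shared_events` is assumed to be a list of events, each a dict with a
    "concepts" entry mapping non-entity concept labels to relevance scores.
    """
    concept_score = defaultdict(float)   # summed scores over shared events
    concept_events = defaultdict(int)    # number of shared events mentioning it
    for event in shared_events:
        for concept, score in event["concepts"].items():
            concept_score[concept] += score
            concept_events[concept] += 1

    # concept value = summed score weighted by event count (assumed combination)
    value = {c: concept_score[c] * concept_events[c] for c in concept_score}
    top_concepts = sorted(value, key=value.get, reverse=True)[:top_k]

    # rank the shared events by the aggregated values of the concepts they contain
    def event_value(event):
        return sum(value[c] for c in event["concepts"])

    top_events = sorted(shared_events, key=event_value, reverse=True)
    return top_concepts, top_events
```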
Each concept now affects negatively the each entity, scaled to a suitable ratio to fit in the canvas. other concepts it is linked to by an amount equal to the initial - The colors represent the type of the entity, whether it is a importance value divided by the number of neighbors. If we person [red] or an organization [blue]. denote the set of neighbors of concept c during the interval of - The links between the entities represent the existence of time D by Nc,D, then the negative importance value is defined by: shared events in that interval of time between them under that topic, and hence indicating some form of relations. The thickness of the links is proportional to the number of shared events, whereas the labels are the ones calculated in previous section. Figure 1 presents top companies with relevant relations in July The final score is just the initial importance value minus the 2015 found among business news. negative importance value, which is then used to sort and nominate the top entities. 14 Figure 1: Top companies in July 2015 and their relations Figure 3: The changes in top entities under the same topic under the business topic. after moving the interval for 15 days. 5.2 Main Functionality 5.3 Displaying Relation Information As the tool is concerned about tracking the changes with time. Whenever the user selects a pair of entities, detailed information The graph is supported with a slide bar that allows the user to about their relationship in the selected interval of time is given, choose from the dates where there is at least one event occurred such as the number of shared events and articles, along with the with respect to the selected topic. Different scales for moving top events both concepts were mentioned in. Also, the top shared dates are also provided; the user can choose to move day by day, characteristics that shape the relationship between them at this week by week, or month by month and see the changes period is shown and sorted by percentage of importance. As seen accordingly. In addition, the user can choose a specific interval of in Figure 4; when selecting Jeff Bezos and Elon Musk under the time, and track how the entities and their relations are changing space topic between January and September 2015, we see a list of when the interval moves slightly with respect to its length. An the top events that involve both of them during this period. We interval magnifier is also given if the user wants to get a closer see also that the relationship between them is mainly about look at the changes that happen in a small interval. sending astronauts by rockets to the international space station, as it can be understood from the top shared characteristics. An example illustrating that can be seen in Figures 2 and 3. In Figure 2, we see the top 10 entities under the refugee topic in the last two months of 2015. When the interval is moved by 15 days, we notice that some of the entities disappear, like European Commission, indicating that they are no longer among the top 10 entities, whereas “United States House of Representitive” entity emerges and connects to “Barack Obama” and “Repulican Party”. The change in size indicates the change in the importance value of each one, while Society is the general theme among all labels. Figure 4: Relationship summary about Jeff Bezos and Elon Musk between January and September 2015 under the Space Figure 2: Top entities for the last two months of 2015 under topic. the refugee topic. 
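The two scoring formulas referenced in Section 4.2 did not survive the text extraction, so the sketch below encodes one plausible reading of the surrounding description: the initial importance of a concept is the product of its summed TF-IDF scores and the summed article counts over the events in the interval ("product of summation"), and each concept then penalizes every linked neighbour by its own initial value divided by its number of neighbours; the final score is the initial value minus the accumulated penalty. Treat the exact arithmetic as an assumption.

```python
from collections import defaultdict

def entity_scores(events_in_interval):
    """Score candidate entities for one time interval (hypothetical reconstruction).

    Each event is assumed to look like:
      {"articles": 12, "entities": {"Elon Musk": 0.7, ...}}
    where the entity values are TF-IDF scores of the concept in that event.
    """
    tfidf_sum = defaultdict(float)
    article_sum = defaultdict(float)
    neighbours = defaultdict(set)

    for event in events_in_interval:
        ents = event["entities"]
        for c, score in ents.items():
            tfidf_sum[c] += score
            article_sum[c] += event["articles"]
        # concepts co-occurring in the same event are linked
        for c in ents:
            neighbours[c].update(e for e in ents if e != c)

    # initial importance: "product of summation"
    initial = {c: tfidf_sum[c] * article_sum[c] for c in tfidf_sum}

    # negative importance: each concept spreads its initial value
    # equally over its neighbours as a penalty
    negative = defaultdict(float)
    for c, nbrs in neighbours.items():
        if nbrs:
            share = initial[c] / len(nbrs)
            for n in nbrs:
                negative[n] += share

    return {c: initial[c] - negative[c] for c in initial}
```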
15 To illustrate how the importance of those top features with respect 6. CONCLUSION AND FUTURE WORK to the relationship is changing through time, a stream graph is In this paper, we provide a tool that uses events data from Event used as shown in Figure 5. Registry to show the main entities within each topic, and how the characteristics of relationship among them is changing through time. However, there are a couple of limitation to the tool that we want to improve in the future. Although we were able to detect the characterestics of the relationship between entities and how they are changing through time, the main type of relation that we used to label the links were very broad and hence rarely changing- improving the methodology for relation extraction and observation of relations in time will be the subject of future work. In addition, we limited the search space for topics for the 100 topics we obtained from clustering, we would like to generalize the search by enabling searches for any concept or keyword with different options to filter the search. 7. ACKNOWLEDGMENTS Figure 5: Stream graph showing how the effect of the main features on the relationship between Jeff Bezos and Elon Musk This work was supported by the euBusinessGraph (ICT-732003- is changing through time. IA) project [6]. 8. REFERENCES Finally, the set of all characteristics that affect the relationship is visualized in a tag cloud to give a big picture about it. Figure 6 shows the tag cloud of the same relationship mentioned above. [1] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14 Companion). ACM, New York, NY, USA, 107-110. DOI: https://doi.org/10.1145/2567948.2577024 [2] Marko Grobelnik and Dunja Mladenić. 2004. Visualization of news articles. Informatica 28. [3] Sandeep Kaur and Navdeep Kaur Khiva. 2016. Online news classification using Deep Learning Technique. IRJET 03/10 (Oct 2016). [4] Janez Brank, Gregor Leban and Marko Grobelnik. 2017. Annotating documents with relevant Wikipedia concepts. In Figure 6: Tag cloud illustrating a general view about all the Proceedings of siKDD2017. Ljubljana, Slovenia. characteristics that affects the relationship between Jeff Bezos and Elon Musk under the space topic. [5] Dmoz, open directory project, http://dmoz-odp.org/ (accessed in July, 2018) [6] euBusinessGraph project, http://eubusinessgraph.eu/ (accessed in July, 2018). 16 Cross-lingual categorization of news articles Blaž Novak Jožef Stefan Institute Jamova 39 Ljubljana, Slovenia +386 1 477 3778 blaz.novak@ijs.si ABSTRACT categories. We consider each document belonging to all In this paper we describe the experiments and their results categories that are explicitly stated, and all of their parents. We performed with the purpose of creating a model for automatic will compare the performance of model predictions on the same categorization of news articles into the IPTC taxonomy. We show language and in the cross-lingual setting, where we train the that cross-lingual categorization is possible using no training data model on the entire dataset available for one language, and from the target language. We find that both logistic regression and measure its performance on the other language. support vector machines are good candidate models, while Basic features of the dataset can be seen in the following 2 random forests do not perform acceptably. 
Furthermore, we show figures. Figure 1 shows the distribution of number of articles in that using Wikipedia-derived annotations provides more each category, and Figure 2 shows that most categories contain a information about the target class than using generic word roughly even number of articles in both languages, but there are features. some outliers. We ignored categories with less than 15 examples per language, which resulted in 308 categories. General Terms Algorithms, Experimentation Keywords News, articles, categorization, IPTC, Wikifier, SVM, Logistic regression, Random forests. 1. INTRODUCTION The JSI Newsfeed [1] system ingests and processes approximately 350.000 news articles published daily around the world, in over 100 languages. The articles are automatically cleaned up and semantically annotated, and finally stored and made available for downstream consumers. One of the annotation tasks that we would like to perform in the future is to automatically categorize articles into the IPTC “Media Topics” subject taxonomy [2]. IPTC – the International Press Figure 1. Number of articles in each category. Discrete Telecommunications Council – provides a standardized taxonomy categories on x axis are ordered by descending number of articles. of roughly 1100 terms, arranged into a 5 level taxonomy, describing subject matters relating to daily news. The vocabulary is accessible in a machine readable format – RDF/XML and RDF/Turtle – at http://cv.iptc.org/newscodes/mediatopic. There are two relations linking concepts in the vocabulary – the ‘broader concept’ taxonomical relation, and a ‘related concept’ sibling relation. The ‘related concept’ links concepts both to other concepts from the same taxonomy, and directly to external Wikidata [3] entities. The purpose of this work is to evaluate multiple machine learning algorithms and multiple sets of features with which we could automatically perform the categorization. As we would like to categorize articles in all the languages the Newsfeed system supports, but we only have example articles in English and Figure 2. Language imbalance for each category. Discrete French, the method needs to be language independent. categories on x axis are ordered from “mostly English” to “mostly French”. 2. EXPERIMENTAL SETUP We compare three different machine learning models – random The dataset that we have access consists of 30364 English and forests, logistic regression (LR), and Support Vector Machines 29440 French articles, each of which is tagged with 1 to 10 (SVM). 17 We try two different types of features, and their combinations. significantly worse. “Wiki-W” denotes the weighted version of Wikifier annotations, and “Wiki-K” the combination of KCCA- The first kind of a feature set we use is a projection of the bag-of- derived features and Wikifier annotations. Every second line in words representation of the document text into a 500 dimensional vector space. The KCCA [4] method uses an aligned multi-lingual the table is the standard deviation of the result when averaged corpus to find such a mapping, that words with similar meanings across all categories. map to a similar vector, regardless of their language. We represent a document as a sum of all word vectors. Table 1. ROC scores by model and feature type, cross- The second set of features we use is the output of the JSI Wikifier validation [5] system. The Wikifier links each word in a document to a set of Rand. Forest Log. Reg. SVM Wikipedia pages that might represent the meaning of that word. 
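The first, KCCA-based feature set described above can be sketched as follows, assuming a dictionary of pre-trained 500-dimensional cross-lingual word vectors (learning the KCCA projection from the aligned corpus is done offline and is not shown).

```python
import numpy as np

def document_embedding(tokens, word_vectors, dim=500):
    """Represent a document as the sum of its word vectors.

    `word_vectors` is assumed to map a token (in any supported language)
    to its 500-dimensional KCCA-projected vector; unknown tokens are skipped.
    """
    vec = np.zeros(dim)
    for token in tokens:
        v = word_vectors.get(token)
        if v is not None:
            vec += v
    return vec
```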
For each such annotation, we also get a confidence weight. We consider these annotations as a classical vector space model -- as a bag-of-entities. We use two versions of the TF-IDF [7] scheme: in the first case, we use the number of times an entity annotation is present for any word in a document as the TF (term frequency) factor, and in the second version, we use the sum of annotation weights of an entity across the document. In both cases, we perform L1 normalization of the vector containing TF terms. For IDF terms, we use log(1 + N/n), where N is the number of all documents and n the number of documents in which an annotation was present at least once.

Finally, we use a combination of both KCCA-derived and Wikifier-derived features as the last feature set option.

For model training, we use Python's scikit-learn [6] software package. In the case of logistic regression, we use the L2 penalty, with automatic decision threshold fitting, using the liblinear library backend.

For the SVM model, we use a stochastic gradient descent optimizer. We performed a grid search for the optimal regularization constant C, but since there were no significant accuracy changes, we used the default of 1.0 in all other experiments.

For the random forest model, we used 4 different parameter combinations:
• default – 10 trees, splitting until only one class is in the leaf
• 30 trees, maximum tree depth of 10
• 50 trees, maximum tree depth of 10
• 30 trees, maximum tree depth of 20
In all cases, the GINI index was used as the node splitting criterion. Since the majority of categories only have a small number of documents, we automatically weighted training examples by the inverse of their class frequency. We also performed some experiments without this weighting scheme, but got useless models in all cases except for the couple of largest categories.

All reported results are the average of a 3-fold cross-validation. So far, we only created one-versus-all models for each category independently, and only used the taxonomy information of all categories to select all examples from sub-categories when training the more general category.

3. RESULTS
Table 1 shows ROC scores for cross-validation of all three models on four sets of feature combinations, for English and French separately. SVM and logistic regression are comparable in behavior and promising, while the random forest model performs significantly worse.

            Rand. Forest    Log. Reg.      SVM
            EN     FR       EN     FR      EN     FR
KCCA        0.75   0.71     0.96   0.95    0.95   0.94
(stdev)     0.11   0.11     0.04   0.04    0.05   0.04
Wiki        0.70   0.70     0.95   0.95    0.94   0.94
(stdev)     0.12   0.12     0.04   0.04    0.05   0.04
Wiki-W      0.71   0.71     0.95   0.95    0.94   0.94
(stdev)     0.12   0.11     0.04   0.04    0.05   0.04
Wiki+K      0.71   0.69     0.97   0.96    0.96   0.95
(stdev)     0.12   0.11     0.03   0.03    0.03   0.04

Looking at the feature selections, we see almost no significant difference -- both kinds of features, KCCA and Wikipedia annotations, have useful predictive value. The combination of both feature types slightly improves the ROC score.

Table 2 shows F1 cross-validation scores of all three models. Logistic regression scores much higher than SVM here, possibly indicating that the SVM model would benefit from a post-processing step of optimizing the decision threshold on a separate training set.

Table 2. F1 scores by model and feature type, cross-validation
            Rand. Forest    Log. Reg.      SVM
            EN     FR       EN     FR      EN     FR
KCCA        0.16   0.12     0.30   0.25    0.20   0.18
(stdev)     0.21   0.18     0.21   0.20    0.21   0.19
Wiki        0.07   0.07     0.41   0.44    0.25   0.29
(stdev)     0.15   0.15     0.21   0.21    0.22   0.22
Wiki-W      0.08   0.08     0.40   0.43    0.24   0.28
(stdev)     0.17   0.17     0.21   0.21    0.21   0.22
Wiki+K      0.09   0.07     0.44   0.46    0.27   0.30
(stdev)     0.16   0.15     0.21   0.21    0.22   0.22

The combination of both feature sets performs significantly better than either alone, with generic word-based features providing the least amount of information.

The feature usefulness changes when looking at cross-lingual classification performance. Table 3 shows the ROC score for all three models, when the model trained on English is used to predict categories of French articles, and vice versa. Decision trees give essentially a random result, and SVM scores somewhat higher than logistic regression.

Table 3. ROC scores - cross-lingual classification
            Rand. Forest    Log. Reg.      SVM
            EN     FR       EN     FR      EN     FR
KCCA        0.50   0.50     0.50   0.50    0.50   0.51
(stdev)     0.00   0.00     0.01   0.03    0.04   0.08
Wiki        0.51   0.51     0.76   0.80    0.81   0.84
(stdev)     0.04   0.04     0.12   0.11    0.11   0.10
Wiki-W      0.51   0.52     0.78   0.82    0.82   0.84
(stdev)     0.04   0.05     0.11   0.10    0.10   0.10
Wiki+K      0.50   0.50     0.57   0.70    0.66   0.81
(stdev)     0.01   0.01     0.10   0.13    0.14   0.12

The biggest change here is the influence of the KCCA cross-lingual word embedding: by itself it provides no informative value, as indicated by a ROC value of 0.5 in all cases, and it even reduces the performance of the combined Wikifier + KCCA model.

In Table 4, F1 scores from the same experiment are shown. Logistic regression still has a big advantage over SVM, as in the same-language categorization setting. The change from the previous experiments is the influence of the weighting of Wikipedia features -- it increases the performance of all models.

Table 4. F1 scores - cross-lingual classification
            Rand. Forest    Log. Reg.      SVM
            EN     FR       EN     FR      EN     FR
KCCA        0.00   0.00     0.00   0.01    0.00   0.02
(stdev)     0.02   0.02     0.02   0.06    0.01   0.06
Wiki        0.03   0.04     0.48   0.44    0.30   0.26
(stdev)     0.10   0.11     0.21   0.20    0.22   0.22
Wiki-W      0.03   0.05     0.49   0.44    0.29   0.26
(stdev)     0.11   0.13     0.20   0.21    0.22   0.22
Wiki+K      0.00   0.00     0.18   0.40    0.20   0.23
(stdev)     0.04   0.04     0.22   0.22    0.19   0.21

An interesting observation is that the performance of the cross-lingual model is occasionally higher than that of the baseline cross-validation experiment. This anomaly however disappears for categories with a large amount of positive training examples. It also disappears if we reduce the amount of training examples in the cross-lingual experiment by 1/3 – the effect seems to be caused by cross-validation reducing the training dataset size.

The KCCA cross-lingual word embedding feature generation used here was tested in other experiments and systems and gives a useful feature set for comparison of documents across languages, so its negative impact on the performance of these models needs to be investigated in the future.

As the weighted Wikipedia feature set appears to be the best for the stated goal of cross-lingual article categorization, the results of the next experiments are shown only for it; we performed the same experiments on all other combinations, and the results broadly follow the conclusions from the previous section.

The following figures show the correlation of cross-validation and cross-lingual performance of the logistic regression and SVM models. Both the F1 score and the area under the ROC curve are shown for each of the 308 categories in the experiment, since they provide complementary information.

Figure 3. F1 score correlation for logistic regression
Figure 4. F1 score correlation for SVM
Figure 5. ROC score correlation for logistic regression
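The experimental setup described above (bag-of-entities TF-IDF weighting with idf = log(1 + N/n), and one-versus-all scikit-learn models per category) could look roughly like the sketch below; the annotation format, the use of class_weight="balanced" as a stand-in for the inverse-class-frequency example weighting, and the concrete hyperparameters are assumptions.

```python
import numpy as np
from math import log
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import normalize

def entity_idf(all_doc_annotations):
    """idf(e) = log(1 + N / n_e), as in the setup above."""
    N = len(all_doc_annotations)
    counts = {}
    for ann in all_doc_annotations:
        for e in ann:
            counts[e] = counts.get(e, 0) + 1
    return {e: log(1 + N / n) for e, n in counts.items()}

def entity_features(doc_annotations, idf, entities):
    """Bag-of-entities vector: L1-normalised TF (counts or summed weights) times IDF."""
    tf = np.array([doc_annotations.get(e, 0.0) for e in entities], dtype=float)
    if tf.sum() > 0:
        tf = normalize(tf.reshape(1, -1), norm="l1")[0]
    return tf * np.array([idf[e] for e in entities])

def one_vs_all_models(X, y):
    """One binary model per category, trained on the same feature matrix."""
    return {
        "logreg": LogisticRegression(penalty="l2", solver="liblinear",
                                     class_weight="balanced").fit(X, y),
        "svm": SGDClassifier(loss="hinge", class_weight="balanced").fit(X, y),
        "rf": RandomForestClassifier(n_estimators=30, max_depth=10,
                                     criterion="gini",
                                     class_weight="balanced").fit(X, y),
    }
```

In this sketch, per-category decision-threshold tuning, as discussed above for the SVM model, would be an additional post-processing step on a held-out set rather than part of model fitting.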
As the figures show, there is a good agreement between the cross-validation and the cross-lingual classification performance, giving us an ability to estimate cross-lingual performance based on the cross-validation score in the production environment. The difference between distributions for French and Figure 6. ROC score correlation for SVM English language models is consistent with the class imbalance for each of the categories. 19 The SVM model seems to have a more consistent behavior, so we will use it in the final application instead of logistic regression. Figures 7 through 10 show the F1 and ROC score behavior of logistic regression and SVM models for cross-validation and cross-lingual classification with regard to the number of positive examples in the category, separately for English and French language. While the SVM model underperforms on the F1 metric on average, it produces a better ranking of documents with respect to a category, as seen on ROC plots, especially for smaller categories. This further indicates the need for decision threshold tuning in the SVM model before we use its predictions. Figure 10 ROC score with respect to category size, cross- lingual prediction As expected, classification performance of all models improves with the number of training examples, but in cases of small categories, it appears that some are much easier to learn than others. 4. CONCLUSIONS AND FUTURE WORK We found that using a logistic regression model with weighted Wikifier annotations gives us a good enough result to use IPTC category tags as inputs for further machine processing in the Figure 7. F1 score with respect to category size, cross- Newsfeed pipeline. Before we can use this categorization for validation human consumption, we need to investigate automatic tuning of SVM decision thresholds on this problem, and add an additional filtering layer that takes into consideration interactions between categories beyond the sub/super-class relation. Additionally, the negative effect of KCCA-derived features for cross-lingual annotation needs to be examined. 5. ACKNOWLEDGEMENTS This work was supported by the Slovenian Research Agency as well as the euBusinessGraph (ICT-732003-IA) and EW-Shopp (ICT-732590-IA) projects. 6. REFERENCES [1] Trampuš M., Novak B., “The Internals Of An Aggregated Web News Feed” Proceedings of 15th Multiconference on Information Society 2012 (IS-2012). Figure 8. ROC score with respect to category size, cross- [2] https://iptc.org/standards/media-topics/ validation [3] https://www.wikidata.org/wiki/Wikidata:Main_Page [4] Rupnik, J., Muhič, A., Škraba, P. “Cross-lingual document retrieval through hub languages”. NIPS 2012, Neural Information Processing Systems Worshop, 2012 [5] Brank J., Leban G. and Grobelnik M. “Semantic Annotation of Documents Based on Wikipedia Concepts”. Informatica, 42(1): 2018. [6] Pedregosa, F., Varoquaux, G., Gramfort, A. et al. “Scikit- learn: Machine Learning in Python”. Journal of Machine Learning Research, 12. 2011, pp. 2825-2830. [7] K. Sparck Jones. "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation, 28 (1). 1972 Figure 9. 
F1 score with respect to category size, cross-lingual prediction 20 Transporation mode detection using random forest Jasna Urbančič Veljko Pejović Dunja Mladenić Artificial Intelligence Faculty of Computer and Artificial Intelligence Laboratory, Information science, Laboratory, Jožef Stefan Institute University of Ljubljana Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Večna pot 113, 1000 Ljubljana Jamova 39, 1000 Ljubljana, Slovenia Slovenia Slovenia jasna.urbancic@ijs.si veljko.pejovic@fri.uni-lj.si dunja.mladenic@ijs.si ABSTRACT While the first attempts to recognize user activity were ini- This paper addresses transportation mode detection for a tiated before smart phones, the real effort in that direc- mobile phone user using machine learning and based on mo- tion begun with the development of mobile phones having bile phone sensor data. We describe our approach to data built-in sensors [10], including GPS and accelerometer sen- collection, preprocessing and feature extraction. We eval- sors. There are still some studies that use custom loggers uate our approach using random forest classification with to collect the data [11, 17] or use dedicated devices as well focus on feature selection. We show that with feature selec- as smart phones [5]. Although GSM triangulation and local tion we can significantly improve classification scores. area wireless technology (Wi-Fi) can be employed for the purpose of transportation mode detection, their accuracy is 1. INTRODUCTION relatively low compared to GPS [11], so latest state of the art In the recent years we have witnessed a drastic increase in research is focused on transportation mode detection based sensing and computational resources that are built in mo- on GPS tracks and/or accelerometer data. bile phones. Most of modern cell phones are equipped with a Machine learning approaches for transportation mode detec- set of sensors containing triaxial accelerometer, magnetome- tion often rely on statistical, time-based, frequency-based, ter, and gyroscope, in addition to having a Global Position- peak-based and segment-based [8] features, however in most ing System (GPS). Smart phone operating system APIs of- cases statistical features and features based in frequency are fer activity detection modules that can distinguish between used [10, 11, 16]. Feature domains are described in Table different human activities, for example: being still, walk- 1. Statistical, time-based, and spectral attributes are com- ing, running, cycling or driving in a vehicle [2, 3]. However, puted on a level of a time frame that usually covers a few sec- APIs cannot distinguish between driving in different kind onds, whereas peak-based features are calculated from peaks of vehicles, for example driving a car or traveling by bus or in acceleration or deceleration. On the other hand, segment- by train. Recognizing different kind of transportation, also based features are computed on the recordings of the whole known as transportation mode detection, is crucial for mo- trip, which means that they cover much larger scale. Statis- bility studies, for routing purposes in urban areas where pub- tical, time-based, and spectral features are able to capture lic transportation is often available, for facilitating the users the characteristics of high-frequency motion caused by user’s to move towards more environmentally sustainable forms of physical movement, vehicle’s engine and contact between transportation [1], or to inspire them to exercise more. wheels and surface. 
Peak-based features capture movement In this paper we discuss the use of random forest in trans- with lower frequencies, such as acceleration and breaking portation mode detection based on accelerometer signal. We periods, which are essential for distinguishing different mo- focus on torized modalities. Additionally, segment-based features de- 1. feature extraction, and scribe patterns of such acceleration and deceleration periods [8]. 2. feature analysis to determine the most meaningful fea- tures for this specific problem and the choice of classi- Machine learning methods that are most commonly used fier. in accelerometer based modality detection include support vector machines, decision trees and random forests, how- Our main contribution is feature analysis, which revealed ever naive Bayes, Bayesian networks and neural networks the impact of each feature to the classification scores. have been used as well [11, 12]. Often these classifiers are 2. RELATED WORK used in an ensemble [16]. The majority of algorithms addi- tionally use Adaptive Boosting or Hidden Markov Model to improve the performance of the methods mentioned above [16, 8, 11, 10]. In the last years, deep learning has also been used [6, 14]. Accelerometer-only approach where only statistical features have been used reported 99.8% classification accuracy, how- ever users were instructed to keep the devices fixed position during a trip. Furthermore, only 0.7% of data was labeled as train [11]. State of the art approach to accelerometer-only 21 Domain Description (1) Data (1a) Mobile Statistical These features are include mean, standard de- acquisition applications viation, variance, median, minimum, maximum, range, interquartile range, skewness, kurtosis, root (2b) mean square. (2) Pre- (2a) (2b) Gravity Resampling Filtering Time processing Time-based features include integral and double estimation integral of signal over time, which corresponds to speed gained and distance traveled during that (3) Feature recording. Other time-based features are for ex- extraction ample auto-correlation, zero crossings and mean crossings rate. (4a) (4b) Frequency Frequency-based features include spectral energy, (4) Feature Correlation Statistical spectral entropy, spectrum peak position, wavelet analysis analysis analysis entropy and wavelet coefficients. These can be computed on whole spectrum or only on spe- (5) Clas- (5a) (5b) cific parts, for example spectral energy bellow Defining Choosing sification 50Hz. Spectrum is usually computed using fast feature sets classifiers Fourier transform, whereas wavelet is a result of the Wavelet transformation. Entropy measures are Figure 1: Detailed work flow diagram of the based on the information entropy of the spectrum proposed approach. We stacked general, high- or wavelet [7]. level tasks common in other approaches vertically, Peak Peak-based features use horizontal acceleration whereas subtasks specific to our approach are pic- projection to characterize acceleration and decel- tured horizontally. eration periods. These features include volume, intensity, length, skewness and kurtosis. Split the signal Convolute with Segment Segment-based include peak frequency, stationary Signal on acceleration Find peaks a box window and deceleration duration, variance of peak features, and station- ary frequency. The latter two are similar to ve- locity change rate and stopping rate used by [17]. 
Count or compute Segment-based features are computed on a larger scale than statistical, time-based or frequency- Number of peaks based features. Table 1: Feature domains and their descriptions Mean Peak height adopted from [8]. Peak-based Standard Peak width features deviation transportation mode detection relies on long accelerometer Skewness Peak width height samples. It uses features from all five domains for classifica- tion with AdaBoost with decision trees as a weak classifier Peak area and achieves 80.1% precision and 82.1% recall [8]. Figure 2: Work flows for extraction of peak-based The performance of transportation mode detection systems features. depends on the effectiveness of handcrafted features designed by the researchers, researcher’s experience in the field af- We collect five second samples of sensor data and resam- fects the results. Thus, there have been approaches that use ple them to sampling frequency 100 Hz in the preprocessing deep learning methods, such as autoencoder or convolutional phase. Resampling ensures us that our samples all contain neural network, to learn the features used for classification. 500 measurements. The most prominent problem we face in Using a combination of handcrafted and deep features for preprocessing concerns the correlation of acceleration mea- classification with deep neural network resulted in 74.1% surements with the orientation of the phone in the three classification accuracy [15]. dimensional space. Practically this means that gravity is measured together with the dynamic acceleration caused by 3. PROPOSED APPROACH phone movements. To eliminate gravity from the samples we perform gravity estimation on raw accelerometer signal. Work flow of the proposed approach is visualized in Figure We follow an approach proposed by Mizell [9]. Gravity es- 1. The first task is data collection. To collect data we use timation splits the acceleration to static and dynamic com- NextPin mobile library [4] developed by the Artificial In- ponent. Static component represents the constant accelera- telligence Laboratory at Jožef Stefan Institute. Library is tion, caused by the natural force of gravity, whereas dynamic embedded into two free mobile applications. The first one is component is a result of user’s motion. Furthermore, using OPTIMUM Intelligent Mobility [1]. OPTIMUM Intelligent this approach we are able to calculate vertical and horizontal Mobility is a multimodal routing application for three Eu- components of acceleration. ropean cities — Birmingham, Ljubljana, and Vienna. The second one is Mobility patterns [4]. Mobility patterns is es- We use preprocessed signal to extract features for classifica- sentially a travel journal. Both applications send five second tion. Features are divided into five domains, based on their long accelerometer samples every time OS’s native activity meaning and method of computation. We have listed the do- recognition modules, Google’s ActivityRecognition API [2] mains in Table 1. We extract features from three domains — for Android and Apple’s CMMotionActivity API [3], de- statistical, frequency, and peak. We extract statistical fea- tect that the user is traveling in a vehicle. We use that tures (maximal absolute value, mean, standard deviation, accelerometer samples for fine-grained classification of mo- skewness, 5th percentile, and 95th percentile) from dynamic torized means of transportation. acceleration and its amplitude, horizontal acceleration and 22 Set Accele. 
Features Size Feature set CA RE PR F1 D-S Dynamic Statistical 54 D-S 0.48 0.41 0.39 0.37 D-SF Dynamic Statistical, Frequency 94 D-SF 0.60 0.41 0.41 0.39 D-SFP Dynamic Statistical, Frequency, Peak 172 D-SFP 0.46 0.39 0.40 0.35 H-S Horizontal Statistical 54 H-S 0.64 0.40 0.43 0.41 H-SF Horizontal Statistical, Frequency 94 H-SF 0.53 0.39 0.43 0.36 H-SFP Horizontal Statistical, Frequency, Peak 172 H-SFP 0.50 0.37 0.40 0.34 ALL 376 ALL 0.47 0.35 0.40 0.33 Table 2: Predefined feature sets used for classifica- Table 3: Classification metrics for classification with tion. random forest on predefined feature sets. Change model parameters the training set we use the data from [13], whereas validation and test sets were obtained during Optimum pilot testing in 2018. During validation step we are trying to maximize F1 (2) (1) score as our data set is imbalanced. We visualized the evalu- Validate Evaluate Train ation scenario in Figure 3, while the composition of the sets Join datasets in pictured in Figure 4. (3) (4) Test Use best parameters Join datasets Train + and 4. RESULTS Validate evaluate We trained random forest classifier on the predefined fea- ture sets from Table 2. Classification metrics we report on Figure 3: Schema of evaluation scenario. include classification accuracy (CA), recall (RE), precision its amplitude, amplitude of raw acceleration, and amplitude (PS) and F1 score (F1) Results are listed in Table 3. Ta- of vertical acceleration. From the same signals we extract ble 3 shows that we achieved the highest F1 score of 0.41 frequency-based features using fast Fourier transformation. using H-S feature set. This feature set consists of statisti- As frequency-based features we use the power spectrum of cal features calculated on the horizontal acceleration vector. the signal aggregated in 5 Hz bins. Pipeline for extraction of Classification accuracy in that case is also high, compared to peak-based features from dynamic and horizontal in acceler- other feature sets. The peak features seems to be the source ation is pictured in Figure 2. To extract peak-based features of noise in the data, as using peak features in combination we first smooth out the signal with convolution with a box with the other features sets decreases the performance, for window and split it into moments of acceleration and mo- example F1 drops from 0.39 for D-SF to 0.35 for D-SFP. ments of deceleration. Then we find peaks and compute F1 score and classification for dynamic acceleration increase peak heights, peak widths, peak width heights, and peak when we add frequency-based features, whereas these two areas. As there is usually more than one peak we aggregate measures decrease in case of similar action for horizontal ac- these values using mean, standard deviation, and skewness. celeration. This offers two possible interpretations. One is All together we extract 376 features. We organize features that frequency-based features of dynamic acceleration carry into seven predefined feature sets we use for classification. more information compared to frequency-based features of Feature sets are listed in Table 2. horizontal acceleration. The second one is that statistical To evaluate the capabilities and performance of the pro- features of horizontal acceleration are much better than sta- posed approach, we divide our dataset in 3 subsets — train, tistical features from dynamic acceleration. 
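The preprocessing and feature-extraction steps described above could be sketched as follows. The gravity estimation follows Mizell's mean-based approach, while the smoothing width and the exact set of peak aggregates are assumptions; this is an illustration, not the authors' code.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import skew

FS = 100  # resampled sampling frequency in Hz (5 s samples -> 500 measurements)

def split_gravity(acc):
    """Mizell-style gravity estimation on one sample of shape (n, 3): the per-axis
    mean is the static (gravity) component, the remainder is dynamic acceleration,
    and the vertical part is the projection onto the gravity direction."""
    g = acc.mean(axis=0)                      # static component estimate
    dyn = acc - g                             # dynamic acceleration
    g_dir = g / np.linalg.norm(g)
    vertical = np.outer(dyn @ g_dir, g_dir)   # projection onto gravity direction
    horizontal = dyn - vertical
    return dyn, vertical, horizontal

def statistical_features(x):
    return [np.abs(x).max(), x.mean(), x.std(), skew(x),
            np.percentile(x, 5), np.percentile(x, 95)]

def frequency_features(x, bin_hz=5):
    """Power spectrum aggregated into 5 Hz bins."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
    return [power[(freqs >= lo) & (freqs < lo + bin_hz)].sum()
            for lo in range(0, FS // 2, bin_hz)]

def peak_features(x, box=10):
    """Smooth with a box window, find peaks, aggregate their heights/widths/areas."""
    smooth = np.convolve(x, np.ones(box) / box, mode="same")
    peaks, props = find_peaks(smooth, height=0, width=1)
    if len(peaks) == 0:
        return [0.0] * 7
    heights, widths = props["peak_heights"], props["widths"]
    areas = heights * widths
    return [len(peaks), heights.mean(), heights.std(),
            widths.mean(), widths.std(), areas.mean(), skew(areas)]
```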
validation, and test set — based on the date the samples We noticed that smaller feature sets generally perform better were recorded on. By doing so we avoided using in this than larger so we focused on feature selection. We initially domain methodologically questionable random assignment train the model with all features and evaluate it on valida- of samples collected during the same trip to different sub- tion set. Then we remove each feature one by one, train the sets. The reason why we did not apply cross-validation is model, evaluate it on the validation set and compare all F1 similar. Using samples from the same trip in train and test scores. We eliminate the feature with the highest F1 score, set would result in significantly higher evaluation scores. For as this means that when the model was trained without that feature if performed better than when the eliminated feature was included. We repeat this procedure until the feature set consists of one feature. Similarly, we do feature addition — we start with an empty feature set and gradually add features one by one. Using the described process of forward feature selection and backward feature elimination we selected two feature sets that performed the best — in case of forward selection the best feature set has 10 features, whereas feature set pro- duced with backward elimination has 28 features. Feature set obtained by forward selection mostly contains statisti- cal features, followed by peak-based. Only one frequency- based features appears in that set. Additionally, features Figure 4: Distribution of modes in train, validation, are in vast majority extracted from dynamic acceleration. and test set. We also added joint train and valida- On the other hand feature set obtained by backward elim- tion set, which we use to train the final model. 23 Feature set CA RE PR F1 J. Urbančič. Optimum project: Geospatial data Forward selection (10) 0.70 0.50 0.47 0.48 analysis for sustainable mobility. In 24th ACM Backward elimination (28) 0.73 0.50 0.48 0.49 SIGKDD International Conference on Knowledge Table 4: Classification metrics for classification with Discovery & Data Mining Project Showcase Track. the selected features in feature selection. ACM, 2018. http://www.kdd.org/kdd2018/files/ project-showcase/KDD18_paper_1797.pdf. Forward selection Backward elimination [5] K.-Y. Chen, R. C. Shah, J. Huang, and L. Nachman. T \P Car Bus Train T \P Car Bus Train Mago: Mode of transport inference using the Car 0.78 0.27 0.05 Car 0.83 0.12 0.05 hall-effect magnetic sensor and accelerometer. Bus 0.51 0.40 0.09 Bus 0.55 0.35 0.10 Proceedings of the ACM on Interactive, Mobile, Train 0.47 0.21 0.32 Train 0.45 0.23 0.32 Wearable and Ubiquitous Technologies, 1(2):8, 2017. Table 5: Confusion matrix for classification with the [6] S.-H. Fang, Y.-X. Fei, Z. Xu, and Y. Tsao. Learning selected features in feature selection. transportation modes from smartphone sensors based ination contains more peak-based features than statistical, on deep neural network. IEEE Sensors Journal, again only one frequency-based feature appears. Dynamic 17(18):6111–6118, 2017. acceleration and horizontal acceleration appear in similar [7] D. Figo, P. C. Diniz, D. R. Ferreira, and J. M. proportions. We evaluated the models trained with that Cardoso. Preprocessing techniques for context feature sets against the test set. Results are listed in Ta- recognition from accelerometer data. Personal and ble 4. 
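The backward feature elimination described above can be sketched as follows (forward selection proceeds symmetrically, starting from an empty set); the classifier settings and the macro-averaged F1 used here are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def backward_elimination(X_train, y_train, X_val, y_val):
    """Greedy backward elimination driven by validation F1."""
    remaining = list(range(X_train.shape[1]))
    history = []

    def score(features):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train[:, features], y_train)
        pred = clf.predict(X_val[:, features])
        return f1_score(y_val, pred, average="macro")

    while len(remaining) > 1:
        # try dropping each feature in turn and keep the drop that scores best,
        # i.e. remove the feature whose absence gives the highest validation F1
        trials = [(score([f for f in remaining if f != cand]), cand)
                  for cand in remaining]
        best_f1, to_drop = max(trials)
        remaining.remove(to_drop)
        history.append((list(remaining), best_f1))

    # return the feature subset with the highest validation F1 along the way
    return max(history, key=lambda item: item[1])
```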
Both feature sets achieve better F1 scores than any Ubiquitous Computing, 14(7):645–662, 2010. previous feature sets. Confusion matrix in Table 5 reveals [8] S. Hemminki, P. Nurmi, and S. Tarkoma. what are the differences between these two feature sets. We Accelerometer-based transportation mode detection can see that in case of eliminating features there is less cars on smartphones. In Proceedings of the 11th ACM missclassified as buses and more buses missclassified as cars. Conference on Embedded Networked Sensor Systems, Classification of trains is fairly consistent. For buses and page 13. ACM, 2013. trains the largest part of samples is still missclassified as [9] D. Mizell. Using gravity to estimate accelerometer cars. orientation. In Proc. 7th IEEE Int. Symposium on 5. CONCLUSIONS Wearable Computers (ISWC 2003), page 252. Citeseer, 2003. We showed that while transportation mode with random for- est is possible, careful feature selection is necessary. Using [10] S. Reddy, M. Mun, J. Burke, D. Estrin, M. Hansen, feature selection we were able to improve classification scores and M. Srivastava. Using mobile phones to determine for at least 0.04, in some cases even over 0.10. Although clas- transportation modes. ACM Transactions on Sensor sification scores improved, most of non-car samples were still Networks (TOSN), 6(2):13, 2010. misclassified as cars. We observed that even though peak- [11] M. A. Shafique and E. Hato. Use of acceleration data based features did not perform as well in predefined feature for transportation mode prediction. Transportation, sets, they appeared consistently among selected features in 42(1):163–188, 2015. feature selection. That does not hold for frequency-based [12] L. Stenneth, O. Wolfson, P. S. Yu, and B. Xu. feature only one feature appeared among selected features. Transportation mode detection using mobile phones For the future work we suggest maximization of another clas- and gis information. In Proceedings of the 19th ACM sification score as we focused on maximization of F1 score. SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 54–63. 6. ACKNOWLEDGMENTS ACM, 2011. This work was supported by the Slovenian Research Agency [13] J. Urbančič, L. Bradeško, and M. Senožetnik. Near under project Integration of mobile devices into survey re- real-time transportation mode detection based on search in social sciences: Development of a comprehensive accelerometer readings. In Information Society, Data methodological approach (J5-8233), and the ICT program of Mining and Data Warehouses SiKDD, 2016. the EC under project OPTIMUM (H2020-MG-636160). [14] T. H. Vu, L. Dung, and J.-C. Wang. Transportation 7. REFERENCES mode detection on mobile devices using recurrent nets. [1] Optimum project - European Union’s Horizon 2020 In Proceedings of the 2016 ACM on Multimedia research and innovation programme under grant Conference, pages 392–396. ACM, 2016. agreement No 636160-2. [15] H. Wang, G. Liu, J. Duan, and L. Zhang. Detecting http://www.optimumproject.eu/, 2017. [Online; transportation modes using deep neural network. accessed 4-November-2017]. IEICE TRANSACTIONS on Information and [2] ActivityRecognition. https://developers.google. Systems, 100(5):1132–1135, 2017. com/android/reference/com/google/android/gms/ [16] P. Widhalm, P. Nitsche, and N. Brändie. Transport location/ActivityRecognition, 2018. [Online; mode detection with realistic smartphone sensor data. accessed 31-August-2018]. 
In Pattern Recognition (ICPR), 2012 21st [3] CMMotionActivity. https://developer.apple.com/ International Conference on, pages 573–576. IEEE, library/ios/documentation/CoreMotion/ 2012. Reference/CMMotionActivity_class/index.html#// [17] Y. Zheng, Q. Li, Y. Chen, X. Xie, and W.-Y. Ma. apple_ref/occ/cl/CMMotionActivity, 2018. [Online; Understanding mobility based on gps data. In accessed 31-August-2018]. Proceedings of the 10th international conference on [4] L. Bradeško, Z. Herga, M. Senožetnik, T. Šubic, and Ubiquitous computing, pages 312–321. ACM, 2008. 24 FSADA, an anomaly detection approach A modern, cloud-based approach to anomaly-detection, capable of monitoring complex IT systems Viktor Jovanoski Jan Rupnik Jozef Stefan International Postgraduate School Jozef Stefan Institute Jamova 39 Jamova 39 Ljubljana, Slovenia Ljubljana, Slovenia viktor@carvic.si jan.rupnik@ijs.si ABSTRACT huge volumes or just a few data points per day. Designing Modern IT systems are becoming increasingly complex and a system that can cope with such diverse situations can be inter-connected, spanning over a range of computing de- challenging. vices. As software systems are being split into modules and services, coupled with an increasing parallelization, de- Another important aspect is ”actionability” of the reported tecting and managing anomalies in such environments is anomalies. When human operator is presented with a new hard. In practice, certain localized areas and subsystems alert, the message as to what is wrong needs to be clear. The provide strong monitoring support, but cross-system error- operator must be able to immediately start addressing the correlation, root-cause analysis and prediction are an elusive problem. Sometimes all we need is a different presentation target. of the result, but most often the easy-to-describe algorithms and models are used - e.g. linear regression or nearest neigh- We propose a general approach to what we call Full-spectrum bour. anomaly detection - an architecture that is able to detect lo- cal anomalies on data from various sources as well as creating This high velocity of data (volume and rate) makes some high-level alerts utilizing background knowledge, historical of the algorithms less usable in such scenarios - specifically data and forecast models. The methodology can be imple- batch processing that requires random access to all past mented either completely or partially. data is not desired. Ideally, we would only use streaming algorithms - algorithms that live on the stream of incoming Keywords data, where each data point is processed only once and then discarded. Anomaly detection, Outlier detection, Infrastructure moni- toring, Cloud The contribution of this paper is a hollistic approach to anomaly detection system that clearly defines different parts 1. INTRODUCTION and stages of the processing, including active learning as a Modern IT systems need several key capabilities, apart from crucial part of the processing loop. The design addresses tracking and directing the underlying businesses. They need modern challenges in IT system monitoring and is suitable to manage errors and failures - predict them in advance, for cloud deployment. detect them in their early stages, help limit the scope of the damage and mitigate their consequences. All this is achieved by analyzing past and current data and detecting outliers in 2. ANOMALY-DETECTION it. 
Anomaly detection must happen in near-real time, while simultaneously analyzing potentially thousands of data points per second. Incoming data that such a system can monitor is very diverse. Data can come in different shapes (numeric, discrete or text), in regular time intervals or sporadically, in huge volumes or just a few data points per day.

The most common definition of an anomaly is a data point that is significantly different from the majority of other data points. See [2] for a detailed explanation. This definition is strictly analytical. But most often the users define it within the scope of their operation, such as finding abnormal engine performance in order to prevent catastrophic failure, flagging unexpected delays in a manufacturing pipeline in order to prevent shipment bottlenecks, detecting unusual user behavior that indicates intrusion, and identifying market sectors that exhibit unusual trends to detect fraud and tax evasion.

The anomaly-detection process is thus heavily influenced by the target domain. It also needs a process-specific way of measuring the detection efficiency. For instance, in classification problems we can use several established measures such as accuracy, recall, precision or F1. In anomaly detection, on the other hand, we often don't have classes to work with
2.2 Modern challenges In the era of big data there are many systems that produce 3.1 Terminology data and a lot of the generated data can be used to monitor, From now on we will be using the following terminology: maintain and improve the target system. The data volumes are staggering and need to be addressed properly within the anomalies - any kind of abnormal behavior inside the sys- system implementation. tem, regardless of the effect on the system performance. Users expect results to be available as soon as possible - signals - low-level anomalies that have been detected on within hours, sometimes even minutes or seconds. In cases single data-stream. where automated response in possible, this time-frame short- ens to miliseconds (e.g. stock trading, network intrusion). incidents - complex anomaly or a group of them with major impact on the system. Its time duration is usually limited Current systems for anomaly detection are developed as add- to several minutes or hours. They are closely related to the ons to the existing systems for collecting and processing way users perceive the system problems and outages. data. This makes sense, since they developed organically, during the usage by the competent users, which identified alerts - an anomaly that is reported to the user, self-contained areas that require advanced monitoring. We belive this pro- with explanation and basic context. vides necessary business validation of anomaly detection sys- tems. However, there are limitations of such approach. 3.2 Storage module The system needs to store several types of data that per- • Data that is available in one part of the system might form different functions. Each part of the storage layer can not be available in another part, where anomaly- be located in separate system that best matches the require- detection could greatly benefit from it. ments. • Data volume could prove to be too big for effective Measurement data represents raw values that were ob- anomaly detection analysis, because needed resources served and processed in order to monitor the system. This might not be available (e.g. computing power is needed data is strictly speaking not necessary when our algorithms for main processing and anomaly detection should not are designed to work on a stream, but they are required interfere with it). for batch algorithms, for model retraining, active learning • Anomaly detection has local scope as it only pro- and for ad-hoc analytics. Generated signals and inci- cesses data from one part of the system. The alerts dents are stored, additionally processed and viewed by the are thus not aware of the potential problems in other user. The storage needs to support flexible format of alerts, parts of the system, so resolving issues takes longer since each one of them is ideally an independent chunk of and involves more people from several departments to data that can be visualized without additional data retrieval. coordinate during problem escalations. The algorithms can use domain knowledge to guide their execution. To facilitate this, the data needs to be stored in • There is no systematic way of collecting the user feed- a storage system that provides fast searching, in order to be back that would guide and improve the anomaly de- used in stream processing steps for enrichment, routing and tection process. aggregation. 
The algorithms inside the system create and update their models all the time, so this part of the storage needs to support reliable and robust storing of possibly large binary files.

Figure 1: The big picture - displays the main building layers, such as stream processing and storage, as well as the flow of the data between the different components of the system.

3.3 Stream-processing module
This module contains the most important part of the system - the components that transport the data, run the processing and generate alerts.

3.3.1 Incoming-data pre-processing
Incoming data that the system analyses arrives at different volumes and speeds (high velocity), as well as in many different types and formats. This data needs to be pre-processed before it reaches any anomaly-detection algorithms.

Coping with such a high-volume data stream requires special technologies. Streaming solutions such as Apache Kafka [4] have been developed and battle-tested for processing millions of data records per second in a distributed manner. This step needs to perform several functions.

Data formatting and enrichment - transform messages from the input format into a common format that is accepted by the internal algorithms. Also, additional data fields can be attached, based on background knowledge.

Data aggregation - sometimes we want to measure characteristics of all the data within some time interval (e.g. average speed in the last 10 minutes).

Data routing - send the transformed and aggregated data to the relevant receivers. There may be several listeners for the same type of input data.

3.3.2 Signal detectors
When data is ready for processing, it is routed to signal detectors. They operate on a single data stream, often only on a small partition of it - e.g. a single stock or a group of related stocks. They handle huge data volumes, so they need to be fast, using very little resources. To achieve great flexibility regarding input data, a dynamic allocation of such processors is required. This enables handling of previously unseen data partitions as well as scalability in parallel processing.

These anomalies (signals) have simple models and consequently simple alert explanations. But they are local in nature - their scope is most often very limited. They also operate on a single data stream, so they don't take into account the anomalies in "the neighbourhood". To overcome these deficiencies, we propose a third step of stream processing, to which signals should be sent.

3.3.3 Incident detectors
This stage of the processing receives signals from different parts of the system, performing scoring of their importance and combining them into incidents, thus achieving several advantages.

The scoring algorithm provides the option to assign user-guided subjective importance to signals - e.g. two statistically equally important anomalies can have completely different perceived value to the user. This step can also correlate data across data streams, a step that is hard to achieve and that proves to be very valuable. Given data from different parts of the system, it can create more complex constructs that better evaluate the impact of the current problem on the whole system.

This level of abstraction is the main access point for end-users - it more closely follows their way of addressing system malfunctions (e.g. "if module A breaks, it will have impact on modules B and C, but module D should remain unaffected").
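As a rough illustration of the scoring step in Section 3.3.3, the sketch below combines a statistical signal score with a user-assigned subjective weight per stream and only promotes a group of signals to an incident above a threshold. The weighting scheme, stream names and threshold are assumptions made for illustration, not the authors' actual algorithm.

from typing import Dict, List, Optional

# Hypothetical per-stream importance weights collected from user feedback.
USER_WEIGHTS: Dict[str, float] = {"orders": 2.0, "payments": 1.5, "logging": 0.3}

def score_signal(stream_id: str, statistical_score: float) -> float:
    # Two statistically equal signals can end up with a different perceived value.
    return statistical_score * USER_WEIGHTS.get(stream_id, 1.0)

def combine_into_incident(signals: List[dict], threshold: float = 3.0) -> Optional[dict]:
    # Sum the weighted scores of temporally close signals from different streams;
    # only report an incident if the combined impact is large enough.
    total = sum(score_signal(s["stream_id"], s["score"]) for s in signals)
    if total < threshold:
        return None
    return {
        "signals": signals,
        "impact": total,
        "streams": sorted({s["stream_id"] for s in signals}),
    }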
3.3.4 Background knowledge
To help guide the algorithms during signal detection, we can provide additional background knowledge in different forms, such as metadata, manual thresholds and rules, graphs and other structures. All this data can be used to perform various enhancements of the basic algorithms, such as creation of additional features in data pre-processing, up- and down-voting of results (e.g. estimated impact of a detected anomaly), pruning of the search space in optimization steps, estimation of affected entities for a given anomaly, or support for complex simulations where current performance is measured against historical values. These rules and metadata can be acquired by analyzing historical data as well as by collecting knowledge from end-users, e.g. through manual feedback/sign-off and active learning.

3.4 Improving actionability
The system modules presented so far are mostly established components that are used also in normal processing steps of modern, cloud-based systems (see [1]). We propose that they should be upgraded with the following functionalities in order to achieve the goal of high-quality actionable alerts, empowering users to manage their complex systems in the most efficient way.

3.4.1 Feedback
Historical incidents are very valuable for learning informative features that aid the detection of anomalies. They are also used for calibrating the scoring algorithm that assigns relevance scores to generated signals and incidents. It is common for the organization to require every major detected incident to be manually signed off - a relevance tag (e.g. high-relevant, semi-relevant, not-relevant, noise, new-normal) has to be assigned to it by the operators. These tags are used for training incident-classification algorithms, but we can also construct a more complex setting where a form of backtracking is used to calibrate the signal detectors.

3.4.2 Active learning
The active-learning approach [3] can be used to make the manual classification effort more efficient. The system provides untagged examples/incidents for which the criteria function returns the value that is closest to the decision boundary. The user then manually classifies the incident and the classification model is re-trained with this new data. By guiding users in this way, the system requires a relatively small number of steps to cover the search space and obtain good learning examples.

Our proposed approach incorporates this continuous activity in several areas. The GUI module should contain appropriate pages where the user can enter feedback and active-learning input. The storage module contains historical alert data that can be used for re-training of incident detectors. The storage module also contains old and new incident-detector models that can be picked up automatically by the processing module.
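A minimal sketch of the uncertainty-sampling loop described in Section 3.4.2, using scikit-learn as a stand-in. The choice of logistic regression and the binary relevant/not-relevant labels are illustrative assumptions, not the system's actual classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pick_next_incident(model, untagged_features: np.ndarray) -> int:
    # Select the untagged incident whose predicted probability is closest to
    # the decision boundary (0.5), i.e. the one the model is least sure about.
    proba = model.predict_proba(untagged_features)[:, 1]
    return int(np.argmin(np.abs(proba - 0.5)))

def active_learning_round(tagged_X, tagged_y, untagged_X, ask_operator):
    # Re-train on what is already tagged, ask the operator about the most
    # uncertain incident, and grow the labeled set by one example.
    model = LogisticRegression(max_iter=1000).fit(tagged_X, tagged_y)
    idx = pick_next_incident(model, untagged_X)
    label = ask_operator(untagged_X[idx])            # manual sign-off by the user
    tagged_X = np.vstack([tagged_X, untagged_X[idx]])
    tagged_y = np.append(tagged_y, label)
    untagged_X = np.delete(untagged_X, idx, axis=0)
    return model, tagged_X, tagged_y, untagged_X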
4. VALIDATION AND DISCUSSION
Based on our extensive experience with practical anomaly-detection implementation, we identified several new requirements for these systems. The presented approach satisfies them by supporting big-data real-time analytics on one side and actionability via active-learning support on the other.

The system architecture is deployable to a cloud environment by design. We also employ modern streaming and storage technologies for transporting and storing the different input data and alerts.

We observed that users appreciate our notion of incidents - a grouping of alerts that occur in a certain small time interval. Users feel comfortable with seeing the big picture (an incident) and then drilling down into specific data (individual signals). They reported that this feature enables them to cut down the time for understanding a problem by an order of magnitude (from hours to minutes).

The active-learning component was well received, as it made manual work more efficient. The users noticed how the algorithm was choosing more and more complex learning examples for manual classification. This helped them feel productive and engaged. They also reported a positive impact of active learning on their problem understanding, as they were presented with some potentially problematic situations that had gone unnoticed in the past.

Based on the above observations we conclude that our proposed approach has a positive impact on the organization, both for the technologies as well as the human operators. Additional ideas that were collected from users are listed under future work.

5. CONCLUSIONS AND FUTURE WORK
The focus of our future work is on several advanced scenarios where a lot of added value for users is expected, mixing anomaly detection, optimization and simulation. Main gains are expected to come from feedback collection and active learning. Apart from monitoring IT systems, the target domains are also manufacturing and smart cities. We also collected some features that users commonly inquired about:

• Root-cause analysis - when a major incident occurs, many parts of the system get affected. To resolve issues as quickly as possible, the operators should be pointed to the origin of the problem. The anomaly-detection system should thus have the capability to point to the first signal with high impact on the final relevance score.
• Predictions - the goal of all monitoring systems is to detect problems as soon as possible. The system must thus not only be able to detect signals, but also to forecast them, based on past behavior. In order to do that, the algorithms require more metadata and structure to properly model the inter-dependencies between signals. Mere observation is much easier than simulation of a complex system with many moving parts. But it is possible, and it is what users expect from a modern AI-based system.

Our future research will be oriented towards providing and efficiently integrating these functionalities into our anomaly-detection approach.

6. REFERENCES
[1] Anodot anomaly detection system. http://www.anodot.com, 2018.
[2] C. C. Aggarwal. Outlier Analysis. Springer New York, New York, 2013.
[3] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. CoRR, cs.AI/9603104, 1996.
[4] N. Garg. Apache Kafka. Packt Publishing, 2013.
[5] J. Lin. The lambda and the kappa. IEEE Internet Computing, 21(5):60-66, 2017.

Predicting customers at risk with machine learning

David Gojo, Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia, david.gojo@ijs.si
Darko Dujič, Ceneje d.o.o. and Jožef Stefan International Postgraduate School, Štukljeva cesta 40, 1000 Ljubljana, Slovenia, darko.dujic@ceneje.si
ABSTRACT
Today's market landscape is becoming increasingly competitive as more advanced methods are used to understand customer behavior. One of the key techniques are churn-mitigation tactics, which are aimed at understanding which customers are at risk of leaving the service provider and how to prevent this departure. This paper analyzes account renewal rates and uses easily applicable models to predict which accounts will be decreasing spend at the time when they are due to renew their existing contract, based on a number of attributes. The key question it tries to explore is whether customer behavioral data or customer characteristic data (or a combination of both) is better at predicting accounts that will renew at lower than the renewal target amount (churn rate).

Categories and Subject Descriptors
F.2.1 [Numerical Algorithms and Problems]: Data mining, Structured prediction

General Terms
Algorithms, Management, Measurement, Documentation, Performance

Keywords
Data Mining, Analysis, Churn prediction.

1. INTRODUCTION
The main issue of business is how to make educated decisions with the support of analysis that dissects complex decisions into addressable problems using measurements and algorithms. While many disciplines research the methodological and operational aspects of decision making, at the main level we distinguish between decision sciences and decision systems [1]. With an increasing number of companies trying to use machine learning to assist in their decision-making process, we examined how decision science can be supplemented by applying machine learning models to a company's customer data. We partnered with a medium-sized B2B business operating in Europe and Africa with the aim to help them better understand their 'customers at risk' segment of clients.

To this end we developed two easily applicable algorithms designed to highlight customers at risk, which the company can then address to mitigate their risk of leaving as clients.

The paper has the following structure: in section 2 we present related work in the area. Next, data acquisition is explained in section 3, followed by the results acquired from the tested algorithms in section 4. We then conclude our observations in section 5.

2. RELATED WORK
Improvements in tracking technology have enabled data-driven industries to analyze data and create insights previously unavailable to the business. Data mining techniques have evolved to now support the prediction of customer behavior, such as the risk of leaving, based on the attributes that are trackable [2]. The use of data mining methods has been widely advocated, as machine learning algorithms such as random-forest approaches have several advantages over traditional explanatory statistical modeling [3].

The lack of a predefined hypothesis makes algorithms excel in these tasks, as it makes it less likely to overlook predictor variables or potential interactions that would otherwise be labelled unexpected [4]. Such models are often labelled as business intelligence models aimed at finding customers that are about to switch to competitors or leave the business [5].

Key classifications are observed in work related to churn that we will use in our data set for review [6]:
- Behavioral data - attributes that we have historically observed to play a role in whether the account will renew or not: product utilization, activity levels of the account, number of successful actions in the account, and number of upsells done ahead of renewal.
- Characteristic attributes - size of the account in terms of spend, number of members in the company, number of active users of the products in the company, payment method and how they renew the contract, geography, and what level of support the product is given (number of sales visits and interactions with the customer).
3. DATA ACQUISITION
3.1 Data understanding
Working with the customer, we arranged a set of interviews with the leadership to better understand their business and the challenges they are experiencing, together with the ambitions they have in the business. After the interview rounds we focused on reviewing 2 hypotheses flagged in the examination process:
- What is driving churn numbers: behavior of the customers or better structure of the base?
- Does acquisition of new accounts represent a risk in churn numbers, with historic observation of accounts renewing lower / not renewing in their first-year renewal?

3.2 Data pre-processing
The data we used derives from the company's internal customer bookings and customer databases, which we consolidated. As customers are on yearly renewals, we have taken the renewal and the data on the account before the renewal as the key building block for the analysis.

3.3 Feature engineering
We enriched the data using SQL joins on the customer numbers to include key characteristics of accounts, tenure of the client, product utilization information, behavior of the customer before the renewal, and their usage of the product.

In terms of regional split of the market, the dataset consists of 4 key geo and segment regions in Europe and Africa:
- Medium-business segment
- UK & Ireland market
- Europe Enterprise segment
- Eastern Europe, Middle-East and Africa

Through feature engineering and reviewing descriptive statistics, we nominalized 11 key attributes. For the machine learning purposes we selected 3 possible outcomes related to customer spend after the renewal:
- Customer_Renew (Not renew, Partial renew, Full renew)

3.4 Data Set Statistics
We selected the bookings period from 2016 to the end of 2017, including 23,043 instances in the above selected renewal of 12,872 accounts. The attributes that were nominalized are listed below:
- (nom) Has_main_product - has product 1
- (nom) Has_assisting_product - has product 2
- (nom) Has_media_product - has product 3
- (nom) Account_potential - size and potential of the account
- (nom) Is_Auto_Renew - auto renewal option enabled
- (nom) First_renewal - is the client renewing for the first time
- (nom) Upsold_Before_renewal - was there an upsell before the renewal
- (nom) JS_Utilization - utilization of product 2 - indicator
- (nom) Score_Engagement - engagement of the recruiter
- (nom) LRI_Score - savviness of the user of the product

4. RESULTS
4.1 Brief description of the methods used
While multiple algorithms were used during the testing, an important requirement was that the result needed to include at least one interpretable model, so we went in the direction of nominalizing attributes and decided to use the J48 model and the Random forest classifier on the data set.

J48. The decision tree C4.5 (J48 in Weka) algorithm deals with continuous attributes, as observed in the related work. Where the method is classification-only, the main machine learning method applied is the J48 pruned tree, i.e. the WEKA-J48 machine learning method. The tree tries to partition the data set into subsets by evaluating the normalized information gain from choosing a descriptor for splitting the data. The training process stops when the resulting nodes contain instances of single classes, or if no descriptor can be found that would result in information gain.

Random Forest. We assume that the reader knows about the construction of single classification trees. Random Forest grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest) [7]. Both methods were applied to the imported dataset numerous times, with continuous testing of parameters to improve performance.

4.2 Application of J48
Working with Weka on the customer's dataset, we tried to tune the model parameters. Key modifications:
- "10-fold cross validation" used to improve accuracy
- Minimum number of objects moved to 100

As Figure 2 shows, this reduced the number of leaves to 16, which was comprehensible enough. A summary of the results is given below.

Figure 1: J-48 model error estimates
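The paper runs this workflow in Weka; the snippet below is a rough scikit-learn equivalent of the same setup (a pruned decision tree with a minimum leaf size of 100, a shallow random forest, and 10-fold cross-validation), shown only to make the procedure concrete. The feature matrix X and labels y are random placeholders standing in for the nominalized attributes and the Customer_Renew outcome.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the renewal instances.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 10))          # nominalized attributes
y = rng.choice(["NOT_RENEWED", "PARTIAL_RENEW", "FULL_RENEW"], size=1000)

# Roughly analogous to Weka J48 with minNumObj=100 (limits leaf size, i.e. pruning).
j48_like = DecisionTreeClassifier(min_samples_leaf=100)
print("Decision tree, 10-fold CV accuracy:",
      cross_val_score(j48_like, X, y, cv=10).mean())

# Roughly analogous to the random forest run with a maximum depth of 3.
rf = RandomForestClassifier(max_depth=3, random_state=0)
print("Random forest, 10-fold CV accuracy:",
      cross_val_score(rf, X, y, cv=10).mean())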
4.3 Application of Random forest
We ran several tests on Random forest vs Random trees. When tuning the parameters, the Random tree tended not to respond well to pruning, so Random forest was the preferred option. Like for J48, the application with key modifications was focused on validation and additionally on setting the maximum depth of the random forest:
- "10-fold cross validation"
- Max. depth set at 3

A summary of the results is given below.

Figure 2: Random forest model error estimates

4.4 Comparisons of models
Overall the J48 model has, surprisingly, a 0.7pp higher Classification Accuracy than the Random forest model.

Table 1. Baseline benchmark validation measures
Validation Measures      J48     Random Forest
Classification Accuracy  72.9%   72.2%
Mean absolute error      0.276   0.280

J48 provided significantly better interpretability and classification accuracy than the Random forest or any test on the Random tree model. Some additional tests were done on a Naïve Bayes model, and J48 was superior in the results. The key area where it excelled was in predicting accounts that will not renew: while the precision is just above 38%, this is almost double compared to the Random forest model.

A key observation when analyzing the data was that neither model was predicting any partially churned accounts after we forced their trees to be pruned.

J48 predictions:
   a      b      c   <-- classified as
   0   2745    285   |  a = PARTIAL_RENEW
   0   1528    789   |  b = FULL_RENEW
   0   2434   1504   |  c = NOT_RENEWED

Figure 3: The J48 decision tree

Random forest predictions:
   a      b      c   <-- classified as
   0   2857    173   |  a = PARTIAL_RENEW
   0  15591    483   |  b = FULL_RENEW
   0   2894   1044   |  c = NOT_RENEWED

The 3 key takeaways that the company found the most insightful were:
- One of the new features designed by the product team, which encouraged the auto-renew of their clients, played the most important role in predicting the renewal rate
- Customer behavior is a better signal for the probability of renewal than general account characteristics
- Account potential, which is the predictor of account potential and size, plays a role only after product utilization and engagement of the account with the products

5. CONCLUSION AND FUTURE WORK
For the acceleration of performance, the decision tree is of paramount importance and value. The insight that Auto renew as a feature is one of the key predictors of whether the account will renew fully is truly remarkable, given the simplicity of the models and how easily applicable they are.

Applications of these models will be a great foundation for driving the discussion on different account features and metrics. This is especially true as it tackles one of the key challenges observed in the hypotheses, namely how important 'account potential' is for the account ahead of the renewal.
Even though Random forest has a lower classification accuracy, J48 offers significantly higher interpretability, with the pruned tree offering valuable insights, as described briefly above and discussed in the evaluation of the models.

Observing the attributes added to the analysis scope and optimizing them for J48, we were able to gain valuable insight into which account characteristics vs account behaviors ahead of the renewal are the best predictors for the account to renew at the full amount.

6. REFERENCES
[1] M. Bohanec. Decision Making: A Computer-Science and Information-Technology Viewpoint, vol. 7, 2009, pp. 22-37.
[2] A. Rodan, A. Fayyoumi, H. Faris, J. Alsakran and O. Al-Kadi. "Negative correlation learning for customer churn prediction: a comparison study." TheScientificWorldJournal, vol. 2015, p. 473283, 2015.
[3] A. K. Waljee, P. D. R. Higgins and A. G. Singal. "A Primer on Predictive Models." Clinical and Translational Gastroenterology, vol. 5, no. 1, pp. e44-e44, 2014.
[4] Y. Zhao, B. Li, X. Li, W. Liu and S. Ren. "Customer Churn Prediction Using Improved One-Class Support Vector Machine." Springer, Berlin, Heidelberg, 2005, pp. 300-306.
[5] M. Óskarsdóttir, B. Baesens and J. Vanthienen. "Profit-Based Model Selection for Customer Retention Using Individual Customer Lifetime Values." Big Data, vol. 6, no. 1, pp. 53-65, 2018.
[6] S. Kim, D. Choi, E. Lee and W. Rhee. "Churn prediction of mobile and online casual games using play log data." PLOS ONE, vol. 12, no. 7, p. e0180735, 2017.
[7] J. Hadden, A. Tiwari, R. Roy and D. Ruta. "Computer assisted customer churn management: State-of-the-art and future trends." Computers & Operations Research, vol. 34, no. 10, pp. 2902-2917, 2007.
[8] A. K. Meher, J. Wilson and R. Prashanth. "Towards a Large Scale Practical Churn Model for Prepaid Mobile Markets." Springer, Cham, 2017, pp. 93-106.
[9] M. Ballings, D. Van den Poel and E. Verhagen. "Improving Customer Churn Prediction by Data Augmentation Using Pictorial Stimulus-Choice Data." Springer, Berlin, Heidelberg, 2012, pp. 217-226.

Text mining MEDLINE to support public health

João Pita Costa, Luka Stopar, Flavio Fuart, Marko Grobelnik - Jožef Stefan Institute and Quintelligence, Ljubljana, Slovenia
Raghu Santanam, Chenlu Sun - Arizona State University, USA
Paul Carlin - South Eastern Health and Social Care Trust, UK
Michaela Black, Jonathan Wallace - Ulster University, UK

ABSTRACT
Today's society is data rich and information driven, with access to numerous data sources available that have the potential to provide new insights into areas such as disease prevention, personalised medicine and data-driven policy decisions. This paper describes and demonstrates the use of text mining tools developed to support public health institutions to complement their data with other accessible open data sources, optimize analysis and gain insight when examining policy. In particular we focus on the exploration of MEDLINE, the biggest structured open dataset of biomedical knowledge. In MEDLINE we utilize its terminology for indexing and cataloguing biomedical information - MeSH - to maximize the efficacy of the dataset.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous
General Terms
Measurement, Performance, Health.

Keywords
Big Data, Public Health, Healthcare, Text Mining, Machine Learning, MEDLINE, MeSH Headings.

1. MEANINGFUL BIG DATA TOOLS TO SUPPORT PUBLIC HEALTH
The Meaningful Integration of Data, Analytics and Service [MIDAS], Horizon 2020 (H2020) project [1] is developing a big data platform that facilitates the utilisation of healthcare data beyond existing isolated systems, making that data amenable to enrichment with open and social data. This solution aims to enable evidence-based health policy decision-making, leading to significant improvements in healthcare and quality of life for all citizens. Policy makers will have the capability to perform data-driven evaluations of the efficiency and effectiveness of proposed policies in terms of expenditure, delivery, wellbeing, and health and socio-economic inequalities, thus improving current policy risk stratification, formulation, implementation and evaluation. MIDAS enables the integration of heterogeneous data sources, and provides privacy-preserving analytics, forecasting tools and visualisation modules of actionable information (see the dashboard of the first prototype in Figure 1). The integration of open data is fundamental to the participatory nature of the project and its core ideology, that heterogeneity brings insight and value to analysis. This will democratize, to some extent, the contribution to the results of MIDAS. Moreover, it enables the MIDAS user to profit from the often powerful information that exists in these open datasets. A case in point is MEDLINE, the scientific biomedical knowledge base, made publicly available through PubMed. The set of tools described in this demonstration paper focuses on this large open dataset, and the exploration of its structured data.

Figure 1. MIDAS platform dashboard, composed of visualisation modules customized to the public health data sourced in each governmental institution, and combined with open data.

2. THE MEDLINE BIOMEDICAL OPEN DATASET AND ITS CHALLENGES
2.1. MEDLINE DATASET
With the accelerating use of big data, and the analytics and visualization of this information being used to positively affect the daily life of people worldwide, health professionals require ever more efficient and effective technologies to bring added value to the information outputs when planning and delivering care. The day-to-day growth of online knowledge requires that high-quality information sources are complete and accessible. A particular example of this is the PubMed system, which allows access to the state of the art in medical research. This tool is frequently used to gain an overview of a certain topic using several filters, tags and advanced search options. PubMed has been freely available since 1997, providing access to references and abstracts on life sciences and biomedical topics. MEDLINE is the underlying open database [7], maintained by the United States National Library of Medicine (NLM) at the National Institutes of Health (NIH). It includes citations from more than 5,200 journals worldwide in approximately 40 languages (about 60 languages in older journals). It stores structured information on more than 27 million records dating from 1946 to the present. About 500,000 new records are added each year. 17.2 million of these records are listed with their abstracts, and 16.9 million articles have links to full text, of which 5.9 million articles have full text available for free online use. In particular, it includes 443,218 full-text articles with the key-word string "public health".
2.2. MEDLINE STRUCTURE
The MEDLINE dataset includes a comprehensive controlled vocabulary - the Medical Subject Headings (MeSH) - that delivers a functional system for indexing journal articles and books in the life sciences. It has proven very useful in the search of specific topics in medical research, which is particularly valuable for researchers conducting initial literature reviews before engaging in particular research tasks. Humans annotate most of the articles in MEDLINE with MeSH Heading descriptors. These descriptors permit the user to explore a certain biomedical topic, relying on curated information made available by the NIH. MeSH is composed of 16 major categories (covering anatomical terms, diseases, drugs, etc.) that further subdivide from the most general to the most specific in up to 13 hierarchical depth levels.

2.3. MEDLINE INDEX
This paper demonstrates the interactive data visualisation and text-mining tools that enable the user to extract meaningful information from MEDLINE. To do that we are using the underlying ontology-like structure MeSH. MEDLINE data, together with the MeSH annotation, is indexed with ElasticSearch and made available to data analytics and visualisation tools. This will be discussed in more detail in the next section.

The manipulation and visualization of such a complete data source brings challenges, particularly in the efficient search, review and presentation when choosing appropriate scientific knowledge. The manipulation and visualisation of complex text data is an important step in extracting meaningful information from a dataset such as MEDLINE. Although powerful, the online search engine provided by the NLM does not provide suitable tools for in-depth analysis and the emergence of scientific information. As one of the main goals of MIDAS is to experiment with advanced visualisation techniques in support of public health policy making, a suitable MIDAS PubMed repository had to be developed. This repository has to allow exploration of a wide range of different visualisation techniques in order to evaluate their applicability to policy-making tasks within the policy cycle. Therefore, there was a need for the selection of a powerful, semi-structured text index that would allow free-text searches, but also allow the creation of complex queries based on available metadata. An obvious choice is elasticSearch, which combines features provided by NoSQL databases with standard full-text indexes, as it is based on the Apache Lucene index. The main design challenge when choosing this particular toolset was that querying based on arrays or parent-child relations is not supported, meaning that for complex use-cases different indexes based on the same source dataset have to be created. Nevertheless, excellent results, particularly with regard to performance, have been obtained.
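To make the kind of query this index supports more concrete, the sketch below sends a full-text search with a metadata filter and a simple aggregation to Elasticsearch over its standard _search REST endpoint. The index name "medline", the field names and the local URL are assumptions made for illustration; they are not the actual MIDAS index layout.

import requests

ES_URL = "http://localhost:9200"   # assumed local Elasticsearch instance

query = {
    "size": 10,
    "query": {
        "bool": {
            "must": [{"match": {"abstract": "childhood obesity"}}],
            "filter": [{"range": {"publication_year": {"gte": 2010, "lte": 2017}}}],
        }
    },
    # Count matching citations per MeSH heading (field name is illustrative).
    "aggs": {"by_mesh": {"terms": {"field": "mesh_headings.keyword", "size": 20}}},
}

response = requests.post(f"{ES_URL}/medline/_search", json=query).json()
for hit in response["hits"]["hits"]:
    print(hit["_source"].get("title", ""))
for bucket in response["aggregations"]["by_mesh"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])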
2.4. MEDLINE DASHBOARD
One of the identified needs motivating this work is assuring the availability of a dynamic dashboard that permits the user to explore data visualisation modules, representing queries to the MEDLINE dataset through pie charts, bar charts, etc. [5]. The dashboard that we made available (Figure 2) feeds on that dataset through the elasticSearch index discussed earlier. It is composed of several interactive visualisation modules that use mouse hover for interaction and provide information through pop-up messages on several aspects of the data based on particular queries of interest (e.g. a pie chart representing the "public health" citations that talk about "childhood obesity" during a selected period of time; or a bar chart showing different concepts included in the articles related to "mental health" in Finnish scientific journals). The MEDLINE dataset is mostly in the English language, but includes a significant volume of translated abstracts of scientific articles that were written in several other languages.

The open source data visualisation tool Kibana is a plugin to elasticSearch that supports the described dashboard. Thus, it is the data visualisation dashboard of choice for elasticSearch-based indexes, such as the one we present here. It is used in the context of MIDAS for fast prototyping and support of part of the MIDAS use-cases. While the dashboard itself serves the less technical user to explore the data available (over a subset of the data generated by a topic of interest), other options are available that permit more control of the data by the data scientists at a more operational level. These are: (i) the management dashboard, where the technical user can perform the appropriate subsampling based on the topics of interest and the required advanced options over the available data features; and (ii) the visual modules creator, permitting the technical user to easily create new interactive visualisation modules. Moreover, this tool enables one to query large datasets and produce different types of visualisation modules that can be later integrated into customized dashboards. The flexibility of such dashboards permits the user to profit from data visualisations that feed on his/her preferences, previously set up as filters to the dataset.

The MIDAS data visualisation tools permit the user to explore the potential of the MEDLINE dataset, based on pie charts and other representations that are easy to comprehend, interact with, and to communicate. They also enable a public instance based on a particular query to the dataset, which includes different types of data visualisation modules that can later integrate a customised dashboard, designed in agreement with the workflows and preferences of the end-user. This live dashboard can easily be integrated through an iframe in any website, not showing the customization settings but maintaining the interaction capability and the real-time update. It provides a complete base solution to further explore the MEDLINE index and the associated dataset [6].

Figure 2. MEDLINE data visualisation tool enabling exploration of that open dataset in its full potential, based on data representations easy to understand and to communicate. It provides an interactive public instance that can be managed at the dashboard management tool (below), for which the visualisation modules are constructed (in the center) based on the queries made to the MEDLINE dataset (above).
3. MeSH CLASSIFIER
This rich data structure in the MEDLINE open set is annotated by human hand (although assisted by semi-automated NIH tools) and is therefore not available in the most recent citations. However, in the context of MIDAS we made available an automated classifier based on [2] that is able to suggest the categories of any health-related free text. It learns over the part of the MEDLINE dataset that is already annotated with MeSH, and is able to suggest categories for submitted text snippets. These snippets can be abstracts that do not yet include MeSH classification, medical summary records or even health-related news articles. To do that we use a nearest centroid classifier [3] constructed from the abstracts from the MEDLINE dataset and their associated MeSH headings. Each document is embedded in a vector space as a feature vector of TF-IDF weights. For each category, a centroid is computed by averaging the embeddings of all the documents in that category. For higher levels of the MeSH structure, we also include all the documents from descendant nodes when computing the centroid. To classify a document, the classifier first computes its embedding and then assigns the document to one or more categories whose centroids are most similar to the document's embedding. We measure the similarity as the cosine of the angle between the embeddings.

Preliminary analysis shows promising results. For instance, when classifying the first paragraph of the Wikipedia page for "childhood obesity", excluding the keyword "childhood obesity" from the text, the classifier returns the following MeSH headings:

"Diseases/Pathological Conditions, Signs and Symptoms/Signs and Symptoms/Body Weight/Overweight",
"Diseases/Pathological Conditions, Signs and Symptoms/Signs and Symptoms/Body Weight/Overweight/Obesity".

The demonstrator version of the MeSH classifier is already available through a web app, as well as through a REST API using a POST call, and is at the moment under qualitative evaluation. This is being done together with health professionals with years of practical experience in using MeSH themselves through PubMed. In addition, we aim to further explore the potential of the developed classifier in several public health related contexts, including non-classified scientific articles of three types: (i) review articles; (ii) clinical studies; and (iii) standard medical articles. The potential impact of this technology will also include electronic health records and the monitoring of health-related news sources. We believe that this approach will address an identified recurrent need of health departments to enhance the biomedical knowledge, and motivate a step change in health monitoring.

Figure 3. A screenshot of the web app of the MEDLINE classifier, when requesting the automated MeSH annotation of a scientific review article abstract extracted from PubMed (in the body of text above), and the results as MeSH headings descriptors, including their tree path in the MeSH ontology-like structure (center), their similarity measure (right) and their positioning in the classification (left).
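The following is a minimal sketch of the nearest-centroid scheme described above (TF-IDF vectors, per-category centroids, cosine similarity), using scikit-learn and NumPy. The tiny in-line corpus and the category names are made up for illustration; the real classifier is trained on MeSH-annotated MEDLINE abstracts.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy training data: (abstract text, MeSH-like category) pairs.
docs = ["obesity in children and weight gain",
        "weight loss and body mass index in adults",
        "influenza vaccination coverage in primary care",
        "seasonal flu vaccine effectiveness"]
labels = ["Body Weight/Overweight", "Body Weight/Overweight",
          "Immunization", "Immunization"]

vectorizer = TfidfVectorizer()
X = normalize(vectorizer.fit_transform(docs).toarray())   # unit-length TF-IDF rows

# One centroid per category: the average of its (normalized) document vectors.
categories = sorted(set(labels))
centroids = normalize(np.vstack(
    [X[[i for i, l in enumerate(labels) if l == c]].mean(axis=0) for c in categories]))

def classify(text: str, top_k: int = 2):
    v = normalize(vectorizer.transform([text]).toarray())
    sims = (centroids @ v.T).ravel()          # cosine similarity to each centroid
    order = np.argsort(sims)[::-1][:top_k]
    return [(categories[i], float(sims[i])) for i in order]

print(classify("childhood weight problems and diet"))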
4. MEDLINE SEARCHPOINT
The efficient visualisation of complex data is today an important step in obtaining the research questions that describe the problem extracted from that data. The MEDLINE SearchPoint is an interactive exploratory tool refocused from the open source technology SearchPoint [8] (available at searchpoint.ijs.si) to support health professionals in the search for appropriate biomedical knowledge. It exhibits the clustered keywords of a query, after searching for a topic. When we use indexing services (such as standard search engines) to search for information across a huge amount of text documents - the MEDLINE index described in Section 2 being an example - we usually receive the answer as a list sorted by a relevance criterion defined by the search engine. The answer we get is biased by definition. Even by refining the query further, a time-consuming process, we can never be confident about the quality of the result. This interactive visual tool helps us in identifying the information we are looking for by reordering the positioning of the search results according to subtopics extracted from the results of the original search by the user.

For example, when we enter the search term 'childhood obesity', the system performs an elasticSearch search over the MEDLINE dataset and extracts groups of keywords that best describe different subgroups of results (these are the most relevant, not the most frequent, terms). This tool gives us an overview of the content of the retrieved documents (e.g. we see groups of results about prevention, pregnancy, treatments, etc.) represented by: (i) a numbered list of 10 MEDLINE articles with a short description extracted from the first part of the abstract; (ii) a word-cloud representing the k-means clusters of topics in the articles that include the searched keywords; (iii) a pointer that can be moved through the word-cloud and that will change the priority of the listed articles.

The word-cloud in (ii) is built by taking a set of MEDLINE documents S and transforming them into vectors using TF-IDF, where each dimension represents the "frequency" of one particular word. For example, let's say that we have document D1: "psoriasis is bad" and document D2: "psoriasis is good". These could be transformed as D1 = (1, 1, 1, 0) and D2 = (1, 1, 0, 1). The documents are then clustered into k groups G1, G2, ..., Gk using the k-means algorithm. For each group we compute the "average" document (centroid), which is the representative of the group. The most frequent words in the "average" document are drawn in the word cloud - the central grey word cloud is the "average" of all the documents in S. We can calculate how similar a particular document d is to a group Gi by calculating the cosine of the angle between the vector representation of d and the "average" document (centroid) of the group Gi.

By dragging the red SearchPoint ball over the word-groups, we provide the relevance criteria to the search result, thus bringing to the top the articles we are most interested in (see Figure 4). When the ball is moved, we calculate, for each document, the similarity to each of the word-groups and combine it with the distance between the ball and the group. The result is used as the ranking weight, where the document with the highest cumulative weight is ranked first. When hovering the mouse over the word-clouds we get a combination of the most frequent words shown in the tag clouds, which change based on how close the ball is to a particular group. After moving the SearchPoint over the word cloud highlighting "primary care", a qualitative study in primary care on childhood obesity that occupied position 188 is now in the first position. The user can read its title and the first lines of the abstract, and when clicking on it, the system opens the article in the browser at its PubMed URL location.

Figure 4. A screenshot of the MEDLINE SearchPoint, with groups of keywords (on the right) extracted from the search results, represented by different colors, and on the left the re-indexed search results themselves, with the position at which they appear in the original index [6].
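A compact sketch of the mechanism just described - TF-IDF vectors, k-means topic groups, and a ranking weight that mixes document-to-group similarity with the distance of the draggable pointer ("ball") to each group. The combination formula and the 2-D group positions are illustrative assumptions; the actual SearchPoint weighting is described in [8].

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

docs = ["childhood obesity prevention in schools",
        "obesity and pregnancy outcomes",
        "primary care interventions for obese children",
        "treatment of obesity with lifestyle changes"]

X = normalize(TfidfVectorizer().fit_transform(docs).toarray())
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
centroids = normalize(km.cluster_centers_)

# Cosine similarity of every document to every topic group.
doc_group_sim = X @ centroids.T                     # shape (n_docs, k)

# Hypothetical 2-D screen positions of the groups and of the draggable ball.
group_xy = np.array([[0.2, 0.8], [0.8, 0.2]])
ball_xy = np.array([0.75, 0.25])                    # user dragged it near group 1

# Closer groups get more influence on the ranking (simple inverse-distance mix).
ball_weight = 1.0 / (np.linalg.norm(group_xy - ball_xy, axis=1) + 1e-6)
ball_weight /= ball_weight.sum()

ranking_weight = doc_group_sim @ ball_weight        # cumulative weight per document
for i in np.argsort(ranking_weight)[::-1]:
    print(f"{ranking_weight[i]:.3f}  {docs[i]}")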
5. CONCLUSION AND FUTURE WORK
To further extend the usability of the MEDLINE SearchPoint, we are developing a data visualisation tool that permits the user to plot the top results most related to a topic of interest, as explored with SearchPoint. Based on the choice of a time window and a certain topic, such as "mental health", the user is able to view the clustered MEDLINE documents, roll over the plot or click to view the plotted points, each of which represents an article in PubMed. This will be done through multidimensional scaling, plotting the articles in the subsample using cosine text similarity. The difficulty of plotting large datasets using these methods, and the lack of potential in the outcomes of that heavy computation, led the team to plot only the first hundred results of the explorations done within MEDLINE SearchPoint. With this extended tool the healthcare professional will be able to: (i) explore a certain area of research with the aim of a more accessible scientific review, identifying the evidence base for a medical study or a diagnostic decision; (ii) identify areas of dense scientific research corresponding to searchable topics (e.g. the evaluation of the coverage of certain rare diseases that need more biomedical research, or the identification of alternative research paths to overpopulated but inefficient research); and (iii) explore the research topic over time windows that enable filtering to avoid known unreliable results.

In line with this work we have been developing research to contribute to the smart automation of the production of biomedical review articles. This collaborative research, led by Raghu T. Santanam at Arizona State University, aims to provide a wide knowledge over a restricted topic across the wider knowledge available in MEDLINE. We utilize the deep learning algorithm Doc2vec [4] to create similarity measures between articles in our corpus. For that we built a balanced test dataset and three different representations of the corpus, and compared the performance between them. The implementation currently builds a matrix of similarity scores for each article in the corpus. In the next steps, we will compare the similarity of documents from our implementation against the baseline for a randomly chosen set of articles in the corpus.

The further development of the MeSH classifier will consider the usability feedback of health professionals working in partner institutions, profiting from their years of experience with MEDLINE and MeSH itself, to tune the system to ensure the best usability in the domain. Furthermore, we will use the outcomes of the final version of this classifier to label health-related news with the MeSH Headings descriptors, enabling a new approach to the processing and monitoring of population health, population awareness of certain diseases, and the general public acceptance of public health decisions through news.

ACKNOWLEDGMENTS
We thank the support of the European Commission on the H2020 project MIDAS (G.A. nr. 727721).

REFERENCES
[1] B. Cleland et al. (2018). Insights into Antidepressant Prescribing Using Open Health Data. Big Data Research, doi.org/10.1016/j.bdr.2018.02.002
[2] L. Henderson (2009). Automated text classification in the dmoz hierarchy. TR.
[3] C. Manning et al. (2008). Introduction to Information Retrieval. Cambridge Univ. Press, pp. 269-273.
[4] T. Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
[5] J. Pita Costa et al. (2017). Text mining open datasets to support public health. In Conf. Proceedings of WITS 2017.
[6] J. Pita Costa et al. (2018). MIDAS MEDLINE Toolset Demo. http://midas.quintelligence.com (accessed 28-8-2018).
[7] F. B. Rogers (1963). Medical subject headings. Bull Med Libr Assoc. 51: 114-6.
[8] L. Stopar, B. Fortuna and M. Grobelnik (2012). Newssearch: Search and dynamic re-ranking over news corpora. In Conf. Proceedings of SiKDD 2012.
Crop classification using PerceptiveSentinel

Filip Koprivec, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia, filip.koprivec@ijs.si
Matej Čerin, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia, matej.cerin@ijs.si
Klemen Kenda, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia, klemen.kenda@ijs.si

ABSTRACT
Efficient and accurate classification of land cover and land usage can be utilized in many different ways, ranging from natural resource management and agriculture support to legal and economic process support. In this article, we present an implementation of land cover classification using the PerceptiveSentinel platform. Apart from using the 13 base bands, only minor feature engineering was performed, and different classification methods were explored. We report F1 and accuracy scores (80-90%) in the range of the state of the art for pixel-wise classification, and even comparable to time-series based land cover classification.

Keywords
remote sensing, earth observation, machine learning, classification

1. INTRODUCTION
Specific aspects of earth observation (EO) data (huge amount of data, widespread usage, many different problem settings etc.), coupled with the recent launch of the ESA Sentinel mission that provides a huge volume of data relatively frequently (every 5-10 days), present an environment that is suitable for current machine learning approaches.

Efficient and accurate land cover classification can provide an important tool for coping with current climate change trends. Classification of crops, their location and potentially their yield prediction provide various interested parties with information on crop resistance and adaptation to changes in temperature and water levels. Along with direct help, accurate crop classification tools can be used in a variety of other programs, from government-based subsidies to various insurance schemes.

Along with the previously mentioned promising features of EO data, data acquisition and processing pose some specific challenges. Satellite-acquired data is highly prone to missing data for various reasons: mostly cloud coverage, (cloud) shadows, and atmospheric refraction due to changes in atmospheric conditions. Additionally, correct training data, either for classification or regression, is hard to come by, must be relatively recent, and must be regularly updated due to changes in land use. Furthermore, correct labels and crop values are almost impossible to verify and are usually self-reported, which often means that the quality of training data is not perfect. Valero et al. [13] raise the problem of incorrect (or partially correct) data labels, which will become apparent when interpreting results.

Another class of problems is posed by the spatial resolution of images. Since satellite images provided by the ESA Sentinel-2 mission have a resolution of 10 m × 10 m on the most granular bands and 60 m × 60 m on bands used for atmospheric correction, land cover irregularities falling in this order of magnitude might not be detected and correctly learned on. This problem is especially prevalent in smaller and more diverse regions, where individual fields are smaller and land usage is more fragmented.

The current state of the art in land classification focuses heavily on the temporal dimension of acquired data [1], [13], [14]. Time-based analysis offers clear advantages since it considers the growth cycles of sample crops, enables continuous classification etc., and generally produces better results, with reported F1 scores for crop/no-crop classification in a range from 0.80-0.93 [14]. One major drawback of time-based classification is the problem of missing data. In our test drive scenario, 70% of images are heavily obscured by clouds [9], a problem which removes a lot of the advantages of time-based classification and demands major compensation with missing-data imputation.

In this paper, we present a possible approach to land cover classification on a single-time image acquired using the PerceptiveSentinel platform (http://www.perceptivesentinel.eu/), using multiple classification methods for tulip field classification in Den Helder, Netherlands.

2. PERCEPTIVESENTINEL PLATFORM
2.1 Data
Data used in this article is provided by the ESA Sentinel-2 mission. The Sentinel-2 mission comprises a constellation of two polar-orbiting satellites placed in the same orbit, phased at 180° to each other [2]. Sentinel-2A was launched on 23rd June 2015, while the second satellite was launched on 7th March 2017. The revisit time for the equator is 10 days for each satellite, so since the launch of the second satellite, each data point is sampled at least every 5 days (a bit more frequently away from the equator). Each satellite collects data in 13 different wavelength bands, presented in Figure 1, with varying granularity.
Classification of crops, their location and potentially ods for tulip field classification in Den Helder, Netherlands. their yield prediction provide various interested parties with information on crop resistance, adapting to changes in tem- perature and water level changes. Along with direct help, 2. PERCEPTIVESENTINEL PLATFORM accurate crop classification tools can be used in a variety of 2.1 Data other programs, from government based subsidies to various Data used in this article is provided by ESA Sentinel-2 mis- insurance schemes. sion. The Sentinel-2 mission comprises a constellation of two polar-orbiting satellites placed in the same orbit, phased at Along with previously highly promising features of EO data, 180◦ to each other [2]. Sentinel-2A was launched on 23rd data acquisition and processing pose some specific challenges. June 2015, while the second satellite was launched on 7th Satellite acquired data is highly prone to missing data due March 2017. Revisit time for equator is 10 days for each to various reasons; mostly cloud coverage, (cloud) shadows, satellite, so since the launch of the second satellite, each atmospheric refraction due to changes in atmospheric con- data point is sampled at least every 5 days (a bit more fre- ditions. Additionally, correct training data, either for clas- quently when away from the equator). sification or regression, is hard to come by, must be rela- tively recent, and regularly updated due to changes in land Each satellite collects data in 13 different wavelength bands use. Furthermore, correct labels and crop values are almost presented in figure 1, with varying granularity. Data ob- impossible to verify and usually self-reported, which often tained for each pixel is firstly preprocessed by ESA where means that quality of training data is not perfect. Valero et al. [13] raise the problem of incorrect (or partially correct) 1http://www.perceptivesentinel.eu/ 37 atmospheric reflectance and earth surface shadows are cor- 3. METHODOLOGY rected [4]. 3.1 Sample Data For purpose of this article, a sample patch of fields in Den Helder, Netherlands, with coordinates: (4.7104, 52.8991), (4.7983, 52.9521) was analyzed. Three different datasets were considered: tulip fields in year 2016 and 2017 and arable land in 2017. For each of these datasets, the first ob- served date with no detected clouds was selected and binary classification (tulips vs no-tulips and arable vs non-arable land) was performed on the image from that date. The date selection was based on the fact that tulips’ blooms are most apparent during late April and beginning of May [9]. 3.2 Feature Vectors Three additional earth observation indices that were used as features are presented in Table 1 as suggested by [8]. Figure 1: Sentinel 2 spectral bands [12] Name Formula 2.2 Data Acquisition Satellites provide around 1TB of raw data per day, which B08 − B04 NDVI is provided for free on Amazon AWS. Images are then pro- B08 + B04 cessed and indexed by Sinergise and subsequently provided for free along with their SentinelHub [11] library. As part 2.5(B08 − B04) EVI of the PerceptiveSentinel project, a sample platform was de- (B08 + 6B04 − 7.5B02 + 1) veloped on top of SH library, which eases data acquisition, cloud detection and further data analysis on acquired data. B08 − B04 SAVI (1 + 0.5) B08 + B04 + 0.5 The whole dataset currently consists of images captured from the end of June 2015 till August 2018. Images are avail- able for use in a few hours after being taken. 
3.3 Experiment
The experiment was conducted in the Den Helder region to assess the effectiveness of the classification and the improvement with additional features. The same region is also used as a test drive location for the PerceptiveSentinel project. One important characteristic to keep in mind is the fact that the classification classes are not uniformly distributed. Tulip fields constitute 17.1% and 17.7% of all pixels in the years 2016 and 2017 respectively, while arable land accounts for 64% of pixels in the 2017 data set. Care must therefore be taken when assessing the predictive power of a model.

For each dataset, multiple classification algorithms were tested on the base band feature vectors and on feature vectors enriched with the calculated indices from Table 1. Experiments were carried out using the Python library scikit-learn, and default parameters were used for each type of classifier. For each data set and each classifier (Ada Boost, Logistic regression, Random Forest, Multilayer perceptron, Gradient Boosting, Nearest neighbors and Naive Bayes), 3-fold cross-validation was performed. Folds were generated on a non-shuffled dataset with balanced class ratios.
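A condensed sketch of this experimental loop with scikit-learn (per-pixel feature matrix X, binary labels y). The stratified, non-shuffled 3-fold splits and the metric set follow the description above, while the placeholder data and variable names are of course illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB

# Placeholder pixel table: rows are pixels, columns are bands (+ optional indices).
rng = np.random.default_rng(0)
X = rng.random((5000, 16))
y = rng.random(5000) < 0.17          # ~17% positive class, as for tulip fields

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Naive Bayes": GaussianNB(),
}

cv = StratifiedKFold(n_splits=3, shuffle=False)   # balanced, non-shuffled folds
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=cv,
                            scoring=["precision", "recall", "accuracy", "f1"])
    print(name,
          round(scores["test_precision"].mean(), 3),
          round(scores["test_recall"].mean(), 3),
          round(scores["test_accuracy"].mean(), 3),
          round(scores["test_f1"].mean(), 3))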
4. RESULTS
The results of the selected classifiers are presented in Tables 2-4 (the Ind column indicates whether the additional indices were used as features) and are comparable with results from related works [5], [6], which report accuracy results from 60-80%, although our experimental dataset was quite small and homogeneous, which might offer some advantage over larger plots of land.

Table 2: Tulip fields in 2016 results
Alg.                  Ind   Prec    Rec     Acc     F1      T
Logistic Regression   No    0.895   0.551   0.912   0.682   2.8
                      Yes   0.877   0.564   0.912   0.686   3.6
Decision Tree         No    0.640   0.697   0.881   0.667   7.9
                      Yes   0.629   0.698   0.878   0.662   11.3
Random Forest         No    0.870   0.675   0.927   0.760   15.0
                      Yes   0.867   0.680   0.927   0.762   21.7
ML Perceptron         No    0.875   0.720   0.935   0.790   184.2
                      Yes   0.835   0.740   0.931   0.784   241.3
Gradient Boosting     No    0.878   0.657   0.926   0.751   84.8
                      Yes   0.856   0.657   0.923   0.743   120.6
Naive Bayes           No    0.343   0.800   0.704   0.480   0.4
                      Yes   0.316   0.808   0.669   0.454   0.6

Table 3: Tulip fields in 2017 results
Alg.                  Ind   Prec    Rec     Acc     F1      T
Logistic Regression   No    0.514   0.561   0.829   0.537   2.8
                      Yes   0.545   0.615   0.841   0.578   4.0
Decision Tree         No    0.574   0.633   0.852   0.602   7.3
                      Yes   0.565   0.634   0.849   0.598   11.2
Random Forest         No    0.786   0.599   0.900   0.680   13.8
                      Yes   0.788   0.604   0.901   0.683   20.5
ML Perceptron         No    0.790   0.673   0.911   0.727   375.9
                      Yes   0.780   0.693   0.911   0.734   419.8
Gradient Boosting     No    0.786   0.613   0.902   0.689   84.4
                      Yes   0.785   0.614   0.902   0.689   120.3
Naive Bayes           No    0.330   0.861   0.666   0.477   0.4
                      Yes   0.318   0.858   0.649   0.464   0.6

Table 4: Arable land in 2017 results
Alg.                  Ind   Prec    Rec     Acc     F1      T
Logistic Regression   No    0.853   0.913   0.843   0.882   2.7
                      Yes   0.854   0.908   0.841   0.880   3.2
Decision Tree         No    0.878   0.868   0.837   0.873   9.6
                      Yes   0.885   0.868   0.842   0.876   14.5
Random Forest         No    0.928   0.889   0.884   0.908   17.3
                      Yes   0.934   0.891   0.889   0.912   26.3
ML Perceptron         No    0.929   0.932   0.911   0.931   522.4
                      Yes   0.926   0.940   0.913   0.933   586.2
Gradient Boosting     No    0.899   0.921   0.883   0.910   82.6
                      Yes   0.905   0.926   0.890   0.915   118.4
Naive Bayes           No    0.823   0.830   0.776   0.827   0.4
                      Yes   0.814   0.806   0.757   0.810   0.6

For each test, precision, recall, accuracy, and F1 score were reported, along with the timing of the whole process. As seen from the tables, the multilayer perceptron achieved the best results when comparing the F1 score across all data sets, but its training was considerably slower than all other classification methods (in fact, its training time was comparable to all other classification times combined). The multilayer perceptron was followed closely by random forest, which achieved just marginally worse results, but was far less expensive to train and evaluate, while still retaining a score that was higher than or comparable with related works.

Adding additional features to the feature vector did not significantly improve the classification score and in some cases even hampered performance, while having a significant impact on the training time.

Using a classifier trained on the 2016 tulip data to predict the data in 2017 yielded an F1 score of 0.57, while a classifier trained on the 2017 data yielded an F1 score of 0.67 on the 2016 data, indicating the robustness of the classifier.

A graphical representation of the classification errors can be seen in Figures 2 and 3, which show true positive (TP) pixels in purple color, false positive (FP) in blue color and false negative (FN) in red. Looking at the images, it can easily be seen that the classification produced quite satisfactory results.

Figure 2: Graphical representation of errors in ML perceptron classification of tulip fields in 2017 (TP in purple, FP in blue, FN in red)
A graphical representation of the classification errors can be seen in Figures 2 and 3, which show true positive (TP) pixels in purple, false positive (FP) pixels in blue and false negative (FN) pixels in red. Looking at the images, it can easily be seen that the classification produced quite satisfactory results.

Figure 2: Graphical representation of errors in ML perceptron classification of tulip fields in 2017 (TP in purple, FP in blue, FN in red)

An important thing to notice when inspecting Figure 2 is that the true positive pixels were classified in densely packed groups with clear and sharp edges, which correspond nicely to field boundaries seen with the naked eye (both random forest and gradient-boosted decision trees produced visually very similar results). This might suggest that the algorithms detected another culture similar to tulips and classified it as tulips (or, conversely, that the ground truth might not be that accurate). A lot of FN pixels can also be spotted on field boundaries, which may correspond to different quality or to the mixing of different plant cultures near field boundaries.

Similarly, observing the results of the arable land classification, one immediately notices small (false positive) blue patches and some red patches. Most notably, a long blue line is spotted in the left part of the image (near the sea), which may indicate some sort of wild culture near the sea that was not included in the original mask. Further manual observation of a misclassified red patch in the middle of arable land suggests that this field is barren (easily seen in Figure 2) and possibly wrongly classified as arable land in the training data.

Figure 3: Graphical representation of errors in ML perceptron classification of arable land in 2017 (TP in purple, FP in blue, FN in red)
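Error maps such as Figures 2 and 3 can be produced directly from a predicted mask and a reference mask. A minimal sketch with hypothetical input files, using the same TP/FP/FN colour coding as described above.

    import numpy as np
    import matplotlib.pyplot as plt

    # y_true, y_pred: binary masks of shape (H, W) for one class (e.g. tulip fields);
    # hypothetical inputs, reshaped from the per-pixel predictions.
    y_true = np.load("mask_true.npy").astype(bool)
    y_pred = np.load("mask_pred.npy").astype(bool)

    rgb = np.zeros(y_true.shape + (3,))
    rgb[y_true & y_pred] = (0.5, 0.0, 0.5)    # true positives in purple
    rgb[~y_true & y_pred] = (0.0, 0.0, 1.0)   # false positives in blue
    rgb[y_true & ~y_pred] = (1.0, 0.0, 0.0)   # false negatives in red

    plt.imshow(rgb)
    plt.axis("off")
    plt.show()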
5. CONCLUSIONS
In our work, we have examined the use of different classification methods and additional features for the land cover classification problem on a single image. Our results are comparable with results from the related literature. We propose that classification strength and adaptability be further improved by considering time series and stream aggregates for each pixel, as researched in [14], [7]. Additionally, pixels might be grouped together into logical objects to enable object (field) level classification, as proposed by [13].

Furthermore, the results have shown that a correct ground-truth mask is essential for good classification performance. As seen from our results, even seemingly correct labels might miss some cultures or classify empty stretches of land as crops.

6. ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the ICT program of the EC under the project PerceptiveSentinel (H2020-EO-776115). The authors would like to thank Sinergise for its contribution to the sentinelhub and cloudless libraries, along with all the help with data analysis.

7. REFERENCES
[1] Belgiu, M., and Csillik, O. Sentinel-2 cropland mapping using pixel-based and object-based time-weighted dynamic time warping analysis. Remote Sensing of Environment 204 (2018), 509-523.
[2] ESA. Satellite constellation / Sentinel-2 / Copernicus / Observing the Earth / Our Activities / ESA. https://www.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Sentinel-2/Satellite_constellation. Accessed 13 August 2018.
[3] ESA. Sentinel-2 - Missions - Resolution and Swath - Sentinel Handbook. https://sentinel.esa.int/web/sentinel/missions/sentinel-2/instrument-payload/resolution-and-swath. Accessed 13 August 2018.
[4] ESA. User Guides - Sentinel-2 MSI - Level-2 Processing - Sentinel Online. https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi/processing-levels/level-2. Accessed 13 August 2018.
[5] Guida-Johnson, B., and Zuleta, G. A. Land-use land-cover change and ecosystem loss in the Espinal ecoregion, Argentina. Agriculture, Ecosystems & Environment 181 (2013), 31-40.
[6] Gutiérrez-Vélez, V. H., and DeFries, R. Annual multi-resolution detection of land cover conversion to oil palm in the Peruvian Amazon. Remote Sensing of Environment 129 (2013), 154-167.
[7] Gómez, C., White, J. C., and Wulder, M. A. Optical remotely sensed time series data for land cover classification: A review. ISPRS Journal of Photogrammetry and Remote Sensing 116 (2016), 55-72.
[8] Jiang, Z., Huete, A. R., Didan, K., and Miura, T. Development of a two-band enhanced vegetation index without a blue band. Remote Sensing of Environment 112, 10 (2008), 3833-3845.
[9] Kenda, K., Kažič, B., Čerin, M., Koprivec, F., Bogataj, M., and Mladenić, D. D4.1 Stream Learning Baseline Document. Reported 30 April 2018.
[10] Sinergise. sentinel-hub/sentinel2-cloud-detector: Sentinel Hub Cloud Detector for Sentinel-2 images in Python. https://github.com/sentinel-hub/sentinel2-cloud-detector. Accessed 14 August 2018.
[11] Sinergise. sentinel-hub/sentinelhub-py: Download and process satellite imagery in Python scripts using Sentinel Hub services. https://github.com/sentinel-hub/sentinelhub-py. Accessed 14 August 2018.
[12] Spaceflight 101. Sentinel-2 Spacecraft Overview. http://spaceflight101.com/copernicus/wp-content/uploads/sites/35/2015/09/8723482_orig-1024x538.jpg. Accessed 14 August 2018.
[13] Valero, S., Morin, D., Inglada, J., Sepulcre, G., Arias, M., Hagolle, O., Dedieu, G., Bontemps, S., Defourny, P., and Koetz, B. Production of a dynamic cropland mask by processing remote sensing image series at high temporal and spatial resolutions. Remote Sensing 8(1) (2016), 55.
[14] Waldner, F., Canto, G. S., and Defourny, P. Automated annual cropland mapping using knowledge-based temporal features. ISPRS Journal of Photogrammetry and Remote Sensing 110 (2015), 1-13.

Towards a semantic repository of data mining and machine learning datasets

Ana Kostovska (Jožef Stefan IPS & Jožef Stefan Institute, Ljubljana, Slovenia, ana.kostovska@ijs.si), Sašo Džeroski (Jožef Stefan Institute & Jožef Stefan IPS, Ljubljana, Slovenia, saso.dzeroski@ijs.si), Panče Panov (Jožef Stefan Institute & Jožef Stefan IPS, Ljubljana, Slovenia, pance.panov@ijs.si)

ABSTRACT
With the exponential growth of data in all areas of our lives, there is an increasing need to develop new approaches for effective data management. Namely, in the field of Data Mining (DM) and Knowledge Discovery in Databases (KDD), scientists often invest a lot of time and resources in collecting data that has already been acquired. In that context, by publishing open and FAIR (Findable, Accessible, Interoperable, Reusable) data, researchers could reuse data that was previously collected, preprocessed and stored. Motivated by this, we conducted an extensive review of current approaches, data repositories and semantic technologies used for annotation, storage and querying of datasets in the domain of machine learning (ML) and data mining. Finally, we identify the limitations of the existing repositories of datasets and propose a design of a semantic data repository that adheres to the FAIR principles for data management and stewardship.

The benefits of publishing FAIR data are manifold. It speeds up the process of knowledge discovery and reduces the consumption of resources. When the FAIR-compliant data at hand does not contain all the information needed, it can easily be integrated with data from external sources and boost the overall KDD performance [12].

Semantic data annotation, being a very powerful technique, is massively used in some domains, e.g., medicine, but it is still in the early phases in the domain of data mining and machine learning. To the best of our knowledge, there are no semantic data repositories that adhere to the FAIR principles. We recognize the ultimate benefits of having one, and we are going in depth into the research covering semantic data annotation, ontology usage, and the storing and querying of data.

2. BACKGROUND AND RELATED WORK
The Semantic Web (Web 3.0) is an extension of the World
1.
INTRODUCTION Wide Web in which information is given semantic meaning, One of the main use of data is in the process of knowledge enabling machines to process that information. The aim of discovery, where scientist employ ML and DM methods and the Semantic Web initiative is to enhance web resources with try to solve various real-life problems from diverse fields, highly structured metadata, known as semantic annotations. from systems biology and medicine, to ecology and enviro- When one resource is semantically annotated, it becomes a nmental sciences. In order to obtain their objectives, they source of information that is easy to interpret, combine and need high-quality data. The quality of the data is crucial to a reuse by the computers [13]. In order to achieve this, the DM project’s success. Ultimately, no level of algorithmic so- Semantic Web uses the concept of Linked Data. Linked data phistication can make up for low-quality data. On the other is build upon standard web technologies [7] including HTTP, hand, progress in science is best achieved by reproducing, RDF, RDFS, URIs, Ontologies, etc. reusing and improving someone else’s work. Unfortunately, datasets are not easily obtained, and even if they are, they For uniquely identifying resources across the whole Linked come with limited reusability and interoprability. Data, each resource is given a Unified Resource Iden- tifier (URI). The resources are then enriched with terms A key-aspect in advancing research is making data open from controlled vocabularies, taxonomies, thesauruses, and and FAIR. FAIR are four principles that have been recen- ontologies. The standard metadata model used for logical tly introduced to support and promote good data manage- organization of data is called Resource Description Fra- ment and stewardship [17]. Data must be easily findable mework (RDF). Its basic unit of information is the triplet (Findability) by both humans and machines. This me- compiled from a subject, a predicate, and an object. These ans data should be semantically annotated with rich meta- three components define the concepts and relations, the bu- data and all the resources must be uniquely identified. The ilding blocks of an ontology. metadata should always be accessible (Accessibility) by standardized communication protocols such as HTTP(S) or In the context of computer science, ontology is “an expli- FTP, even when the data itself is not. Data and metadata cit formal specifications of the concepts and relations among from different data sources can be automatically combined them that can exist in a given domain” [3]. As computational (Interoperabiliy). To do so, the benefits of formal voca- artifacts, they provide the basis for sharing meaning both bularies and ontologies should be exploited. Data and me- at machine and human level. When creating an ontology, tadata is released with provenance details and data usage there are multiple languages to choose from. RDF Shema licence, so that humans and machines know whether data (RDFS) is ontology language with small expressive power. can be replicated and reused or not (Reusabiliy). It provides mechanisms for creating simple taxonomies of 41 concepts and relations. Another commonly used ontology There are numerous repositories of ML datasets available language is the Web Ontology Language (OWL). OWL online. The UCI repository3 is the most popular reposi- supports creation of all ontology components: concepts, in- tory of ML datasets. 
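The subject-predicate-object triples and the dataset annotation described above can be made concrete with a small rdflib sketch; the repository namespace, the dataset URI and the choice of Dublin Core properties are assumptions for the example, not the proposed repository's actual schema.

    from rdflib import Graph, URIRef, Literal, Namespace
    from rdflib.namespace import DC, RDF

    REPO = Namespace("http://example.org/datasets/")   # hypothetical repository namespace

    g = Graph()
    dataset = REPO["iris"]

    # each statement is one (subject, predicate, object) triple
    g.add((dataset, RDF.type, URIRef("http://purl.org/dc/dcmitype/Dataset")))
    g.add((dataset, DC.title, Literal("Iris")))
    g.add((dataset, DC.creator, Literal("R. A. Fisher")))
    g.add((dataset, DC.rights, Literal("CC BY 4.0")))

    print(g.serialize(format="turtle"))

The serialized graph (here in Turtle) is what would later be loaded into a triplestore and exposed for querying.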
Each dataset is annotated with several stances, properties (or relations). Finally, SPARQL1 is descriptors such as dataset characteristics, attribute charac- standard, semantic query language used for querying fast- teristics, associated task, number of instances, number of growing private or public collections of structured data on attributes, missing values, area, etc. Similarly, Kaggle Da- the Web or data stored in RDF format. tasets4, Knowledge Extraction based on Evolutionary Le- arning (KEEL), and Penn Machine Learning Benchmarks There are diiferent technologies for storing data and meta- (PMLB)5 are well-known dataset repository that provide data. The most broadly used are relational databases, users with data querying based on the descriptors attached digital databases based on the relational model of data or- to the datasets. OpenML6 is an open source platform desi- ganized in tables, forming entity-relational model. Another gned with the purpose of easing the collaboration of resear- approach that became popular with the appearance of Big chers within the machine learning community [14]. Resear- Data are NoSQL databases [5], which are flexible databases chers can share datasets, workflows and experiments in such that do not use relational model. Triplestores are specific a way that they can be found and reused by others. When type of NoSQL databases, that store triples instead of rela- the data format of the datasets is supported by the platform, tional data. Triplestores use URIs and can be queried over the datasets are annotated with measurable characteristics trillions of records, which makes them very applicable. [15]. These annotations are saved as textual descriptors and are used for searching through the repository. Data in an information system can reside in different hete- rogeneous data sources, both internal and external to the In contrast to the above mentioned repositories, there are organization. In this setting, the relevant data from the frameworks in other domains that offer advanced techniques diverse sources should be integrated. Accessing disparate for describing, storing and querying datasets. One cutting- data sources has been a difficult challenge for data analysts edge framework in the domain of neuroscience is Neurosci- to achieve in modern information systems, and an active re- ence Information Framework(NIF) [4]. Its core objec- search area. OBDA [1, 11] is much longed-for method that tive is to create a semantic search engine that benefits from addresses this problem. It is a new paradigm, based on a semantic indexes when querying distributed resources by three-level architecture constituted of the ontology, the data keywords. The Gene Ontology Annotation (GOA), sources, and the mappings between the two (see Figure 1). is a database that provides high-quality annotations of ge- With this approach, OBDA provides data structure descrip- nome data [2]. The annotations are based on GO, a voca- tion, as well as semantic description of the concepts in the bulary that defines concepts related to gene functions and domain of interest and roles between them. relation among them. Large part of the annotations are ge- nerated electronically by converting existing knowledge from the data to GO terms. Electronic annotations are associated with high-level ontology terms. The process of generating more specific annotations can hardly be automated with the current technologies, therefore it is done manually. 3. CRITICAL ASSESSMENT Figure 1. 
The OBDA architecture In this section, we conduct critical assessment of the cur- rent research based on the review presented in the previous section. In the context of semantic ML data repository, we group ontologies in three categories, i.e., ontologies for describing Semantic Web technologies. The whole stack of seman- machine learning and data mining, ontologies for provenance tic technologies provide ways of making the content readable information, and domain ontologies. OntoDM ontology de- by machines. The metadata that describes the content can scribes the domain of data mining. It is composed of three be used not only to disregard useless information, but also sub-ontologies: OntoDT [10] - generic ontology for repre- for merging results to provide a more constructed answer. sentation of knowledge about datatypes; OntoDM-core [8] - A major drawback of this process of giving data a semantic ontology of core data mining entities (e.g., data, DM task, meaning is that it is time consuming and requires great amo- generalizations, DM algorithms, implementations of algori- unt of resources, thus people sometimes feel unmotivated to thms, DM software); OntoDM-KDD [9] - ontology for repre- do it. Another point to make is that semantic annotations senting the knowledge discovery process following CRISP- cannot solve the ambiguities of the real world. DM process model. The Data Mining OPtimization Ontology (DMOP) [6] has been designed to support au- Technologies for storing data and metadata. The tomation at various choice points of the DM process, i.e., data in relational databases is stored in a very structured choosing algorithms, models, parameters. The PROV On- way, making them a good choose for applications that relay tology (PROV-O)2 and Dublin Core vocabulary [16] 3 facilitate the discovery of electronic resources by providing a https://archive.ics.uci.edu/ml/ 4 base for describing provenance information about resources. https://www.kaggle.com/datasets 5https://github.com/EpistasisLab/ 1https://www.w3.org/TR/rdf-sparql-query/ penn-ml-benchmarks 2https://www.w3.org/TR/prov-o/ 6https://www.openml.org/ 42 on heavy data analysis. Moreover the referential integrity the approaches and technologies. Each of the proposed ar- guarantees that transactions are processed reliably. While chitectures has positive and negative sides, so there will be relational databases are a suitable choice for some applica- trade-off when choosing one. tions, they have difficulties dealing with large amounts of data. On the other hand, NoSQL databases were designed The common part of the three designs is that DM and ML primarily for big data and can be run on cluster architectu- datasets will be annotated through a semantic annotation res. Non-relational databases store unstructured data, with engine. The semantic query engine will receive SPARQL no logical schema. They are flexible, but this comes with query as input, and it will bring back results in form of set the price of potentially inconsistent data. of RDF triples. There will be SPARQL endpoint through which users can specify the query used as input in the se- Describing data and metadata. OntoDM is an ontology mantic query engine. Another open possibility is to enable that describes the domain of DM, ML and KDD with a great users to query data and metadata by simply writing key- level of detail. Because it covers a wide area, some parts words. Later, the system itself generates SPARQL query would be irrelevant for our application. 
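The semantic querying discussed above ultimately comes down to SPARQL over triples of this kind; a minimal self-contained sketch, in which the namespace, the use of dc:subject to hold the data mining task and the literal values are illustrative assumptions rather than the repository's annotation schema.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DC

    REPO = Namespace("http://example.org/datasets/")   # hypothetical namespace
    g = Graph()
    g.add((REPO["iris"], DC.title, Literal("Iris")))
    g.add((REPO["iris"], DC.subject, Literal("classification")))
    g.add((REPO["wine"], DC.title, Literal("Wine")))
    g.add((REPO["wine"], DC.subject, Literal("regression")))

    # find all datasets annotated with the data mining task "classification"
    query = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        SELECT ?dataset ?title WHERE {
            ?dataset dc:subject "classification" ;
                     dc:title   ?title .
        }
    """
    for row in g.query(query):
        print(row.dataset, row.title)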
DMOP is ontology based on those keywords. The anotation schema used by built with the special use case of optimizing the DM process. the semantic annotation engine will be based on three di- Nevertheless, both of them can be used for describing ML fferent types of ontologies such as ontologies for DM and and DM datasets. DC vocabulary and PROV-O define a ML (e.g., OntoDT, OntoDM-core, Onto-KDD, DMOP), do- wide range of provenance terms, therefore both of them can main ontologies, and ontologies and schemes for describing be employed in the provenance metadata generation. provenance information (e.g., Dublin Core ontology, PROV- O). Part of the annotations will be generated automatically, Repositories of machine learning datasets. The UCI e.g., annotations related to datatypes, while others will be repository offers a wide range of datasets, but they are not semi-automatically because they require concept mapping, available through a uniform format or API. Although it also e.g., annotations based on domain ontologies. provides data descriptors for searching the data, a major setback is that none of the descriptors is based on any vo- We plan to build a web-based user interface that will enable cabulary or ontology, which certainly limits interoperabi- users to search and query both datasets and metadata anno- lity. Kaggle Datasets, KEEL, PMLB also provide similar tations. Users will be given a chance of uploading new data- meta annotations, but they all lack semantic interpretabi- sets in CSV or ARFF fromat. Besides the dataset, users will lity. Another shortcoming of the UCI repository, KEEL and be expected to specify some additional information about it PMLB is that they don’t allow uploading new datasets. All such as data mining task they plan to execute on the data, datasets stored in the OpenML repository can be downlo- domain, provenance information, descriptions of the attri- aded in CSV or ARFF format. The annotation are based butes, etc. Since the whole process of semantic annotation on Exposé ontology, and they can be downloaded in JSON, can’t be automatic, when new dataset is uploaded, it won’t XML or RDF format. A major weakness of this repository be immediately available on the site. First it must be cura- is that annotations are not stored, but they are calculated ted, and only when the complete set of metadata annotati- on-the-fly and can not be used for semantic inference. ons is generated, the metadata will be published online. The dataset itself will be released under clear data usage licence. Frameworks for describing, storing and querying do- main datasets. The NIF framework is very progressive in The three architectural designs differ in the way of storing terms of semantic annotation, storing, and querying. Its ad- the datasets. The metadata annotations will be RDF tri- vantages come from providing domain experts with the abi- plets and they will be stored in triplestore that optimizes lity to contribute to the ontology development, by adding physical storage. Next, we briefly explain the differences new terms through the use of Interlex. It has a powerful between storing the datasets and what are the effects on search engine, and it follows the OBDA paradigm. Hetero- querying. geneous data is stored in its original format. The user defi- ned, keyword query is mapped to ontological terms to find Proposal I. 
The simplest approach of storing a dataset synonyms, and then translated to a query relevant to the in- would be to store it in RDF format in the same triplestore dividual data store. With respect to the genomics domain, as the metadata. The datasets from their original format, GOA database is favourable because of its high-quality an- will be converted to RDF triples. Having only one triplestore notations. Curators put extreme efforts in generating ma- will ease querying, but it will require more storage capacity nual annotations. To speed up the query execution it uses (see Figure 2). the Solr document store. Another superiority of GOA data- Proposal II. The second option is to store the datasets in base is that it provides advanced filtering of the annotations, a relational database and the metadata in RDF triplestore. for downloading customized annotation sets. The deficiency Datasets from CSV or ARFF format will be translated into of NIF and GOA database is that they are not able to query a relational database. Here, querying becomes more compli- and access the annotations in RDF format, which is an emer- cated, for which we will need a federated query engine. A ging standard for representing semantic information federated query engine allows simultaneous search on multi- ple data sources. A user makes a single query request, which 4. PROPOSAL FOR SEMANTIC is distributed across the management systems participating REPOSITORY OF DM/ML DATASETS in the federation and translated to a query written in a lan- guage relevant to the individual system. We will have two In this section, we propose three possible architecture desi- data stores, one for the data itself and one for the metadata. gns of the semantic data repository for the domain of ML For querying the two data stores, we will still use the same and DM. The proposals are based on the critical review of 43 querying of ML and DM datasets. We also examined speci- fic implementations of frameworks in the domain of neuro- science and genomics. Taking into consideration the critical assessment of the current state-of-the-art we will construct semantic data repository for ML and DM datasets. The semantic repository would be utilized for easy access of se- mantically rich annotated datasets and semantic inference. This, will improve the reproducibility and reusability in ML and DM research area. Moreover, annotating the datasets with domain ontologies will facilitate the process of under- standing the analyzed data. As of now, we have three pro- posed architectural designs for the semantic data repository Figure 2. Architectural design I that differ in the way of storing the datasets. We will either store both data and metadata in a triplestore, or we will have multiple data stores which will require usage of tools RDF query language, SPARQL. In order to query the rela- and methods from the ontology based data access paradigm. tional database with SPARQL, it will be mapped to virtual RDF graph (see Figure 3). Acknowledgements The authors would like to acknowledge the support of the Slovenian Research Agency through the projects J2-9230, N2-0056 and L2-7509 and the Public Scholarship, Development, Disability and Maintenance Fund of the Republic of Slovenia through its scholarship program. 6. REFERENCES [1] Mihaela A Bornea et al. Building an efficient rdf store over a relational database. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 121–132. ACM, 2013. [2] Gene Ontology Consortium. 
Gene ontology consortium: going forward. Nucleic acids research, 43(D1):D1049–D1056, 2014. [3] Thomas R Gruber. Toward principles for the design of ontologies used for knowledge sharing? International journal of human-computer studies, 43(5-6):907–928, 1995. [4] Amarnath Gupta et al. Federated access to heterogeneous information resources in the neuroscience information framework (nif). Neuroinformatics, 6(3):205–217, 2008. [5] Jing Han et al. Survey on nosql database. In Pervasive Figure 3. Architectural design II computing and applications (ICPCA), 2011 6th international conference on, pages 363–366. IEEE, 2011. [6] C Maria Keet et al. The data mining optimization ontology. Web Semantics: Science, Services and Agents on the World Proposal III. Instead of mapping the relational database Wide Web, 32:43–53, 2015. to virtual RDF graph, we can use the OBDA methodology [7] Brian Matthews. Semantic web technologies. E-learning, 6(6):8, 2005. and federated querying to use a combination of SQL que- [8] Panče et al. Panov. Ontology of core data mining entities. Data ries and SPARQL queries. Metadata will be queried with Mining and Knowledge Discovery, 28(5-6):1222–1265, 2014. SPARQL queries, but for the datasets, they will be mapped [9] Panče Panov et al. Ontodm-kdd: ontology for representing the to SQL queries. The integrated results are brought back to knowledge discovery process. In International Conference on Discovery Science, pages 126–140. Springer, 2013. the user (see Figure 4). [10] Panče Panov et al. Generic ontology of datatypes. Information Sciences, 329:900–920, 2016. [11] Antonella Poggi et al. Linking data to ontologies. In Journal on data semantics X, pages 133–173. Springer, 2008. [12] Petar Ristoski and Heiko Paulheim. Semantic web in data mining and knowledge discovery: A comprehensive survey. Web semantics: science, services and agents on the World Wide Web, 36:1–22, 2016. [13] Gerd Stumme et al. Semantic web mining: State of the art and future directions. Web semantics: Science, services and agents on the world wide web, 4(2):124–143, 2006. [14] Jan N Van Rijn et al. Openml: A collaborative science platform. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 645–649. Springer, 2013. [15] Joaquin Vanschoren et al. Taking machine learning research online with openml. In Proceedings of the 4th International Conference on Big Data, Streams and Heterogeneous Source Figure 4. Architectural design III Mining, pages 1–4. JMLR. org, 2015. [16] Stuart Weibel. The dublin core: a simple content description model for electronic resources. Bulletin of the Association for 5. CONCLUSION Information Science and Technology, 24(1):9–11, 1997. [17] Mark D Wilkinson et al. The fair guiding principles for scientific We have conducted a literature overview of research be- data management and stewardship. Scientific data, 3, 2016. 
ing done in the field of semantic annotation, storage, and 44 Towards a semantic store of data mining models and experiments Ilin Tolovski Sašo Džeroski Panče Panov Jožef Stefan International Jožef Stefan Institute & Jožef Jožef Stefan Institute & Jožef Postgraduate School & Jožef Stefan International Stefan International Stefan Institute Postgraduate School Postgraduate School Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia ilin.tolovski@ijs.si saso.dzeroski@ijs.si pance.panov@ijs.si ABSTRACT sible, Interoperable, Reusable) data principles, introduced Semantic annotation provides machine readable structure to by Wilkinson et al. [9]. Implementing these principles for the stored data. We can use this structure to perform seman- the annotation, storing, and querying of data mining models tic querying, based on explicitly and implicitly derived infor- and experiments will provide a solid ground for researchers mation. In this paper, we focus on the approaches in seman- interested in reproducing and reusing the results from the tic annotation, storage and querying in the context of data previous research on which they can build and improve. mining models and experiments. Having semantically anno- tated data mining models and experiments with terms from In the literature, there exist some approaches that address domain ontologies and vocabularies will enable researchers some of these problems. In both ontology engineering and to verify, reproduce, and reuse the produced artefacts and data mining community, there are approaches that aim to- with that improve the current research. Here, we first pro- wards describing the data mining domain, as described in vide an overview of state-of-the-art approaches in the area of Section 2. Furthermore, Vanschoren et al. [5] developed the semantic web, data mining domain ontologies and vocabu- OpenML system, a machine learning experiment database laries, experiment databases, representation of data mining for storing various segments of a machine learning experi- models and experiments, and annotation frameworks. Next, ment such as datasets, flows (algorithms), runs, and com- we critically discuss the presented state-of-the-art. Further- pleted tasks. more, we sketch our proposal for an ontology-based system for semantic annotation, storage, and querying of data min- In other domains, such as life sciences, storing annotated ing models and experiments. Finally, we conclude the paper data about experiments and their results is a common prac- with a summary and future work. tice. This is mostly due to the fact that the experiments are more expensive to conduct, and require specific prepara- tions. From the perspective of annotation frameworks, there 1. INTRODUCTION are significant advances in these domains, such as The Cen- Storing big amounts of data from a specific domain comes in ter for Expanded Data Annotation and Retrieval (CEDAR) hand with several challenges, one of them being to seman- workbench [8] , and the OpenTox framework [11]. tically represent and describe the stored data. Semantic representation enables us to infer new knowledge based on This paper is organized as follows. First, we make an overview the one that we assert, i.e. the description and annotation of the state-of-the-art approaches in annotating, storing, and of the data. This can be done by providing semantic annota- querying of models and experiments. 
Next, we critically as- tions of the data with terms originating from a vocabulary or sess these approaches and sketch a proposal for a system for ontology describing the domain at hand. In computer and annotating, storing and querying data mining models and information science, ontology is a technical term denoting experiments. Finally, we provide a summary and discuss an artifact that is designed for a purpose, which is to en- the possible approaches for further work. able the modeling of knowledge about some domain, real or imagined [15]. Ontologies provide more detailed description of a domain, first by organizing the classes into a taxonomy, 2. BACKGROUND AND RELATED WORK and further on by defining relations between classes. With The state-of-the-art in semantic annotation of data min- semantic annotation we attach meaning to the data, we can ing models and experiments provides very diverse research, infer new knowledge, and perform queries on the data. ranging from domain-specific data mining ontologies, exper- iment databases, to new languages for deploying annotations Data mining and machine learning experiments are con- in unified format. Here, we provide an introduction to the ducted with faster pace than ever before, in various settings state-of-the-art in semantic web, ontologies and vocabular- and domains. In the usual practice of conducting data min- ies, representations of data mining models and experiments, ing experiments, almost none of the settings are recorded, experiment databases, and annotation frameworks. nor the produced models are stored. These predicaments make for a research that is hard to verify, reproduce and up- Semantic technologies. The Semantic Web is defined grade. This is also in line with the FAIR (Findable, Acces- as an extension of the current web in which information is 45 given well-defined meaning, enabling computers and people for (semi) automatically or manually annotating data, there to work in cooperation [14]. The stack of technologies con- are several solutions that exist outside of the data min- sists of multiple layers, however, in this paper we will focus ing domain, which provide innovative approaches and good on the ones essential for our scope of research. Resource foundation for development in the direction of creating a Description Framework (RDF) represents a metadata data software to enable ontology-based semantic annotation of model for the Semantic Web, where the core unit of informa- models and experiments, their storage and querying. The tion is presented as a triple. A triple describes the subject by CEDAR Workbench [13] provides an intuitive interface for its relationship, which is what the predicate resembles, with creating templates and metadata annotation with concepts the object. RDF files are stored in triple store (typically or- defined in the ontologies available at BioPortal4. On the ganized as relational or NoSQL databases [12]), on which we other hand, OpenTox [11] represents domain specific frame- can perform semantic queries, by using querying languages work that provides unified representation of the predictive such as SPARQL. Finally, ontology languages, such as Re- modelling in the domain of toxicology. source Description Framework Schema (RDFS) and Ontol- ogy Web Language (OWL), are formal languages used to 3. CRITICAL ASSESSMENT construct ontologies. 
RDFS provides the basis for all ontol- ogy languages, defining basic constructs and relations, while In this section, we will critically assess the presented state- OWL is far more expressive enabling us to define classes, of-the-art in Section 2 in the context of semantic annota- properties, and instances. tion, storage and querying of data mining models and ex- periments. Ontologies & vocabularies. Currently, there are several ontologies that describe the data mining domain. These The state-of-the-art in ontology design for data mining pro- include the OntoDM ontology [16], DMOP ontology [7], Ex- vides well documented research with various ontologies that pose [4], KDDOnto [1], and KD ontology [10]. MEX [2] is an thoroughly describe the domain from different aspects and interoperable vocabulary for annotating data mining mod- can be used in various applications. OntoDM provides uni- els and experiments with metadata. In addition there have fied framework of top level data mining entities. Building been developments in formalism for representing scientific on this, it describes the domain in great detail, containing experiments in general, such as the EXPO ontology [6]. definitions for each part of the data mining process. Because of the wide reach, it lacks a particular use case scenario. On Representation of models. With the constant devel- the other hand, this same property makes this ontology suit- opment of new environments for developing data mining able for wide range of applications where there is a need of software, it is necessary to have a unified representation describing a part of the domain. of the constructed data mining models and the conducted experiments. The first open standard was the Predictive Ontologies like EXPO and Exposé have a essential meaning Model Markup Language (PMML). For a period of time it in the research since the first one describes a very wide and provided transparent and intuitive representation of data important interdisciplinary domain, while the latter uses it mining models and experiments. However, due to the as a base for defining a specific sub-domain. DMOP ontol- fast growth in the development of new data mining meth- ogy describes the process of algorithm and model selection in ods, PMML was unable to follow the pace and extend its the context of semantic meta mining. Both the KD ontology more and more complicated specification. Its successor, the and KDDOnto describe the knowledge discovery process in Portable Format for Analytics (PFA), was developed having the context of constructing knowledge discovery workflows. the PMML’s drawbacks as guidelines for improvement. They differ mainly in the key concepts on which they were built. At the same time, the MEX vocabulary provides a Experiment and model databases. Storing already con- lightweight framework for automating the metadata gener- ducted experiments in a well structured and transparent ation. Since it is tied with Java environment, it provides manner is essential for researchers to have available, veri- a library which only uses the MEX API and can also be fiable, and reproducible results. An experiment database is implemented in other programming languages. designed to store large number of experiments, with detailed information on their environmental setup, the datasets, algo- All in all, the current state of the art in ontologies for data rithms and their parameter settings, evaluation procedure, mining provides a good foundation for development of ap- and the obtained results [3]. 
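The information an experiment database is described above as holding (environmental setup, dataset, algorithm and parameter settings, evaluation procedure and the obtained results) can be captured in a simple machine-readable record; in this sketch every field name and value is a placeholder rather than a standardised schema, and the metric values are illustrative.

    import json
    from datetime import datetime

    # Illustrative structure of a single experiment-database entry.
    run = {
        "timestamp": datetime.utcnow().isoformat(),
        "environment": {"python": "3.6", "scikit-learn": "0.19.2"},
        "dataset": {"name": "iris", "source": "UCI repository"},
        "algorithm": {"name": "RandomForestClassifier",
                      "parameters": {"n_estimators": 100, "max_depth": None}},
        "evaluation": {"procedure": "10-fold cross-validation",
                       "results": {"accuracy": 0.95, "f1_macro": 0.95}},   # placeholder numbers
    }

    with open("run-0001.json", "w") as f:
        json.dump(run, f, indent=2)

A record like this is the kind of raw material that a semantic annotation engine could later lift into RDF with ontology terms.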
The state-of-the-art in storing plications which will be based on one or several of these setups and results is abundant with approaches and solu- ontologies. Given the wide of coverage they can be easily be tions in different domains. For example, OpenML1 is the combined in a manner to suit the application at hand. biggest machine learning repository of data mining datasets, tasks, flows, and runs, the BioModels2 repository stores In the area of descriptive languages for data mining models more than 8000 experiments and models from the domains and experiments, one can see the path of progress in re- of systems biology, and ModelDB3 is an online repository search. PMML was the first, ground-breaking, XML-based for storing computational neuroscience models. descriptive language. However, with the expansion of the data mining domain, several weaknesses of PMML emerged. Annotation frameworks. When it comes to frameworks The language was not extensible, users could not create chains of models, and it was not compatible with the dis- tributed data processing platforms. Therefore, the same 1https://www.openml.org/ community started working on a new, extensible, portable 2http://www.ebi.ac.uk/biomodels/ 3https://senselab.med.yale.edu/modeldb/ 4https://bioportal.bioontology.org/ 46 language. Since its inception, the PFA format was intended need to have complete information about the conditions in to fill the small gaps that PMML had. Made up of analytic which that experiment was conducted. Namely, we need to primitives, implemented in Python and Scala, it provides the have an annotated dataset, annotation of the algorithm and users with more customizable framework, where they can its parameter settings for the specific run of the experiment. create custom models, model chains, and implement them Since one experiment usually consists of multiple algorithm in a parallelized setting. runs we annotate each run separately, as well as each of the results from each of them. For annotating the results, we use Storing and annotating experiments is of great significance the definitions of the performance metrics formalized in the in multiple scenarios. First, in domains where conducting data mining ontologies. A sketched example of the proposed the experiment is not a trivial task, i.e. the physical or solution is shown in Figure 1. financial conditions challenge the process, there needs to be a database where the setup and the findings of the experiment The proposed system for ontology-based annotation, stor- will be saved. For example, in BioModels.net we have two age, and querying of data mining experiments and models groups of experiments: Manually curated with structured will consist of several components. The users will interact metadata, and experiments without structure. The main with the system through an user interface enabling them drawback with this type of storage is the need for manual to run experiments on a data mining software, which will curation of the metadata. It is repetitive, time consuming export models and experiment setups to a semantic anno- task for which there is a strong need to be automated. tation engine. For example, for testing purposes we plan to use CLUS5 software for predictive clustering and structured In the domain of neuroscience, ModelDB provides an online output prediction, which generates different types of models service for storing and searching computational neuroscience and addresses different data mining tasks. models. 
In this database, alongside the files that constitute the models, researchers also need to upload the code that In the semantic annotation engine, the data mining mod- defines the complete specification of the attributes of the els and experiments will be annotated with terms from the biological system represented in the model, together with extended OntoDM ontology and then stored in a database. files that describe the purpose and application of the model. Once stored, the users will be able to semantically query Therefore, researchers can search the database for models the models and experiments in order to infer new knowl- with specific applications describing biological systems. edge. This will be done through a querying engine based on the SPARQL language, accessible through a user interface. OpenML provides a good framework for storing and anno- tating data mining datasets, experimental setups and runs, In order to perform annotation, we will extend the exist- as well as algorithms. One particular drawback of OpenML ing OntoDM ontology by adding a number of new terms, is that it does not store the actual models that are produced linking it to other domain ontologies, such as Exposé and from each experimental run, and one can not query the mod- EXPO. Linking OntoDM to these ontologies will extend the els. Furthermore, it’s founded on relational-database which domain of OntoDM towards connecting the data mining en- can not provide execution of semantic queries. tities that it already covers with new entities that describe the experimental setup and principles. With this we will All in all, these three examples show significant advances in obtain a schema for annotation of data mining models and storing and annotating models and experiments. However, experiments. The schema will then be used to annotate the there is also a significant room for improvement in the di- data mining models and experiments through a semantic an- rection of storing the models and experiments into NoSQL notation engine. The engine will have to read the models databases that are better suited for this task. and experiments from a data mining software system, anno- tate them with terms from developed schema and produce Finally, in the context of annotation tools the CEDAR Work- an RDF representation of the annotated data. bench and the OpenTox Framework provide a good insight in annotation frameworks. CEDAR enables the user to ex- Furthermore, the RDF graphs will be stored in a triple store ecute the annotations in modular manner by creating tem- database. Since the data mining models and experiments plates and adding elements to them. After curating the differ a lot in their structure, we have yet to decide on the annotations, they can export the schemas either in JSON, type of database in which we will store them. The data JSON-LD, or RDF file. OpenTox [11] is also based on on- stored in this way is set for performing semantic queries tology terms and represents a complete framework that de- on top of it. Therefore, we will develop a SPARQL-based scribes the predictive process in toxicology, starting with querying enigne so the users can perform predefined or cus- toxicity structures and ending with the predictive modelling. tom semantic queries on top of the storage base. Finally, the format of the results is another point where we 4. 
A PROPOSAL FOR SEMANTIC STORE need to decide whether the results will be presented as RDF OF MODELS AND EXPERIMENTS graphs, or in a different format (such as JSON) that is easier to interpret. This software package along with the storage After analysing the previous and current research, we can will then be added as a module to the CLUS software, de- conclude that despite the great achievements, there is a wide veloped at the Department of Knowledge Technologies. area for improvement in which we will contribute in the up- coming period by developing an ontology-based framework for storage and annotation of data mining models and exper- iments. In order to annotate a data mining experiment, we 5http://sourceforge.net/projects/clus 47 Annotation Schema Semantically RDF Triples Domain Extends Annotated OntoDM Storage of DM Ontology 2 Experiment Ontology experiments SPARQL Query SPARQL Query Domain Extends Storage of Querying Ontology 1 DM models RDF Triples Engine Semantically Experiment Annotated Semantic Model Annotation User defined query Model Engine Runs experiments Results Data Mining User interface Software Figure 1. Schema of the proposed solution 5. CONCLUSION & FURTHER WORK and the Public Scholarship, Development, Disability and Maintenance In this paper, we presented the state-of-the-art in annota- Fund of the Republic of Slovenia through its scholarship program. tion, storage and querying in the light of designing a se- mantic store of data mining models and experiments. We 6. REFERENCES first gave an overview of semantic web technologies, such as [1] Claudia Diamantini et al. KDDOnto: An ontology for discovery RDF, SPARQL, RDFS, and OWL that provide a complete and composition of kdd algorithms. Towards Service-Oriented Knowledge Discovery (SoKD’09), pages 13–24, 2009. foundation for annotation and querying of data. [2] Diego Esteves et al. MEX Vocabulary: a lightweight interchange format for machine learning experiments. In Furthermore, we critically reviewed the state-of-the-art on- Proceedings of the 11th International Conference on tologies and vocabularies for describing the domain of data Semantic Systems, pages 169–176. ACM, 2015. [3] Hendrick Blockheel et al. Experiment databases: Towards an mining provide detailed description of the domain of data improved experimental methodology in machine learning. In mining and machine learning (OntoDM, Expose, KD On- European Conference on Principles of Data Mining and tology, DMOP and KDDOnto, MEX). Next, we focused on Knowledge Discovery, pages 6–17. Springer, 2007. experiment databases as repositories where the experiment [4] Joaqin Vanschoren et al. Exposé: An ontology for data mining experiments. In Towards service-oriented knowledge discovery datasets, setups, algorithm parameter settings, and the re- (SoKD-2010), pages 31–46, 2010. sults are available for the performed experiments in various [5] Joaqin Vanschoren et al. Taking machine learning research domains. Furthermore, we saw that annotation frameworks online with OpenML. In Proceedings of the 4th International provide environments for (semi) automatically or manually Conference on Big Data, Streams and Heterogeneous Source Mining, pages 1–4. JMLR. org, 2015. annotating data, by discussing two frameworks from the do- [6] Larisa N Soldatova et al. An ontology of scientific experiments. mains of biomedicine and toxicology in order to analyze best Journal of the Royal Society Interface, 3(11):795–803, 2006. practices present in those domains. [7] Maria C Keet et al. 
The Data Mining OPtimization Ontology. Web Semantics: Science, Services and Agents on the World Wide Web, 32:43–53, 2015. Finally, given the performed analysis of the state-of-the-art, [8] Mark A Musen et al. The Center for Expanded Data we outlined our proposal for an ontology-based framework Annotation and Retrieval. Journal of the American Medical for annotation, storage, and querying of data mining mod- Informatics Association, 22:1148–1152, 2015. els and experiments. The proposed framework consists of an [9] Mark D. Wilkinson et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, annotation schema, a semantic annotation engine, and stor- 3, 2016. age for data mining models and experiments with a querying [10] Monika Záková et al. Automating knowledge discovery engine, all of which will be controlled from an user interface. workflow composition through ontology-based planning. IEEE Transactions on Automation Science and Engineering, It will allow users to semantically query their data mining 8:253–264, 2011. models and experiments in order to infer new knowledge. [11] Olga Tcheremenskaia et al. OpenTox predictive toxicology framework: toxicological ontology and semantic media In the future, we plan to adapt this framework for the needs wiki-based openToxipedia. In Journal of biomedical semantics, page S7, 2012. of research groups or companies that conduct high volume of [12] Olivier Curé et al. RDF database systems: triples storage and data mining experiments, enabling them to obtain a queryable SPARQL query processing. Morgan Kaufmann, 2014. knowledge base consisting of annotated metadadata for all [13] Rafael S Gonçalves et al. The CEDAR Workbench: An experiments and produced models. This will enable them Ontology-Assisted Environment for Authoring Metadata that Describe Scientific Experiments. In International Semantic to reuse existing models on new data for testing purposes, Web Conference, pages 103–110. Springer, 2017. infer knowledge based on past experimental results, all while [14] Tim Berners-Lee et al. The semantic web. Scientific American, saving time and computational resources. 284:34–43, 2001. [15] Tom Gruber. Ontology. Encyclopedia of database systems, pages 1963–1965, 2009. Acknowledgements [16] Panče Panov. A Modular Ontology of Data Mining. PhD The authors would like to acknowledge the support of the Slovenian thesis, Jožef Stefan IPS, Ljubljana, Slovenia, 2012. Research Agency through the projects J2-9230, N2-0056 and L2-7509 48 A Graph-based prediction model with applications [Extended Abstract] ∗ András London József Németh Miklós Krész University of Szeged, Institute University of Szeged, Institute InnoRenew CoE of Informatics of Informatics University of Primorska, IAM Poznan University of University of Szeged, Institute Economics, Department of of Applied Sciences Operations Research ABSTRACT and later it appeared in many areas from social network We present a new model for probabilistic forecasting using analysis to optimization in technical networks (e.g. road graph-based rating method. We provide a “forward-looking” and electric networks) [16]. type graph-based approach and apply it to predict football game outcomes by simply using the historical game results Making predictions in general, and especially in sports as data of the investigated competition. The assumption of our well, is a difficult task. 
The predictions generally appear in model is that the rating of the teams after a game day cor- the form of betting odds, that, in the case of “fixed odds”, rectly reflects the actual relative performance of them. We provide a fairly acceptable source of expert’s predictions re- consider that the smaller the changing of the rating vector – garding sport games outcomes [21]. Thanks to the increasing contains the ratings of each team – after a certain outcome quantity of available data the statistical ranking, rating and in an upcoming single game, the higher the probability of prediction methods have become more dominant in sports that outcome. Performing experiments on European foot- in the last decade. A key question is that how accurate ball championships data, we can observe that the model per- these evaluations are, more concretely, the outcomes of the forms well in general and outperforms some of the advanced upcoming games how accurately can be predicted based on versions of the widely-used Bradley-Terry model in many the statistics, ratings and forecasting models in hand. cases in terms of predictive accuracy. Although the appli- cation we present here is special, we note that our method Statistics-based forecasting models are used to predict the can be applied to forecast general graph processes. outcome of games based on some relevant information of the competing teams and/or players of the teams. A detailed Categories and Subject Descriptors survey of the scientific literature of rating and forecasting I.6 [Simulation and Modeling]: Applications; I.2 [Artificial methods in sports is beyond the scope of this paper, we Intelligence]: Learning refer only some important and recent results in the topic. For some papers with detailed literature overview and sport applications of the the celebrated Bradley-Terry model [3], 1. INTRODUCTION see e.g. [5, 7, 24]). Other popular approach is the Poisson The problem of assigning scores to a set of individuals based goal-distribution based analysis. For some references, see on their pairwise comparisons appears in many areas and ac- for instance [10, 15, 20]. In these models the goals scored tivities. For example in sports, players or teams are ranked by the playing teams follow a Poisson distribution with pa- according to the outcomes of games that they played; the rameter that is a function of attack and defense “rate” of impact of scientific publications can be measured using the the respective teams. A large family of prediction models relations among their citations. Web search engines rank only consider the game results win, loss (and tie) and usu- websites based on their hyperlink structure. The centrality ally uses some probit regression model, for instance [11] and of individuals in social systems can also be evaluated accord- [13]. More recently, well-known data mining techniques, like ing to their social relations. Ranking of individuals based artificial neural networks, decision trees and support vector on the underlying graph that models their bilateral relations machines have also become very popular; some references - has become the central ingredient of Google’s search engine without being exhaustive - see e.g [8, 9, 14, 18].Based on ∗Corresponding author, email: london@inf.u-szeged.hu the huge literature it can be concluded that the prediction accuracy strongly depends on the investigated sport and the feature set of the machine learning algorithms used. 
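As an illustration of the Poisson goal-distribution approach from the related work cited above (and not of the model proposed in this paper), the outcome probabilities of a single game can be computed from assumed scoring rates; the rates below are placeholders, whereas the cited models estimate attack and defense rates from historical data.

    import numpy as np
    from scipy.stats import poisson

    # Illustrative expected goal counts for the home and the away team.
    lambda_home, lambda_away = 1.6, 1.1
    max_goals = 10

    # P(home scores x) * P(away scores y) for all score lines up to max_goals
    p = np.outer(poisson.pmf(np.arange(max_goals + 1), lambda_home),
                 poisson.pmf(np.arange(max_goals + 1), lambda_away))

    print("home win:", round(np.tril(p, -1).sum(), 3))   # x > y
    print("draw:    ", round(np.trace(p), 3))
    print("away win:", round(np.triu(p, 1).sum(), 3))    # x < y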
A notable part of prediction models based on the historical data of game results uses the methodology of ranking and rating. Some recent articles on the topic are e.g. [2, 6, 12, 17, 23]. Specifically highlighting [2], the authors analyzed the predictive power of eight sports ranking methods using only win-loss and score-difference data of American major sports. They found that the least-squares and random-walker methods have significantly better predictive accuracy than the other methods. Moreover, methods utilizing score-differential data are usually more predictive than those using only win-loss data.

In contrast to those techniques, which use the actual respective strengths of the two competing teams, we provide a graph-based, "forward-looking" type of approach. The assumption of our model is that if a rating of the teams after a game day correctly reflects their actual relative performance, then the smaller the change in that rating after a certain result occurs (in an upcoming single game), the higher the probability of that outcome occurring.

The structure of this paper is as follows. After presenting the classical approaches ("Betting Odds" and "The Bradley-Terry Model"), our new model is introduced. Then in Sec. 3 we present our preliminary experimental results, and finally in Sec. 4 we conclude and discuss some possible research directions.

2. MODELS
Let V = (1, . . . , n) be the set of n teams (or players) and let R be the number of game days in a competition among the teams in V. A rating is a function φ_r : V → R that assigns a score to each team after each game day r (r = 1, . . . , R).

2.1 Betting Odds
... if j wins, then the bettor loses his $1. We can calculate the probabilities of the respective events as

    Pr(i beats j) = (1/odds(i)) / (1/odds(i) + 1/odds(j))

and

    Pr(j beats i) = (1/odds(j)) / (1/odds(i) + 1/odds(j)).

We should note here that the odds provided by betting agencies do not represent the true chances (as imagined by the bookmaker) that the event will or will not occur, but are the amount that the bookmaker will pay out on a winning bet. The odds include a profit margin, meaning that the payout to a successful bettor is less than that represented by the true chance of the event occurring. This means mathematically that 1/odds(i) + 1/odds(j) is more than one. This profit expected by the agency is known as the "over-round on the book".

2.2 The Bradley-Terry Model
The Bradley-Terry model [3] is a widely used method to assign probabilities to the possible outcomes when a set of n individuals are repeatedly compared with each other in pairs. For two elements i and j, the probability that i beats j is defined as

    Pr(i beats j) = π_i / (π_i + π_j),

where π_i > 0 is a parameter associated to each individual i = 1, . . . , n, representing its overall skill, or "intrinsic strength". Equivalently, π_i/π_j represents the odds in favor of i beating j, therefore this is a "proportional-odds model". Suppose that i and j played N_ij games against each other, with i winning W_ij of them, and all games are considered to be independent. The likelihood is given by

    L(π_1, . . . , π_n) = ∏_{i<j} (π_i / (π_i + π_j))^{W_ij} (π_j / (π_i + π_j))^{N_ij − W_ij}.

Equivalently, in terms of latent performance scores S_i,

    Pr(S_i > S_j) = Pr(S_i − S_j > 0) = 1 − 1 / (1 + e^{log π_i − log π_j}) = π_i / (π_i + π_j).

Extension with Home Advantage and Tie. A natural extension of the Bradley-Terry model with "home-field advantage", according to [1], say, is to calculate the probabilities as

    Pr(i beats j) = θπ_i / (θπ_i + π_j)   if i is at home,
    Pr(i beats j) = π_i / (π_i + θπ_j)    if j is at home,

where θ > 0 measures the strength of the home-field advantage (or disadvantage).
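A minimal sketch of the Bradley-Terry model above: the win counts are illustrative toy data, and the strengths π_i are fitted here by generic numerical optimization of the log-likelihood rather than by the expectation-maximization implementation the authors used.

    import numpy as np
    from scipy.optimize import minimize

    # wins[i, j] = number of games in which team i beat team j (illustrative data)
    wins = np.array([[0, 3, 4],
                     [1, 0, 2],
                     [0, 2, 0]])
    n = wins.shape[0]

    def neg_log_likelihood(log_pi):
        pi = np.exp(log_pi)                  # keeps pi_i > 0 via log parameterisation
        ll = 0.0
        for i in range(n):
            for j in range(n):
                if i != j and wins[i, j] > 0:
                    ll += wins[i, j] * np.log(pi[i] / (pi[i] + pi[j]))
        return -ll

    res = minimize(neg_log_likelihood, x0=np.zeros(n))
    pi = np.exp(res.x)
    print("relative strengths:", np.round(pi / pi.sum(), 3))
    print("Pr(team 0 beats team 1):", round(pi[0] / (pi[0] + pi[1]), 3))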
Considering also a tie as a possible 50 final result of a game, the following calculations, proposed node home-i with weight x and an edge from node home- in [22], can be used : i to node away-j with weight y are added to the graph, π respectively. Our assumption is that if an outcome x : y i Pr(i beats j) = , has a high probability and it occurs, then it causes a small πi + απj change in the PageRank vector; hence δxy will be small. To simplify the notations let {δ1, . . . , δm} be the distance val- (α2 − 1)πiπj Pr(i ties j) = ues obtained by considering different results {E1, . . . , Em} (πi + απj )(απi + πj ) of the upcoming game between i and j. The goal now is where α > 1. Combining them is straightforward. In our to calculate the probability that a certain result occurs if experiments, we used the Matlab implementations found at {δ1, . . . , δm} is given. To do this, we use the following sim- ple statistics-based machine learning method. Let f +() be http://www.stats.ox.ac.uk/~caron/code/bayesbt/ using the expectation maximization algorithm, described in detail the probability density function of δi random variable where in [7]. the event (game result) Ei occurred. In our implementa- tion Ei ∈ {0 : 0, 1 : 0, 1 : 1, . . . , 5 : 5}, assuming that the probability of other results equals 0. Similarly, let f −() be 2.3 Rating-based Model with Learning the probability density function of δi random variable in Our new model is designed as follows. We will use the term which case the event (game result) Ei did not occur. To “game day” in each case when at least one match is played approximate the f +() and f −() functions, for each game on the given day. For any game day in which we make we use the training data set contains all results and related a forecast, we consider the results matrix that contains all δi (i = 1, . . . , m) values of the preceding T = 40 game days the results of the previous T = 40 game days. For the 40 of the considered game. In our experiments, the gamma dis- game days time window, the entries of the results matrix S tribution (and its density function) turned out to be a fairly are defined as Sij = #{scores team home-i achieved against good approximate for f +(δ) and f −(δ). team away-j}. To take into account the home-field effect, for each team i we distinguish team home-i and team away-i. Assuming that δ1, . . . , δm are independent, using the Bayes Thus, we define a 2n × 2n results matrix, which, in fact, theorem and the law of total probability, we can calculate describes a bipartite graph where each team appears both that in the home team side and the away team side of the graph. For rating the teams, a time-dependent PageRank method f +(δi) Q f −(δ k6=i k ) Pr(Ei|{δ1, . . . , δm}) = . is used. The PageRank scores are calculated according the P f +(δ f −(δ ` `) Qk6=l `) time-dependent PageRank equation We should note here that in this way we assign probabilities λ φ = Π = [I − (1 − λ)St to concrete game final results, which is another novelty of N mod(l1t)−1]−11, (1) our model. Then, for the upcoming game between i and j, defined in [19]. The damping factor is λ = 0.1, while we may the outcome probability of the event “i beats j” is calculated multiply each entry of S with the exponential function 0.98α as to consider time-dependency and obtaining S X mod, where α Pr(i beats j) = Pr(Ek|{δ1, . . . , δm}), denotes the number of game days elapsed since a given result k: Ek encodes a result occurred (and stored in S). 
Note, that a home team and an of team-i win away team PageRank values are calculated for each team. where we sum over those Ek results for which i beats j (i.e. We would like to establish a connection between team home- 1:0, 2:0, 2:1, 3:0, 3:1, etc.). The probabilities Pr(i ties j) i and team away-i using the assumption that home-i is not and Pr(j beats i) can be calculated in a similar way. weaker than away-i. In our implementation we assumed that home-i had a win 2 : 1 against away-i to give a positive bias for home-i at the beginning. In our experiments this setup 3. EXPERIMENTAL RESULTS performed well, but it was not optimized precisely. To measure the accuracy of the forecasting we calculate the mean squared error, which is often called Brier scoring rule Using the above-defined results matrix S and the PageR- in the forecasting literature [4]. The Brier score measures the ank rating vector φ, we assign probabilities to the outcomes mean squared difference between the predicted probability {home team win, tie, away team win} of an upcoming game assigned to the possible outcomes for event E and the actual in game day r between home-i and away-j as follows. Be- outcome oE. Suppose that for a single game g, between i and fore the game day in which we make the forecast, let the j, the forecast is pg = (pgw, pg, pg) contains the probabilities t l calculated PageRank rating vector be φr−1(V ). We use δr of i wins, the game is a tie and i loses, respectively. Let 40 xy to measure how the rating vector of the teams changes if the actual outcome of the game be og = (ogw, og, og), where t l the result of an upcoming game between teams i and j exactly one element is 1, the other two are 0. Noting that is x : y, where x, y = 0, 1, . . . are the scores achieved by the number of games played (and predicted) is N , BS is team i and team j, respectively1. We define δrxy as the Eu- defined as clidean distance between φr−1(V ) and φr 40 40(V ) that is the N rating vector for the new results matrix obtained by adding 1 X BS = ||pg − og||22 x to S N ij and y to Sn+j,i. In the results graph interpreta- g=1 tion this simply means that an edge from node away-j to N 1 X 1We should note here that if the result is 0 : 0, then x = = [(pg − og)2 + (pg − og)2]. N w − og w )2 + (pg t t l l y = 1/2 is used. g=1 51 The best score achievable is 0. In the case of three pos- 6. REFERENCES sible outcomes (win, lost, tie) we can easily see that the [1] A. Agresti. Categorical data analysis. John Wiley & forecast pg = (1/3, 1/3, 1/3) (for each game g and any N ) Sons, New York, 1996. gives accuracy BS = 2/3 = 0.666. We consider this value [2] D. Barrow, I. Drayer, P. Elliott, G. Gaut, and B. Osting. Ranking rankings: an empirical comparison as a worst-case benchmark. One question of our investiga- of the predictive power of sports ranking methods. tion is that how better BS values can be achieved using our Journal of Quantitative Analysis in Sports, method, and how close we can get to the betting agencies’ 9(2):187–202, 2013. fairly good predictions. [3] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired The data set we used contained all final results of given comparisons. Biometrika, 39(3-4):324–345, 1952. seasons of some football leagues, listed in the first two col- [4] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, umn of Table 1. We tested our method as it was described 78(1):1–3, 1950. in Sec. 2.3. 
We start predicting games starting from the [5] K. Butler and J. T. Whelan. The existence of 41th game day; for each single game predictions are made maximum likelihood estimates in the Bradley-Terry using the results of the previous 40 game day before that model and its extensions. arXiv preprint game. The Brier scores were calculated using all predic- math/0412232, 2004. tions we made. Our initial results are summarized in Ta- [6] T. Callaghan, P. J. Mucha, and M. A. Porter. ble 1. To calculate the betting odds probabilities we used Random walker ranking for NCAA division IA football. American Mathematical Monthly, the betting odds provided by bet365 bookmaker available 114(9):761–777, 2007. at http://www.football-data.co.uk/. We could see that [7] F. Caron and A. Doucet. Efficient bayesian inference these predictions gave the best accuracy score (BS) in each for generalized Bradley–Terry models. Journal of case. We highlighted the values where the difference between Computational and Graphical Statistics, the Bradley-Terry method and the PageRank method was 21(1):174–196, 2012. higher than 0.01. Although we can see that slightly more [8] A. C. Constantinou, N. E. Fenton, and M. Neil. than half of the cases the Bradley-Terry model gives a better Pi-football: A bayesian network model for forecasting accuracy, the results are still promising considering the fact association football match outcomes. Knowledge-Based Systems, 36:322–339, 2012. that the parameters of our method and the implementation [9] D. Delen, D. Cogdell, and N. Kasap. A comparative are far from being optimized. analysis of data mining methods in predicting NCAA bowl outcomes. International Journal of Forecasting, 28(2):543–552, 2012. 4. CONCLUSIONS [10] M. J. Dixon and P. F. Pope. The value of statistical We presented a new model for probabilistic forecasting in forecasts in the UK association football betting sports, based on rating methods, that simply use the histor- market. International Journal of Forecasting, 20(4):697–711, 2004. ical game results data of the given sport competition. We [11] D. Forrest, J. Goddard, and R. Simmons. Odds-setters provided a forward-looking type graph based approach. The as forecasters: The case of English football. assumption of our model is that the rating of the teams after International Journal of Forecasting, 21(3):551–564, a game day is correctly reflects their current relative perfor- 2005. mance. We consider that the smaller the changing in the [12] R. Gill and J. Keating. Assessing methods for college rating vector after a certain result occurs in an upcoming football rankings. Journal of Quantitative Analysis in single game, the higher the probability that this event will Sports, 5(2), 2009. occur. Performing experiments on results data sets of Eu- [13] J. Goddard and I. Asimakopoulos. Forecasting football results and the efficiency of fixed-odds betting. ropean football championships, we observed that this model Journal of Forecasting, 23(1):51–66, 2004. performed well in general in terms of predictive accuracy. [14] A. Joseph, N. E. Fenton, and M. Neil. Predicting However, we should note here, that parameter fine tuning football results using bayesian nets and other machine and optimizing certain parts of our implementation are tasks learning techniques. Knowledge-Based Systems, of future work. 19(7):544–553, 2006. [15] D. Karlis and I. Ntzoufras. Analysis of sports data by We emphasize, that our methodology can be also useful to using bivariate Poisson models. 
Journal of the Royal compare different rating methods by measuring that which Statistical Society: Series D (The Statistician), 52(3):381–393, 2003. one reflects better the actual strength (rating) of the teams [16] A. N. Langville and C. D. Meyer. Google’s PageRank according to our interpretation. Finally we should add that and beyond: The science of search engine rankings. the model is general and may be used to investigate such Princeton University Press, 2011. graph processes where the number of nodes is fixed and edges [17] J. Lasek, Z. Szlávik, and S. Bhulai. The predictive are changing over time; moreover it also has a potential to power of ranking systems in association football. link prediction. International Journal of Applied Pattern Recognition, 1(1):27–46, 2013. [18] C. K. Leung and K. W. Joseph. Sports data mining: 5. ACKNOWLEDGMENTS Predicting results for the college football games. Procedia Computer Science, 35:710–719, 2014. This work was partially supported by the National Research, [19] A. London, J. Németh, and T. Németh. Development and Innovation Office - NKFIH, SNN117879. Time-dependent network algorithm for ranking in sports. Acta Cybernetica, 21(3):495–506, 2014. Miklós Krész acknowledges the European Commission for [20] M. J. Maher. Modelling association football scores. funding the InnoRenew CoE project (Grant Agreement #739574) Statistica Neerlandica, 36(3):109–118, 1982. under the Horizon2020 Widespread-Teaming program. 52 Table 1: Accuracy results on football data sets. The values where the difference between the Bradley-Terry method and the PageRank method was higher than 0.01 are shown in bold. League Season Betting odds error Bradley-Terry error PageRank method error 2011/12 0.58934 0.60864 0.59653 Premier League 2012/13 0.56461 0.59744 0.58166 2013/14 0.54191 0.55572 0.59406 2014/15 0.55740 0.60126 0.60966 2011/12 0.58945 0.59994 0.59097 Bundesliga 2012/13 0.57448 0.59794 0.58622 2013/14 0.55724 0.57803 0.60125 2014/15 0.57268 0.60349 0.60604 2011/12 0.54598 0.57837 0.58736 La Liga 2012/13 0.56417 0.58916 0.60205 2013/14 0.57908 0.58016 0.60473 2014/15 0.52317 0.55888 0.56172 [21] P. F. Pope and D. A. Peel. Information, prices and efficiency in a fixed-odds betting market. Economica, pages 323–341, 1989. [22] P. Rao and L. L. Kupper. Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association, 62(317):194–204, 1967. [23] J. A. Trono. Rating/ranking systems, post-season bowl games, and ’the spread’. Journal of Quantitative Analysis in Sports, 6(3), 2010. [24] C. Wang and M. L. Vandebroek. A model based ranking system for soccer teams. Research report, available at SSRN 2273471, 2013. 53 54 Indeks avtorjev / Author index Black Michaela ............................................................................................................................................................................. 33 Carlin Paul .................................................................................................................................................................................... 33 Čerin Matej ................................................................................................................................................................................... 37 Dujič Darko .................................................................................................................................................................................. 
29 Džeroski Sašo ......................................................................................................................................................................... 41, 45 Fuart Flavio .................................................................................................................................................................................. 33 Gojo David ................................................................................................................................................................................... 29 Grobelnik Marko ................................................................................................................................................................ 9, 13, 33 Jenko Miha ..................................................................................................................................................................................... 5 Jovanoski Viktor .......................................................................................................................................................................... 25 Kenda Klemen .............................................................................................................................................................................. 37 Koprivec Filip .............................................................................................................................................................................. 37 Kostovska Ana ............................................................................................................................................................................. 41 Krész Miklós ................................................................................................................................................................................ 49 London András ............................................................................................................................................................................. 49 Massri M. Besher ......................................................................................................................................................................... 13 Mladenić Dunja ............................................................................................................................................................................ 21 Németh József .............................................................................................................................................................................. 49 Novak Blaž ................................................................................................................................................................................... 17 Novak Erik ..................................................................................................................................................................................... 5 Novalija Inna ............................................................................................................................................................................ 9, 13 Panov Panče ........................................................................................................................................................................... 
41, 45 Pejović Veljko .............................................................................................................................................................................. 21 Pita Costa Joao ............................................................................................................................................................................. 33 Rupnik Jan .................................................................................................................................................................................... 25 Santanam Raghu ........................................................................................................................................................................... 33 Stopar Luka .................................................................................................................................................................................. 33 Sun Chenlu ................................................................................................................................................................................... 33 Tolovski Ilin ................................................................................................................................................................................. 45 Urbančič Jasna ......................................................................................................................................................................... 5, 21 Wallace Jonathan.......................................................................................................................................................................... 33 55 56 Konferenca / Conference Uredila / Edited by Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD Dunja Mladenić, Marko Grobelnik Document Outline 01 - Naslovnica-sprednja-C 02 - Naslovnica - notranja - C 03 - Kolofon - C 04 - 05 - IS2018 - Skupni del 07 - Kazalo - C 08 - Naslovnica podkonference - C 09 - Predgovor podkonference - C 10 - Programski odbor podkonference - C 11 - Clanki - C 01 - NovakErik Abstract 1 Introduction 2 Related Work 3 Data Preprocessing 4 Recommender Engine 4.1 Recommendation Results 5 Future Work and Conclusion Acknowledgments References 02 - Novalija 1. INTRODUCTION 2. BACKGROUND The development of smart labour market statistics touches a number of issues from labour market policies area and would provide contributions to questions related to: - job creation, - education and training systems, - labour market segmentation, - improving skill supply and productivity. For instance, the analysis of the available job vacancies could offer an insight into what skills are required in the particular area. Effective trainings based on skills demand could be organized and that would lead into better labour market integrat... A number of stakeholder types will benefit from the development of smart labour market statistics. In particular, the targeted stakeholders are: 3. RELATED WORK The European Data Science Academy (EDSA) [1] was an H2020 EU project that ran between February 2015 and January 2018. The objective of the EDSA project was to deliver the learning tools that are crucially needed to close the skill gap in Data Science ... 
- Analyzed the sector specific skillsets for data analysts across Europe with results reflected at EDSA demand and supply dashboard; - Developed modular and adaptable curricula to meet these data science needs; and - Delivered training supported by multiplatform resources, introducing Learning pathway mechanism that enables effective online training. 4. PROBLEM DEFINITION 4.1 DATA SOURCES 4.2 CONCEPTUAL ARCHITECTURE 4.3 SCENARIOS 4.3.1 DEMAND ANALYSIS 4.3.2 SKILLS ONTOLOGY DEVELOPMENT 4.3.3 SKILLS ONTOLOGY EVOLUTION 5. STATISTICAL INDICATORS 6. CONCLUSION AND FUTURE WORK 7. ACKNOWLEDGMENTS 8. REFERENCES 03 - Massri 1. INTRODUCTION 2. RELATED WORK 3. DESCRIPTION OF DATA 4. METHODOLOGY 4.1 Clustering and Formatting Data 4.2 Choosing the Main Entities 4.3 Detecting the Characteristics of Relationship 5. VISUALIZING THE RESULTS 5.1 Characteristics of the Main Graph 5.2 Main Functionality 5.3 Displaying Relation Information 6. CONCLUSION AND FUTURE WORK 7. ACKNOWLEDGMENTS This work was supported by the euBusinessGraph (ICT-732003-IA) project [6]. 8. REFERENCES 04 - NovakBlaz 1. INTRODUCTION 2. EXPERIMENTAL SETUP 3. RESULTS 4. CONCLUSIONS AND FUTURE WORK 5. ACKNOWLEDGEMENTS 6. REFERENCES 05 - Urbancic Introduction Related work Proposed approach Results Conclusions Acknowledgments References 06 - Jovanoski 07 - Gojo 08 - PitaCosta 09 - Koprivec Introduction PerceptiveSentinel Platform Data Data Acquisition Data Preprocessing Methodology Sample Data Feature Vectors Experiment Results Conclusions Acknowledgments References 10 - Kostovska 11 - Tolovski 12 - London 12 - Index - C 13 - Naslovnica-zadnja-C Blank Page Blank Page Blank Page Blank Page Blank Page Blank Page 04 - 05 - IS2018 - Predgovor in odbori.pdf 04 - IS2018 - Predgovor 05 - IS2018 - Konferencni odbori