Zbornik 19. mednarodne multikonference INFORMACIJSKA DRUŽBA - IS 2016 Zvezek D Proceedings of the 19th International Multiconference INFORMATION SOCIETY - IS 2016 Volume D Izkopavanje znanja in podatkovna skladišča (SiKDD) Data Mining and Data Warehouses (SiKDD) Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 10. oktober 2016 / 10 October 2016 Ljubljana, Slovenia Zbornik 19. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2016 Zvezek D Proceedings of the 19th International Multiconference INFORMATION SOCIETY – IS 2016 Volume D Izkopavanje znanja in podatkovna skladišča (SiKDD) Data Mining and Data Warehouses (Sikdd) Uredila / Edited by Dunja Mladenić, Marko Grobelnik 10. oktober 2016 / 10 October 2016 Ljubljana, Slovenia Urednika: Dunja Mladenić Laboratorij za umetno inteligenco Institut »Jožef Stefan«, Ljubljana Marko Grobelnik Laboratorij za umetno inteligenco Institut »Jožef Stefan«, Ljubljana Založnik: Institut »Jožef Stefan«, Ljubljana Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak Oblikovanje naslovnice: Vesna Lasič Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety Ljubljana, oktober 2016 CIP - Kataložni zapis o publikaciji Narodna in univerzitetna knjižnica, Ljubljana 004.8(082)(0.034.2) MEDNARODNA multikonferenca Informacijska družba (19 ; 2016 ; Ljubljana) Izkopavanje znanja in podatkovna skladišča (SiKDD) [Elektronski vir] : zbornik 19. mednarodne multikonference Informacijska družba - IS 2016, 10. oktober 2016, [Ljubljana, Slovenija] : zvezek D = Data mining and data warehouses (SiKDD) : proceedings of the 19th International Multiconference Information Society - IS 2016, 10 October 2016, Ljubljana, Slovenia : volume D / uredila, edited by Dunja Mladenić, Marko Grobelnik. - El. zbornik. - Ljubljana : Institut Jožef Stefan, 2016 Način dostopa (URL): http://library.ijs.si/Stacks/Proceedings/InformationSociety/2016/IS2016_Volume_D% 20-%20SiKDD.pdf ISBN 978-961-264-100-9 (pdf) 1. Gl. stv. nasl. 2. Vzp. stv. nasl. 3. Mladenić, Dunja 29872935 PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2016 Multikonferenca Informacijska družba (http://is.ijs.si) je z devetnajsto zaporedno prireditvijo osrednji srednjeevropski dogodek na področju informacijske družbe, računalništva in informatike. Letošnja prireditev je ponovno na več lokacijah, osrednji dogodki pa so na Institutu »Jožef Stefan«. Informacijska družba, znanje in umetna inteligenca so spet na razpotju tako same zase kot glede vpliva na človeški razvoj. Se bo eksponentna rast elektronike po Moorovem zakonu nadaljevala ali stagnirala? Bo umetna inteligenca nadaljevala svoj neverjetni razvoj in premagovala ljudi na čedalje več področjih in s tem omogočila razcvet civilizacije, ali pa bo eksponentna rast prebivalstva zlasti v Afriki povzročila zadušitev rasti? Čedalje več pokazateljev kaže v oba ekstrema – da prehajamo v naslednje civilizacijsko obdobje, hkrati pa so planetarni konflikti sodobne družbe čedalje težje obvladljivi. Letos smo v multikonferenco povezali dvanajst odličnih neodvisnih konferenc. Predstavljenih bo okoli 200 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica, ki se ponaša z 39-letno tradicijo odlične znanstvene revije. 
Naslednje leto bo torej konferenca praznovala 20 let in revija 40 let, kar je za področje informacijske družbe častitljiv dosežek. Multikonferenco Informacijska družba 2016 sestavljajo naslednje samostojne konference: • 25-letnica prve internetne povezave v Sloveniji • Slovenska konferenca o umetni inteligenci • Kognitivna znanost • Izkopavanje znanja in podatkovna skladišča • Sodelovanje, programska oprema in storitve v informacijski družbi • Vzgoja in izobraževanje v informacijski družbi • Delavnica »EM-zdravje« • Delavnica »E-heritage« • Tretja študentska računalniška konferenca • Računalništvo in informatika: včeraj za jutri • Interakcija človek-računalnik v informacijski družbi • Uporabno teoretično računalništvo (MATCOS 2016). Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in inštitucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju. V 2016 bomo četrtič podelili nagrado za življenjske dosežke v čast Donalda Michija in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe bo prejel prof. dr. Tomaž Pisanski. Priznanje za dosežek leta bo pripadlo prof. dr. Blažu Zupanu. Že šestič podeljujemo nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono je dobilo ponovno padanje Slovenije na lestvicah informacijske družbe, jagodo pa informacijska podpora Pediatrične klinike. Čestitke nagrajencem! Bojan Orel, predsednik programskega odbora Matjaž Gams, predsednik organizacijskega odbora i FOREWORD - INFORMATION SOCIETY 2016 In its 19th year, the Information Society Multiconference (http://is.ijs.si) remains one of the leading conferences in Central Europe devoted to information society, computer science and informatics. In 2016 it is organized at various locations, with the main events at the Jožef Stefan Institute. The pace of progress of information society, knowledge and artificial intelligence is speeding up, but it seems we are again at a turning point. Will the progress of electronics continue according to the Moore’s law or will it start stagnating? Will AI continue to outperform humans at more and more activities and in this way enable the predicted unseen human progress, or will the growth of human population in particular in Africa cause global decline? Both extremes seem more and more likely – fantastic human progress and planetary decline caused by humans destroying our environment and each other. The Multiconference is running in parallel sessions with 200 presentations of scientific papers at twelve conferences, round tables, workshops and award ceremonies. Selected papers will be published in the Informatica journal, which has 39 years of tradition of excellent research publication. Next year, the conference will celebrate 20 years and the journal 40 years – a remarkable achievement. 
The Information Society 2016 Multiconference consists of the following conferences:
• 25th Anniversary of First Internet Connection in Slovenia
• Slovenian Conference on Artificial Intelligence
• Cognitive Science
• Data Mining and Data Warehouses
• Collaboration, Software and Services in Information Society
• Education in Information Society
• Workshop Electronic and Mobile Health
• Workshop »E-heritage«
• 3rd Student Computer Science Research Conference
• Computer Science and Informatics: Yesterday for Tomorrow
• Human-Computer Interaction in Information Society
• Middle-European Conference on Applied Theoretical Computer Science (MATCOS 2016).
The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e. the Slovenian chapter of the ACM, SLAIS, DKZ and the second national engineering academy, the Slovenian Engineering Academy. In the name of the conference organizers we thank all the societies and institutions, and particularly all the participants for their valuable contributions and their interest in this event, and the reviewers for their thorough reviews. For the fourth year, the award for life-long outstanding contributions will be presented in memory of Donald Michie and Alan Turing. The Michie-Turing award will be given to Prof. Tomaž Pisanski for his life-long outstanding contribution to the development and promotion of the information society in our country. In addition, an award for current achievements will be given to Prof. Blaž Zupan. The information lemon goes to another fall in the Slovenian international ratings on information society, while the information strawberry is awarded for the information system at the Pediatric Clinic. Congratulations!
Bojan Orel, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair
KONFERENČNI ODBORI / CONFERENCE COMMITTEES
International Programme Committee: Vladimir Bajic (South Africa), Heiner Benking (Germany), Se Woo Cheon (South Korea), Howie Firth (UK), Olga Fomichova (Russia), Vladimir Fomichov (Russia), Vesna Hljuz Dobric (Croatia), Alfred Inselberg (Israel), Jay Liebowitz (USA), Huan Liu (Singapore), Henz Martin (Germany), Marcin Paprzycki (USA), Karl Pribram (USA), Claude Sammut (Australia), Jiri Wiedermann (Czech Republic), Xindong Wu (USA), Yiming Ye (USA), Ning Zhong (USA), Wray Buntine (Australia), Bezalel Gavish (USA), Gal A. Kaminka (Israel), Mike Bain (Australia), Michela Milano (Italy), Derong Liu (Chicago, USA), Toby Walsh (Australia)
Organizing Committee: Matjaž Gams (chair), Mitja Luštrek, Lana Zemljak, Vesna Koricki, Mitja Lasič, Robert Blatnik, Aleš Tavčar, Blaž Mahnič, Jure Šorn, Mario Konecki
Programme Committee: Bojan Orel (chair), Nikolaj Zimic (co-chair), Franc Solina (co-chair), Viljan Mahnič (co-chair), Cene Bavec (co-chair), Tomaž Kalin (co-chair), Jozsef Györkös (co-chair), Tadej Bajd, Jaroslav Berce, Mojca Bernik, Marko Bohanec, Ivan Bratko, Andrej Brodnik, Dušan Caf, Saša Divjak, Tomaž Erjavec, Bogdan Filipič, Andrej Gams, Matjaž Gams, Marko Grobelnik, Nikola Guid, Marjan Heričko, Borka Jerman Blažič Džonova, Gorazd Kandus, Urban Kordeš, Marjan Krisper, Andrej Kuščer, Jadran Lenarčič, Borut Likar, Janez Malačič, Olga Markič, Dunja Mladenić, Franc Novak, Vladislav Rajkovič, Grega Repovš, Ivan Rozman, Niko Schlamberger, Stanko Strmčnik, Jurij Šilc, Jurij Tasič, Denis Trček, Andrej Ule, Tanja Urbančič, Boštjan Vilfan, Baldomir Zajc, Blaž Zupan, Boris Žemva, Leon Žlajpah
KAZALO / TABLE OF CONTENTS
Izkopavanje znanja in podatkovna skladišča (SiKDD) / Data Mining and Data Warehouses (SiKDD) ................
1 PREDGOVOR / FOREWORD ................................................................................................................................. 3 PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ..................................................................................... 4 Near Real-Time Transportation Mode Detection Based on Accelerometer Readings / Urbančič Jasna, Bradeško Luka, Senožetnik Matej ...................................................................................................................... 5 Application of Advanced Analytical Techniques in Support of Socio-Economic Impact Assessment of Innovation Funding Programmes / Berčič Katja, Cattaneo Gabriel a, Karlovčec Mario, Fuart Flavio, Cerle Gaber ........................................................................................................................................................ 9 Information Flow between News Articles: Slovene Media Case Study / Choloniewski Jan, Leban Gregor, Maček Sebastijan, Rehar Aljoša ...................................................................................................................... 13 Data Analytics in Aquaculture / Pita Costa Joao, Rihtar Matjaž ........................................................................... 17 Modeling Probability of Default and Credit Limits / Herga Zala, Rupnik Jan, Škraba Primož, Fortuna Blaž ....... 21 Big Data Analysis Combining Website Visit Logs with User Segments and Website Content / Kladnik Matic, Fortuna Blaž, Moore Pat ........................................................................................................................ 25 Visual and Statistical Analysis of VideoLectures.NET / Novak Erik, Novalija Inna .............................................. 29 Spatio-Temporal Clustering Methods / Senožetnik Matej, Bradeško Luka, Kažič Blaž, Mladenić Dunja, Šubic Tine ......................................................................................................................................................... 33 Indeks avtorjev / Author index ................................................................................................................................ 37 v vi Zbornik 19. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2016 Zvezek D Proceedings of the 19th International Multiconference INFORMATION SOCIETY – IS 2016 Volume D Izkopavanje znanja in podatkovna skladišča (SiKDD) Data Mining and Data Warehouses (Sikdd) Uredila / Edited by Dunja Mladenić, Marko Grobelnik 10. oktober 2016 / 10 October 2016 Ljubljana, Slovenia 1 2 PREDGOVOR Tehnologije, ki se ukvarjajo s podatki so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskem procesiranju ampak tudi analitskim vpogledom v podatke – pojavilo se je t.i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line-Analytical-Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore in na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz. 
z drugimi besedami to, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve. FOREWORD Data driven technologies have significantly progressed after mid 90’s. The first phases were mainly focused on storing and efficiently accessing the data, resulted in the development of industry tools for managing large databases, related standards, supporting querying languages, etc. After the initial period, when the data storage was not a primary problem anymore, the development progressed towards analytical functionalities on how to extract added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with OnLine-Analytical-Processing entered as a usual part of a company’s information system portfolio, requiring from the user to set well defined questions about the aggregated views to the data. Data Mining is a technology developed after year 2000, offering automatic data analysis trying to obtain new discoveries from the existing data and enabling a user new insights in the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis, Data, Text and Multimedia Mining, Semantic Technologies, Link Detection and Link Analysis, Social Network Analysis, Data Warehouses. Dunja Mladenić, Marko Grobelnik 3 PROGRAMSKI ODBOR / PROGRAMME COMMITTEE Dunja Mladenić Marko Grobelnik 4 Near Real-Time Transportation Mode Detection Based on Accelerometer Readings Jasna Urbančič, Luka Bradeško, Matej Senožetnik Jožef Stefan Institute and Jožef Stefan International Postgraduate School Jamova cesta 39 1000 Ljubljana, Slovenia ABSTRACT public transportation with buses and trains can be a good This paper describes a method for automatic transportation alternative to private vehicles. mode detection based on smartphone sensors. Our approach is designed to work in real-time as it only requires 5s of sen- As the main smartphone APIs already support fine-grained sor readings for the detection. Because we used accelerom- classification of non-motorized forms of transportation [2, eter instead of GPS signal it uses less battery power and is 3], we focused on distinguishing means of motorized trans- therefore more user and phone-friendly. For the mode detec- port, specifically cars and buses, as the majority of passen- tion we use multiple support vector machine models which ger traffic in Ljubljana represent cars and buses. Trains and enable us distinguishing between multiple modes (bus, train, motorbikes are not that common, whereas subway and tram car). Before the classification, raw measurements are pre- infrastructures do not exist. Our goal is to recognize each processed in order to cancel out the constant acceleration mode of transportation in near real time while mobile phone that is caused by the force of gravity. The results of the users are traveling. paper are promising and are based on the collected training data from approximately 20 hours of driving on trains and 2. RELATED WORK public buses in Ljubljana. 
Ever since smartphones appeared and gained accessibility there has been a lot of research activity for their usage in Keywords user activity recognition and transportation mode detection. activity recognition, support vector machine classification, While the first attempts to recognize user activity were done accelerometer before smartphones, the real effort in that direction started with the development of mobile phones having built-in sen- 1. INTRODUCTION sors [7]. Besides GPS sensors, also GSM triangulation and Nowadays most smartphones have built-in sensors that mea- local area wireless technology (Wi-Fi) can be employed for sure motion, acceleration, orientation, and various other en- the purpose of transportation mode detection. However its vironmental conditions with quite high precision and sam- accuracy is relatively low compared to GPS, therefore we pling frequency. This can be used with great success in deem these out of scope of this paper[8]. everyday challenges, for example tracking and routing ap- plications. It has been proven that smartphone sensors are Latest state of the art research is focused on transporta- useful in monitoring three-dimensional device movement and tion mode detection based on GPS signal and/or accelerom- positioning [1], and also user’s activity detection, which is eter data. Approaches that rely solely on GPS trajectories also in the domain of this paper. require GPS signal of high-quality, whereas phones GPS receiver is generally severely shielded during daily activi- Mobile operating system developers are aware of such ap- ties [10]. This may occur during travel underground, inside plications, therefore their APIs include activity recognition stations, or even when user is not sufficiently close to a win- packages. However, they detect only a few modes - still, dow when traveling in a vehicle [5], and results in loss of walking, running, cycling, and in vehicle. Such coarse-grained positioning information. Another known issue when using classification is not enough for tracking and routing pur- GPS signal on mobile device is high power consumption [5], poses, specially in use-cases for urban environments, where which is especially not pleasing in the case of longer com- mutes. Both of these two issues suggest that the accelerom- eter sensor is more appropriate for activity detection. Another advantage of using accelerometer data over GPS signal is that it does not require additional external data. Many researches using GPS data used external data, such as GIS data on bus stations, bus routes and railway lines [9]. 5 Figure 1: Amplitudes of raw accelerometer data for Figure 2: Amplitudes of preprocessed accelerometer different means of motorized transportation. data for different means of motorized transporta- tion. 3. TRANSPORTATION MODE DETECTION For the purpose of collecting accelerometer data we extended The gravity estimation algorithm works as follows: for a the GPS tracking mobile application with the accelerometer chosen sampling interval (in our case 1s), obtain an esti- measuring ability. The phone sensor measures acceleration mate of the gravity component on each axis by averaging forces in m/s2 for all of the three physical axes (x, y, z). The all measurements in the interval on that axis [6]. After ob- sampling rate is 100Hz (1 measurement every 10 ms). 
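For illustration, the gravity-estimation procedure described here can be written in a few lines. The paper does not publish its preprocessing code, so the snippet below is only a minimal NumPy sketch of the per-axis averaging and subtraction (the function name and defaults are illustrative, not the authors'):

```python
import numpy as np

def remove_gravity(samples, fs=100, gravity_window_s=1.0):
    """Subtract a per-axis gravity estimate from raw accelerometer readings.

    samples: array of shape (n, 3) with x, y, z readings in m/s^2, sampled at
    fs Hz (100 Hz in the paper). Following the description above (and Mizell
    [6]), gravity on each axis is approximated by the mean of that axis over
    each 1 s interval and subtracted from the readings in that interval,
    leaving only the dynamic acceleration component.
    """
    samples = np.asarray(samples, dtype=float)
    win = max(1, int(fs * gravity_window_s))
    dynamic = np.empty_like(samples)
    for start in range(0, len(samples), win):
        chunk = samples[start:start + win]
        gravity_estimate = chunk.mean(axis=0)      # per-axis constant component
        dynamic[start:start + win] = chunk - gravity_estimate
    return dynamic
```

Applied to a recording, such a step yields the dynamic acceleration whose amplitudes are shown in Figure 2.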
To taining the estimates, we subtracted the gravity component increase the diversity of the training data-set, measurements from all of the entries on corresponding axis in given time were acquired in multiple ways: interval. Through this we obtained only the dynamic ac- celeration component of the signal. Amplitudes of dynamic accelerometer signal are shown in Figure 2. • Person is collecting data while traveling by the car and stops the collection at the destination. 3.2 Classification • Person is collecting data while traveling by the bus and After preprocessing the accelerometer readings we extracted stops the collection on exit. features for the classification process. We used mean, vari- ance, skewness, 5th, and 95th percentile of acceleration data • Person is traveling by the train and is collecting data on all three axes. We also split the acceleration into posi- until the arrival to the final destination tive part, which indicates that the velocity of movement in • Person is collecting data while driving a motorbike and that direction is increasing, and negative part, which indi- stops the collection at the destination. cates that the velocity is decreasing, and calculated the same statistics on these two parts. We collected approximately 20 hours of travelling measure- We used support vector machine (SVM) classifier as it was ments, with the travel modes distributed as follows: car − previously successfully used in similar work [8]. The imple- 57%, bus−32%, train−11%, motorbike−0.1%. Amplitudes mentation was SVM classifier (SVC) from QMiner package. of the raw accelerometer data are shown in Figure 1. QMiner is an open source analytics platform for performing large scale data analysis written in C++ and exposed via a 3.1 Preprocessing Javascript API [4]. In order to reduce the computation time and to have faster response time for real-time classification, we split the recorded First we focused on the binary classification of car and bus accelerometer signal into smaller pieces that do not exceed transportation versus the rest. We trained binary classi- 5s timewise. This enables us to work on chunks of record fiers for each of the labels in one against the rest manner. that span only through 5s or less. This additionally pre- That means that examples labled with this particular label serves battery life, saves space, and reduces usage of mobile represented positive examples, whereas all other examples, data. regardless of class represented negative examples. However, we also did one against one classification for each pair of Acceleration measurements are correlated with the orien- labels. That means that examples of one class were marked tation of the phone in 3D space, as gravity is measured to- as positive, examples of another class were marked as nega- gether with the dynamic acceleration caused by phone move- tive, and the rest of the learning set was filtered out. Later, ments. Thus we have to be able to separate the constant we extended this to support multi-class classification. For acceleration caused by the gravity and the dynamic part of multi-class classification we used binary models and com- the acceleration. bined their predictions based on the distance between the 6 Table 1: Table of all extracted features. 
Acceleration data Features X axis Mean (Total, Acceleration, Deceleration) Y axis Standard deviation (Total, Acceleration, Deceleration) Z axis Skewness (Total, Acceleration, Deceleration) Amplitude 5th percentile (Total, Acceleration, Deceleration) 95th percentile (Total, Acceleration, Deceleration) Total number of features 60 Table 2: Classification accuracy, precision, recall, Table 3: Confusion matrix for classification for 3 and F1 score for binary classifiers of different trans- classes with 10-fold cross-validation. portation modes as results of 10-fold cross valida- True\ Pred. Car Bus Train UC tion. Car 0.818 0.012 0.008 0.162 Accuracy Precision Recall F1 score Bus 0.198 0.219 0.072 0.511 Car 0.855 0.852 0.910 0.880 Train 0.118 0.042 0.344 0.496 Bus 0.720 0.620 0.694 0.655 Train 0.876 0.726 0.671 0.697 Table 4: Classification accuracy, precision, recall, and F1 score for classification with 3 classifiers with separating hyper plane and the test sample. 10-fold cross-validation. Accuracy Precision Recall F1 score 4. EVALUATION Car 0.823 0.826 0.818 0.827 Evaluation section is divided into two parts. In the first one Bus 0.725 0.844 0.219 0.347 are presented the results of experiments with one against the Train 0.859 0.683 0.181 0.286 rest classification, whereas in the second part we discuss the Average 0.803 0.784 0.401 0.535 results of one against one classification. In both parts we considered two different scenarios. In the first one we tested if our approach can recognize a specific transportation mode plan behind UC is, that we ask the application providing from all the others (one-vs-all) or if SVC can distinguish accelerometer data for new sample, which can help us re- between two specific modes of transportation (one-vs-one). classify into the proper class. The results in Table 3 and In the second, we used previously obtained binary models Table 4 are not surprising as the majority of cars is classified to classify to three(3) different classes. Our main focus was as cars and also most of misclassified cars are instances that recognizing traveling by cars and buses as these represent belong to either none or more than one class. In contrast majority of the passenger traffic in Ljubljana and therefore to cars, proportions of correctly classified buses and trains the majority of our training data. are smaller than the proportions of UC for these two classes, which shows that our approach to combining predictions for We measured performance of the models with classification multiple classifiers might not be the best. accuracy, precision and recall. Furthermore, we estimated a harmonic mean of precision and recall with the F1 score. 4.2 One against one We used 10-fold cross validation to tune the parameters of We did similarly for one-vs-one binary classification. Results each binary model. of this are shown in Table 5, which shows that cars are very well distinguishable from trains and vice versa. Buses 4.1 One against all are less distinguishable from cars and trains, however the Binary classification (one-vs-all) was done for each of the accuracy and F1 score of all the classifications are still above three classes (car, bus and train). Classification accuracy, 0.8. precision and recall for the most suitable values of parame- ters are listed in Table 2. We used these six binary classifiers for multi-class classifica- tion. Confusion matrix and accuracy, precision, recall and We got the best results (accuracy, precision and recall) for F1 score are listed in Tables 6 and 7. 
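To make the feature set of Table 1 and the one-against-the-rest voting scheme concrete, the following is a minimal sketch. The paper's classifier is the SVC from the QMiner platform [4]; scikit-learn's LinearSVC is used here purely as an illustrative stand-in, and the function names are ours. The UC (unable to classify) rule follows the policy described in the text: a window positive for none of the binary models, or for more than one, is left unclassified.

```python
import numpy as np
from scipy.stats import skew
from sklearn.svm import LinearSVC

MODES = ["car", "bus", "train"]

def window_features(window):
    """The 60 features of Table 1 for one preprocessed 5 s window of shape (n, 3)."""
    window = np.asarray(window, dtype=float)
    amplitude = np.linalg.norm(window, axis=1)
    signals = [window[:, 0], window[:, 1], window[:, 2], amplitude]

    def stats(x):
        # mean, standard deviation, skewness, 5th and 95th percentile
        if x.size == 0:   # e.g. the "deceleration" part of the non-negative amplitude
            return [0.0] * 5
        return [x.mean(), x.std(), skew(x), np.percentile(x, 5), np.percentile(x, 95)]

    feats = []
    for s in signals:
        for part in (s, s[s > 0], s[s < 0]):   # total, acceleration, deceleration
            feats.extend(stats(part))
    return np.array(feats)                     # 4 signals x 3 parts x 5 stats = 60

def train_one_vs_rest(X, y):
    """One binary SVM per mode, trained one against the rest (X: features, y: labels)."""
    y = np.asarray(y)
    return {m: LinearSVC(C=1.0).fit(X, (y == m).astype(int)) for m in MODES}

def classify(models, x):
    """Combine binary decisions; none or more than one positive vote gives UC."""
    votes = [m for m in MODES if models[m].predict(x.reshape(1, -1))[0] == 1]
    return votes[0] if len(votes) == 1 else "UC"
```

The confusion matrices discussed next summarize how such combined predictions behave under 10-fold cross-validation.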
Tables show that clas- car travel detection. There was some drop in the perfor- sification accuracy, precision, recall and F1 score are higher mance of bus travel detection, and even bigger drop for the than in case of one against all classification. Confusion ma- train detection. We assume that the main cause for perfor- mance drop is smaller training data set for bus and train. We plan to resolve this with additional data-set collection Table 5: Accuracy / F1 score for one-vs-one bi- as part of the future work. nary classification. Rows represent positive exam- ples, whereas columns are negative examples. For the multi-class classification we used binary models from Car Bus Train Table 2. We mapped results into 4 classes (car, bus, train and UC - unable to classify). If according to the binary Car 0.848/0.889 0.943/0.962 classification, an instance belongs to none of the classes or Bus 0.856/0.832 0.858/0.896 more than one, we label it as UC (unable to classify). The Train 0.934/0.883 0.815/0.760 7 [4] B. Fortuna, J. Rupnik, J. Brank, C. Fortuna, Table 6: Confusion matrix for classification for 3 V. Jovanoski, and M. Karlovcec. Qminer: Data classes with 10-fold cross-validation. analytics platform for processing streams of structured True\ Pred. Car Bus Train and unstructured data. In Software Engineering for Car 0.893 0.088 0.019 Machine Learning Workshop, Neural Information Bus 0.282 0.573 0.145 Processing Systems, 2014. Train 0.167 0.162 0.671 [5] S. Hemminki, P. Nurmi, and S. Tarkoma. Accelerometer-based transportation mode detection on smartphones. In Proceedings of the 11th ACM Table 7: Classification accuracy, precision, recall, Conference on Embedded Networked Sensor Systems, and F1 score with 10-fold cross-validation. Accuracy Precision Recall F1 score page 13. ACM, 2013. Car 0.825 0.785 0.893 0.835 [6] D. Mizell. Using gravity to estimate accelerometer Bus 0.787 0.724 0.573 0.640 orientation. In Proc. 7th IEEE Int. Symposium on Train 0.886 0.672 0.671 0.671 Wearable Computers (ISWC 2003), page 252. Average 0.832 0.727 0.712 0.716 Citeseer, 2003. [7] S. Reddy, M. Mun, J. Burke, D. Estrin, M. Hansen, and M. Srivastava. Using mobile phones to determine transportation modes. ACM Transactions on Sensor trix shows that nearly 90% of cars is classified correctly, Networks (TOSN), 6(2):13, 2010. whereas for buses and trains the percentage of correctly clas- sified instances drops to 57% and 67% respectively. It also [8] M. A. Shafique and E. Hato. Use of acceleration data shows that trains are equally likely to be miss-classified as for transportation mode prediction. Transportation, cars and the buses. In contrast to trains, buses are more 42(1):163–188, 2015. often miss-classified as cars than trains. In comparison with [9] L. Stenneth, O. Wolfson, P. S. Yu, and B. Xu. one-vs-all multi-class classification, values of recall and F1 Transportation mode detection using mobile phones score are much higher, whereas accuracy and precision for and gis information. In Proceedings of the 19th ACM these two approaches are comparable. SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 54–63. ACM, 2011. 5. CONCLUSION [10] P. Widhalm, P. Nitsche, and N. Brändie. Transport In the presented work we showed that it is possible to detect mode detection with realistic smartphone sensor data. transportation mode using support vector machine, with In Pattern Recognition (ICPR), 2012 21st short readings of accelerometer signal. 
This proves that International Conference on, pages 573–576. IEEE, near real-time activity detection with fine-grained motorized 2012. transportation modality is possible. Nonetheless, there are still some issues for the future work. First one is regarding unbalanced data set as we will have to collect more data, especially for train and motorbike de- tection. Secondly, our task will be improving binary clas- sification regarding buses as this class was most often mis- classified. However, this might be also caused by our policy for combining multiple binary classification results, which is also something we will have to work on. 6. ACKNOWLEDGMENTS This work was supported by the Slovenian Research Agency and the ICT program of the EC under project OPTIMUM (H2020-MG-636160). 7. REFERENCES [1] Sensors overview. https://developer.android.com/ guide/topics/sensors/sensors_overview.html, 2016. [Online; accessed 25-August-2016]. [2] ActivityRecognition. https://developers.google. com/android/reference/com/google/android/gms/ location/ActivityRecognition, 2016. [Online; accessed 25-August-2016]. [3] CMMotionActivity. https://developer.apple.com/ library/ios/documentation/CoreMotion/ Reference/CMMotionActivity_class/index.html#// apple_ref/occ/cl/CMMotionActivity, 2016. [Online; accessed 25-August-2016]. 8 Application of Advanced Analytical Techniques in Support of Socio-economic Impact Assessment of Innovation Funding Programmes Katja Berčič Gabriella Cattaneo Mario Karlovčec Jožef Stefan Institute IDC Italia Jožef Stefan Institute Jamova cesta 39 Viale Monza 14 Jamova cesta 39 1000 Ljubljana 20127 Milano 1000 Ljubljana Slovenia Italy Slovenia katja.bercic@ijs.si gcattaneo@idc.com mario.karlovcec@ijs.si Flavio Fuart Gaber Cerle Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39 Jamova cesta 39 1000 Ljubljana 1000 Ljubljana Slovenia Slovenia flavio.fuart@ijs.si gaber.cerle@ijs.si ABSTRACT Keywords This paper gives an introduction to the European Programme impact assessment, network visualisation, statistical meth- for Internet Innovation and the needs for advanced analyt- ods ical techniques in assessment of the socio-economic impact of funded projects. A technical overview and architecture of 1. INTRODUCTION developed IT tools follows. Broadly, two set of tools were The Future Internet Public-Private Partnership (FI-PPP) explored. (1) An on-line assessment environment that pro- is the European programme for Internet innovation. Phase vides programme managers and companies with feedback Three of the FI-PPP funding is targeting more that 1000 about their potential socio-economic impact. For this set of entrepreneurs, start-ups or SMEs in an attempt to multi- tools the emphasis is on visualisation techniques, therefore ply the uptake and impact of the technologies developed in a detailed rationale and scientific background for those is previous phases1. Under that framework the Future Inter- provided with usage examples. (2) Statistical modules for net Impact Assurance (FI-MPACT)2 Support Action was identification of funding approaches that, potentially, pro- funded in order to collect and assess the qualitative and vide best results (”best practices”) have been developed and quantitative evidence of the potential socio-economic impact applied to available data. 
Through one use-case we have of the programme by measuring and projecting market sec- shown that the proposed IT system is useful in measur- tor economic potential, stakeholder take-up and technolog- ing the impact of innovation funding programmes, detecting ical impact of Phase III SME Accelerator projects to 2020. good management practices and giving support to funded Accelerator projects aim at investing in the strongest start- projects. The whole system has been published on a public ups (”subgrantees”) across Europe and create an ecosystem repository with an open source license, giving opportunity where entrepreneurs and business incubators meet. to re-use, customize and integrate the reporting system into other impact assessment environments. Based on Key Performance Indicators (KPI) elaborated in the Impact Assessment Guidebook [3], a set of Impact As- Categories and Subject Descriptors sessment tools were developed to collect empirical data H.4.0 [Information Systems Applications]: General— from subgrantees and other interested initiatives. In this reporting, statistical analysis, visualisation, networks context, developed IT tools provide an automated assess- ment system, driven by a set of KPIs, allowing Accelera- General Terms tor projects, other start-ups and entrepreneurs to measure Algorithms the potential impact compared to industry standards and the global community of FI-PPP projects. By benchmark- ing their progress in relation to different business processes, they can identify areas where improvements are needed and measure progress. Furthermore, a set of scripts for statistical data process- ing were combined in Accelerator Benchmarking tools 1https://www.fi-ppp.eu/ 2http://cordis.europa.eu/project/rcn/191426 en.html 9 to support the identification of accelerators best practices. tool is implemented as a interactive scatter plot that posi- This quantitative analysis was used to support, guide and tions projects in two-dimensional space defined with per- strengthen the expert judgement of qualitative indicators of formance indicators. This enables comparison of projects best performing accelerators. with different performance levels against different pairs of indicators. Performance indicators supported by the tool In this paper we present the developed IT tools [2], their are: innovation, market focus, feasibility, market needs and usage in practice, some results and possible use of these mattermark growth. Figure 1 gives an example of perfor- tools in future. mance graph where the selected project is compared with other high performance projects, based on market focus and innovation variables. 2. IMPACT ASSESSMENT TOOL The Impact Assessment tool3 was focused around funded subgrantees to facilitate mapping of this portfolio, to con- 2.2 Multi-attribute Based Similarity Tool tribute to the overall impact assessment of the FI-PPP Phase In order to provide both the sub-grantees and accelerators 3 and assist in forecasting the potential impact of this inter- better insight in project’s performance, a tool that shows vention up to 2020. The Self-Assessment tool is open to all the selected project’s position in relation to other projects interested parties and respondents can undertake the survey was created. In this tool, similarities among projects are at different stages to measure their progress. determined using a wide range of attributes that describe projects. 
This gives users a chance to form their own judge- The Impact/Self-Assessment tool provides a start-up sanity ment about “good” or “bad” similarities. The tool is imple- check, by calculating KPIs for Innovation, Market Focus, mented using several well-known data and network analysis Feasibility and Market Needs. Instant feedback is provided methods, and visualized as a network graph. The visual- by benchmarking respondent’s scores with average scores of ization gives sub grantees and their mentors/reviewers in- his/her peers, or a group of most successful peers. sight into how they compete with other projects and ideas, identify possible similarities, find opportunity windows or search for possible partners. The tool is implemented as an 2.1 Key Performance Indicators Benchmark- open API using QMiner [5] data analytics platform for pro- ing cessing large-scale real-time streams containing structured and unstructured data. The approach used for implement- ing multi-attribute based similarity tool consists of following steps: (1) Importing data and feature extraction; (2) Com- puting the main similarity graph; (3) Generating custom graph for selected project; and (4) Visualization. In the first step, project data is imported into the system and features are extracted from the data. Prior to importing, data is transformed into JSON file format. The file contains a configuration part that determines which attributes of the data will be used as features in further analysis. This makes the system flexible for feature-set changes. The feature-set consists of numerical, categorical and textual data obtained from questionnaires. Feature extractor for textual data ap- plies English stop-words removal in the tokenization phase and normalizes the word frequencies using TF-IDF weight- ing. The main similarity graph shows project similarities based on multiple attributes in form of a two dimensional net- work graph. This graph is the basis for constructing other custom graphs where a particular selected project is in the Figure 1: Performance graph based on market fo- focus. The graph is constructed in two steps. First, the cus and innovation variables for a selected project multidimensional vector representation of projects obtained represented with a large violet circle. Different in the feature extraction phase is transformed into two di- shapes correspond to performance levels: rhom- mensional representation by using Multidimensional scaling bus for Hight Performance Initiatives, squares and (MDS) [7] method. A similar approach of dimensionality crosses for a small selection promising initiatives reduction was used in [4] for visualization of a text docu- that received further funding for dissemination ac- ment corpus. In the second step of graph construction, the tivities. two dimensional projections of projects are connected using Delaunay triangulation [9]. This dynamic benchmarking against other respondents gives entrepreneurs and their mentors a tool to monitor progress A custom graph is such a graph where one project is in and identify areas where additional support is required. The the main focus. This can be a project selected from the list of already imported projects. It can also be generated 3https://github.com/JozefStefanInstitute/fi-impact for a new project that is in the preparation phase, to see 10 egorical indicators, CDF diagrams for practice scores and score histograms). All the diagrams are produced in two versions - with all sub-grantees and per-accelerator. 
Results are grouped into three main categories, as described below. The diagram in Figure 3 shows the structure of the tool. The degree of connectivity tables are a result of analysis of FI-PPP project partnerships. They contain counts of dif- ferent types of connections. Every pair of projects can have more than one connection, where each such connection be- tween two projects in the three FI-PPP phases represents a partner that appeared in both. Correlation analysis is performed on all numerical acceler- ator properties and all numerical subgrantee performance indicators by correlating each possible pair. The results in- clude a heatmap of correlations, a table of correlations and a corresponding table with sizes of samples on which the correlations were computed. Figure 2: Multi-attribute based similarity graph for an arbitrary project. The selected project is repre- Best practices identification produces notched box plots for sented with a black dot in centre of the figure. categorical accelerator properties, CDF plots for binary ac- celerator properties, correlation outputs for numerical ac- celerator properties (heatmap and tables) and histograms where is it positioned in relation to other projects. A cus- for performance indicators. These can then provide insight tom graph is generated by first extracting features from the into which practices coincided with better subgrantee per- selected project using the same feature extractors that were formance. used for computing the main similarity graph. In second step, distance between each of the imported project and the We define a good practice as an activity performed by one or selected project is computed (either cosine or euclidean dis- more of the Accelerators’ consortia according to their accel- tance can be used). Based on user selected threshold, an eration plans, which based on objective evidence, is shown edge is created between more similar projects, while less sim- to have contributed to the good performance of subgrantees. ilar projects are excluded from the graph. Finally, projects Here, good performance of the subgrantees is primarily their selected for the custom graph are connected based on the market success (measured in terms of positive dynamics main similarity graph. of revenue growth and customer growth); their ability to convince potential investors and collect additional funding The final step of the similarity tool is graph visualization. (“traction”); if they are not yet on the market, their market The graph is visualized using SigmaJS4 library and a plug- readiness (measured by FI-IMPACT’s KPIs scores). in that implements ForceAtlas2 [6] algorithm. The lay- outing algorithm mimics the physical attraction and repul- The full report about the usage of developed tools, com- sion forces, placing well connected and similar project closer bined with qualitative data based on face-to-face interviews on the two dimensional plane. An example of a similarity with Accelerators coordinators and interpretation of results graph for a project is given in Figure 2. is given in [1]. Some insights were drawn from the results computed by the benchmarking tool, while some of the con- 3. ACCELERATOR BENCHMARKING/BEST clusions formulated with other methods were confirmed by it [1]. 
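The pairwise correlation analysis itself is conceptually simple. The actual FI-IMPACT modules are written in R [8]; purely for illustration, a pandas sketch of the same computation is given below (the "accelerator" key column and the two-table layout are assumptions about the data, not the project's actual schema, and column names are assumed distinct across the two tables):

```python
import pandas as pd

def correlation_report(accelerators: pd.DataFrame, subgrantees: pd.DataFrame) -> dict:
    """Correlate every numerical accelerator property with every numerical
    subgrantee performance indicator, as in the benchmarking tool.

    Both data frames are expected to share an 'accelerator' key column; all
    other numeric columns are treated as properties / indicators. Returns the
    correlation table plus the sample sizes each value is based on.
    """
    merged = subgrantees.merge(accelerators, on="accelerator", how="left")
    props = accelerators.drop(columns=["accelerator"]).select_dtypes("number").columns
    kpis = subgrantees.drop(columns=["accelerator"]).select_dtypes("number").columns

    corr = pd.DataFrame(index=props, columns=kpis, dtype=float)
    n = pd.DataFrame(index=props, columns=kpis, dtype=float)
    for p in props:
        for k in kpis:
            pair = merged[[p, k]].dropna()          # pairwise-complete observations
            n.loc[p, k] = len(pair)
            corr.loc[p, k] = pair[p].corr(pair[k])  # Pearson correlation
    return {"correlations": corr, "sample_sizes": n}
```

Outputs of this kind (correlation and sample-size tables, rendered as heatmaps) are what the findings reported below were read from.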
For example the geographical scope of the partnerships PRACTICES REPORTING TOOLS did not seem to make a major difference, while the presence Accelerator Benchmarking/Best Practices Reporting Tools of professional accelerators had a strong positive influence are written in R [8] programming language and software on most subgrantee performance indicators. environment for statistical computing and graphics. In FI- IMPACT, several R modules5 have been developed to auto- There were several results computed in the benchmarking matically produce statistical reports about Accelerator Bench- tool that did not provide any meaningful insights, but did marking and Best Practices based on FI-IMPACT databases. confirm that the data exhibits some expected properties. For example, unsurprisingly, the benchmarking tool showed The databases are imported and merged into an R data a strong positive correlation between investment per sub- frame. The tool then produces correlation heatmaps, prac- grantee and KPIs measuring feasibility and innovation. tice scores with accompanying mean difference t-test results and FI PPP phase projects connectivity information. Most 4. CONCLUSIONS of the results are produced in final-form CSV files. The tool also outputs several sets of diagrams (boxplots for all cat- We have shown how a set of IT tools and computational methods can be used to assess the socio-economic impact 4http://sigmajs.org/ of a big investment programme, such as FI-PPP European 5https://github.com/JozefStefanInstitute/FI-Impact-R- programme for Internet innovation. On one hand, the im- analysis plementation followed a strict methodology used for bench- 11 Figure 3: Benchmarking tool inputs and outputs marking through questionnaires and calculation of respec- [accessed 12 September 2016]. tive KPIs, on the other hand, more exploratory techniques [2] FI-IMPACT Project. Fi-impact online assessment were used for statistical analysis and visualisation. environment, 2016. Deliverable D4.3 Available from http://www.fi-impact.eu/page/docdownload/479/ While it is true that many of the results of the statistical [accessed 12 September 2016]. analysis implemented in the Accelerator benchmarking tools [3] FI-IMPACT Project. Impact assessment guidebook, v2, did not show statistically significant new results, it is also 2016. Deliverable D2.1.v2 Available from true that the results displayed most of the expected correla- http://www.fi-impact.eu/page/docdownload/474/ tions. The statistical analysis explains little of the variations [accessed 12 September 2016]. in performance of the subgrantees. In our opinion, the main [4] B. Fortuna, M. Grobelnik, and D. Mladenic. weakness of the statistical correlation analysis was the very Visualization of text document corpus. Informatica limited available data on actual market performance. How- (Slovenia), 29(4):497–504, 2005. ever, weak signals, combined with with the results of the [5] B. Fortuna, J. Rupnik, J. Brank, C. Fortuna, qualitative interviews, point to the positive role of profes- V. Jovanoski, M. Karlovcec, B. Kazic, K. Kenda, sional accelerators within consortia, and positive impacts of G. Leban, D. Mladenic, A. Muhic, B. Novak, E. Novak, practices such as workshops, matchmaking and providing J. Novljan, M. Papler, L. Rei, B. Sovdat, and L. Stopar. gateways to further funding. Qminer: Data analytics platform for processing streams of structured and unstructured data. 
In Software Impact Assessment and Similarity Graph benchmarking tools Engineering for Machine Learning Workshop, Neural have been deployed and made available to over 700 sub- Information Processing Systems, 2014. grantees (projects), their mentors an coordinators. The [6] M. Jacomy, T. Venturini, S. Heymann, and M. Bastian. tool was deemed very useful for long-term monitoring of Forceatlas2, a continuous graph layout algorithm for sub-grantees performance. In this respect, the FI-IMPACT handy network visualization designed for the gephi project consortium is drawing plans to assure long-term sus- software. PLoS ONE, 9(6):1–12, 06 2014. tainability of the deployed IT systems. [7] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. 5. ACKNOWLEDGEMENTS Psychometrika, 29(1):1–27, 1964. This work was supported by the the ICT Programme of the [8] R Core Team. R: A Language and Environment for EC under FI-IMPACT (FP7-ICT-FI-632840). Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. 6. REFERENCES [9] P. Su and R. L. S. Drysdale. A comparison of [1] FI-IMPACT Project. Analysis of accelerators’ good sequential delaunay triangulation algorithms. In practices annex 8.5 to deliverable d2.4 update of Proceedings of the Eleventh Annual Symposium on impact assesment and forecast, 2016. Annex 8.5 of D2.4 Computational Geometry, SCG ’95, pages 61–70, New Available from York, NY, USA, 1995. ACM. http://www.fi-impact.eu/page/docdownload/477/ 12 Information flow between news articles: Slovene media case study Jan Chołoniewski Gregor Leban Sebastijan Maček, Center of Excellence for Artificial Intelligence Aljoša Rehar Complex Systems Research, Laboratory, Slovenska Tiskovna Agencija, Faculty of Physics, Jožef Stefan Institute, Tivolska 48, 1000 Ljubljana, Warsaw University of Jamova 39, 1000 Ljubljana, Slovenia Technology, Slovenia {sm,ar}@sta.si Koszykowa 75, PL-00662, gregor.leban@ijs.si Warsaw, Poland choloniewski@if.pw.edu.pl ABSTRACT fully applied to determine a relation between two articles. We present results of a study on usage of text similarity mea- Possible relations that we want to determine are (a) there sures based on co-occurrence of words and phrases to classify is no relation, (b) they share a common source, or (c) one is a relation between a pair of news articles (i.e. no relation, based on the other one. To find the most efficient way to do both based on a common source, one based on the other). that, we calculated cosine similarity of ”bag of n-grams” rep- For each Slovenian article written in Slovene and published resentations of articles from Slovene media published on one online on 27th June 2016, we found the most similar release day with releases from Slovenska Tiskovna Agencija (STA; from the Slovenian Press Agency (STA) database to obtain Slovenian Press Agency) to preselect the most similar re- a list of candidate article-source pairs. Four experts from lease to each article, asked experts to annotate the candi- STA were asked to score the pairs, and their annotations date pairs, and compared results for different thresholds and were used to train classifiers and evaluate their accuracy. n-grams with the annotations. The rest of the paper is structured as follows: in Section 2 1. 
INTRODUCTION we highlight a process of obtaining experimental data (can- Propagating, exchanging, organizing and processing infor- didate source release matching, expert annotation study), mation are important parts of human social interactions on in Section 3 we describe applied methods and benchmark both micro- [1] and macro-level [2]. After years of local and parameters, then in Section 4 we present results of classi- nationwide scale, newspapers, press agencies and news out- fication study in two simplified cases; Section 5 contains a lets started to operate at global level using Internet. An easy discussion and possible improvements, and Section 6 sums and open access to their releases is desirable for news con- up the research. sumers and can be tracked e.g. with website traffic statistics or in social media. Article reuse by other publishers (autho- 2. DATA rized or not) is however not that straight-forward. Data for the study consist of randomly selected 469 articles out of 895 published on 27th June 2016 from 62 Slovene Combining natural language processing methods with data online news outlets as tracked by the EventRegistry [9]. gathered in the Internet allows to quantify and measure so- For each article, we have found the most similar one in the cial information processing phenomena [3, 4, 5, 6]. The ad- STA releases database in terms of cosine similarity of two- vancements in NLP might serve all types of text-based media , three- and four-word phrases ({2,3,4}-grams) occurrence (in particular online news outlets) to provide tools to track vectors with TF-IDF weighting (see section 3 for details). A spreading of their texts. histogram of obtained similarities is presented in Figure 1. About 75% (354 out of 469) of candidate pairs obtained a A tool that automatically finds articles based on a given ar- cosine similarity below 0.1, and 10% (47 out of 469) over 0.9. ticle might be useful for news outlets and press agencies to track usage of their releases and to find cases of plagiarism The pairs were scored by four experts from STA (A1, A2, or unauthorized use. Moreover, it might be applied to large A3, A4). They were asked to mark each pair with one of scale news spreading studies [3]. A software-assisted plagia- the following scores: rism detection is a well-known problem in an information retrieval field [7], and using text similarity-based methods is one of the most popular approaches [8]. To the best of our • NF – the proper source release has not been found knowledge, the following paper is the first published study despite it is present in the STA database, of plagiarism detection in Slovene media supported by pro- • NC – the proper source release has not been found and fessional press agency workers. it is not present in the STA database, The aim of the presented work is to check if text comparison • DS – the proper source release has been found (al- methods based on co-occurrence of phrases can be success- though it might be one of many sources of the article), 13 (2,4)-gram cosine similarity histogram 103 A1 A2 A3 A4 A1 100% 87% 86% 87% A2 87% 100% 89% 89% A3 86% 89% 100% 83% 102 A4 87% 89% 83% 100% (a) Raw annotations count 101 A1 A2 A3 A4 A1 100% 88% 86% 88% A2 88% 100% 91% 89% A3 86% 91% 100% 84% 100 A4 88% 89% 84% 100% 0.0 0.2 0.4 0.6 0.8 1.0 cosine similarity (b) Simplified A – DS and IDS merged Figure 1: A histogram of {2,3,4}-gram cosine similarities of A1 A2 A3 A4 candidate pairs with a logarithmic Y-axis. 
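Before turning to the evaluation, the candidate-matching step described in Sections 2 and 3 can be sketched as follows. The study does not publish its code, and details such as the Laplace smoothing differ from scikit-learn's defaults, so this is only an illustrative Python approximation (the function name and output format are ours); the default threshold is the one later reported for {2,3,4}-grams under simplification B:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_candidate_sources(articles, sta_releases, background_corpus, threshold=0.27):
    """For each article, return the most similar STA release and whether the
    similarity clears the decision threshold.

    articles, sta_releases, background_corpus: lists of raw document strings.
    The vectorizer only mirrors the paper's setup approximately: word
    {2,3,4}-grams, TF-IDF weights fitted on a background corpus, and terms
    occurring in more than 25% of documents discarded (max_df=0.25).
    """
    vectorizer = TfidfVectorizer(ngram_range=(2, 4), max_df=0.25)
    vectorizer.fit(background_corpus)

    article_vecs = vectorizer.transform(articles)
    release_vecs = vectorizer.transform(sta_releases)
    sims = cosine_similarity(article_vecs, release_vecs)   # shape: (articles, releases)

    results = []
    for row in sims:
        best = int(row.argmax())
        results.append({"release_index": best,
                        "similarity": float(row[best]),
                        "has_source": bool(row[best] >= threshold)})
    return results
```

The measures used to score such a classification are defined next.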
A1 100% 96% 99% 95% A2 96% 100% 99% 96% A3 99% 99% 100% 95% • IDS – the article and the proposed source release are A4 95% 96% 95% 100% both based on the same third party source. (c) Simplified B – IDS and NC merged Table 2: Agreement among annotators. In cases where the source was not found (NF), the annota- tors provided a link to the proper source release. In Table 1, we present basic statistics of the annotations 3. METHODS given by experts. We considered two methods of simplifying Articles and releases were mapped to ”bag of n-grams” rep- the annotations. The first one (A), merges DS and IDS resentations. Additionally, n-gram counts were transformed marks to discriminate between two classes – a given pair using term frequency-inverted document frequency (TF-IDF) contains pieces of the same information or is unrelated. The weighting trained on a corpus of 5,000 randomly selected second one (B), merges IDS with NC – the algorithm’s task Slovene articles stored in the EventRegistry published dur- is to check if one text is directly based on the other one. ing two weeks preceding the analyzed day. Terms which occurred in more than 25% of documents were discarded. person total NF NC DS IDS Laplace smoothing with α = 1 was applied to include terms which were not present in the corpus. A1 469 3 315 98 53 A2 469 2 358 97 12 For n = 1, ..., 5, weighed term vectors of Slovene articles A3 95 0 61 23 11 from 27th June 2016 were compared with all vectors of STA A4 95 0 70 20 5 releases published between 20th and 27th June 2016 to find candidate source releases. For each n, we tested classifiers Table 1: Basic statistics of raw candidate release-article with a threshold from 0.00 to 1.00 with steps of 0.01 to find pairs annotations by the STA experts. total – a number the threshold for which the method achieves the highest F1- of annotated pairs; NF – source not detected despite the score for A and B simplifications separately. The releases source release is in the STA archive; NC – no source release were compared with the list created using preselected pairs in the STA archive; DS – one article is a direct source of and experts’ comments. A source release for a given article the other; IDS – both documents based on the same third and a given threshold was considered as correctly found, source article. when it matched the one annotated by human and cosine similarity score was above the threshold. A given article was considered correctly classified if a source release was In Table 2, percentages of agreement among annotators are correctly found or if the article was correctly marked as not being presented for (a) raw annotations, (b) simplification having a source release in the STA database. A and (c) simplification B (see above). Parameters used to score the classification were accuracy, re- The annotators were sometimes non-unanimous when both call, precision, and F1-score. We used following definitions: articles in a pair had a common source (compare Table 2a and 2b, mean agreement = 87%). They were more consis- T P + T N tent when a release was a source of a given article (compare accuracy = Table 2a and 2c, mean agreement = 96%). T P + T N + F P + F N Additionally, because of score inconsistencies, the final list T P has been prepared after discussing problematic cases. 
The parameters used to score the classification were accuracy, recall, precision, and F1-score. We used the following definitions:

accuracy = (TP + TN) / (TP + TN + FP + FN)

recall = TP / (TP + FN)

precision = TP / (TP + FP)

F1 = 2 · (recall · precision) / (recall + precision)

where TP is the number of articles with a correctly found source, TN the number of articles correctly marked as not having a source in the STA database, FP the number of articles incorrectly marked as having a source in the database, and FN the number of articles incorrectly marked as not having a source in the database. Cases when articles had an incorrectly found source were counted separately as errors.

Each annotator could have scored each article-source pair differently, so the mean values and standard deviations of the parameters were calculated by considering the lists of annotations separately.

4. RESULTS
For each value of n, we found the threshold which maximized the mean F1-score over all annotators. Results are shown in Table 3a for simplification A and in Table 3b for simplification B.

Table 3: Thresholds resulting in the best F1 for different ns. acc – mean accuracy, σacc – standard deviation of accuracy, F1 – mean F1-score, σF1 – standard deviation of F1-score, errors – mean number of incorrectly found sources.

(a) Simplified A – direct and indirect relations merged
n        threshold  acc   σacc  F1    σF1   errors
1        0.29       0.90  0.02  0.83  0.04  18
2        0.09       0.91  0.02  0.84  0.03  4
{2,3,4}  0.06       0.91  0.01  0.84  0.02  3
3        0.05       0.91  0.01  0.84  0.02  4
4        0.04       0.90  0.02  0.83  0.02  3
5        0.03       0.90  0.02  0.83  0.02  6

(b) Simplified B – indirect relations and lack of relation merged
n        threshold  acc   σacc  F1    σF1   errors
1        0.56       0.95  0.01  0.88  0.03  4
2        0.46       0.96  0.01  0.90  0.04  1
{2,3,4}  0.27       0.96  0.01  0.90  0.03  1
3        0.25       0.96  0.01  0.90  0.03  1
4        0.22       0.96  0.01  0.90  0.03  1
5        0.13       0.95  0.01  0.89  0.02  2

Figure 2: (top) A histogram of n-gram cosine similarities and (bottom) the fraction of each score in each similarity bin (see Section 2 for abbreviation expansions) for n = 1.

For the final list, the fraction of each score in each cosine similarity bin for n = 1 and n = 3 is shown in the bottom panels of Figures 2 and 3, respectively. The cosine similarities are not dramatically more separated in any of the cases, but using n = 1 leads to a significantly higher number of errors, and using n = 5 to a slight increase in the number of errors.

5. DISCUSSION
For most values of n, over 85% of candidate pairs had extreme cosine similarity values (below 0.1 or over 0.9). Two articles with cosine similarity equal to 1 are duplicates, while articles with cosine similarity equal to 0 are completely unrelated. Similarities between those values are not that clear to interpret. Obtaining more pairs with intermediate values would make the results for boundary cases more reliable. After closer examination, very similar pairs which were marked as unrelated turned out to be annotators' mistakes. On the other hand, in the opposite cases (pairs with low cosine similarity but marked as related) the analyzed articles were rewritten; using lemmatization might be sufficient to identify them as similar.

The results for simplification A are satisfying when compared to the agreement among annotators. There was no significant difference between specific n > 1, but for n = 1 there were as many as 18 errors.
The results for the simplifi- cation B are comparable with an agreement among annota- Using different ns did not cause significant changes of ac- tors (see Table 2) and the classifiers could not find a correct curacies and F1-scores of classifiers in both simplified cases source only in one case in which article was mainly based on but n > 1 allows to correctly find more sources than n = 1. some other release and only partially on the detected one. In most n = 1 errors, the algorithm pointed at some more Again, there is very little difference among different n-grams general release about a given topic. which might suggest that in most cases articles use similar phrasing as the source release and the method is efficient. We considered three types of relations between text pairs – lack of relation, common source, and direct sourcing (one In Figures 2 and 3, we show a histogram of cosine similarities based on the other). For the first and the last types of and a stacked bar plot showing fraction of each score in the relation, it was usually possible to distinguish between them 15 3-gram cosine similarity histogram 400 them, and can be detected with about 96% accuracy. A discrimination between not related and related pairs was 350 possible with a 90% accuracy. 300 250 The results might be useful for a broader use although a par- 200 tial supervision in boundary cases would be required. We count 150 suspect that lemmatization, proper quotations filtering and discarding proper nouns might result in achieving higher ac- 100 curacies. Using cross-lingual similarity measures would be 50 another interesting modification. 0.0 0 0.2 0.4 0.6 0.8 1.0 cosine similarity 7. ACKNOWLEDGMENTS fractions of scores We thank Aleš Pečnik for a technical support, and Tjaša for a given similarity (n = 3) Doljak, Naum Dretnik and Sabina Zonta from STA for an- 1.0 NC notations. This research has received funding as RENOIR NF Project from the European Union’s Horizon 2020 research 0.8 IDS and innovation programme under the Marie Sk lodowska- DS Curie grant agreement No. 691152. JCh has been also sup- 0.6 ported by Ministry of Science and Higher Education (Poland), grant No. 34/H2020/2016. 0.4 fraction 0.2 8. REFERENCES [1] M.N. Bechtoldt, C.K.W. De Dreu, B.A. Nijstad, and H.-S. Choi. Motivated information processing, social 0.0 0.0 0.2 0.4 0.6 0.8 1.0 cosine similarity tuning, and group creativity. J. Pers. Soc. Psychol., 99(4):622–637, 2010. [2] O. Oh, M. Agrawal, and H.R. Rao. Community Figure 3: (top) A histogram of cosine similarities and (bot- intelligence and social media services: A rumor tom) a fraction of each score in each similarity bin (see Sec- theoretic analysis of tweets during social crises. MIS tion 2 for abbreviation expansions) for n = 3. Quart., 37(2):407–426, 2013. [3] K. Lerman. Social information processing in news but in the proposed way it was not possible to accurately aggregation. IEEE Internet Comput., 11(6):16–28, identify when two articles had a common source. 2007. [4] V. Niculae, C. Suen, J. Zhang, The method we used has not created a completely clear sep- C. Danescu-Niculescu-Mizil, and J. Leskovec. aration between considered relation types. In the further QUOTUS: The structure of political media coverage work the approach could be improved with lemmatization, as revealed by quoting patterns. In Proceedings of the mapping to WordNet synsets, discarding proper nouns, or a 24th International Conference on World Wide Web, proper treatment of quotations. 
WWW ’15 Companion, pages 798–808, 2015. [5] A. Chmiel, J. Sienkiewicz, M. Thelwall, G. Paltoglou, It is also important to take into account that the experts K. Buckley, A. Kappas, and J.A. Ho lyst. Collective were able to discriminate pairs because of their domain- emotions online and their influence on community life. specific knowledge. Nevertheless, even highly trained in- PLoS ONE, 6(7), 2011. dividuals scored some pairs differently. In many cases, there [6] J. Cho loniewski, J. Sienkiewicz, J. Ho lyst, and can be more then one source release of an article or an article M. Thelwall. The role of emotional variables in the might be based only partially on a given release. classification and prediction of collective social dynamics. Acta. Phys. Pol. A, 127(3):A21–A28, 2015. An important future work will include use of cross-lingual [7] A Parker and JO Hamblen. Computer algorithms for techniques (e.g. [10]) to compute similarities and detect pla- plagiarism detection. IEEE T. Educ., 32(2):94–99, giarism in news articles in different languages. 1989. [8] D. Metzler, Y. Bernstein, W.B. Croft, A. Moffat, and 6. CONCLUSIONS J. Zobel. Similarity measures for tracking information We have presented a case study of estimating usage of STA flow. In Proceedings of International Conference on releases by Slovene news outlets. We applied ”bag on n- Information and Knowledge Management, pages grams” representations of articles and releases with TF-IDF 517–524, 2005. weighting, and compared them pairwise using cosine simi- [9] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko larity. Detected candidate ”article-source release” pairs were Grobelnik. Event registry: Learning about world annotated by experts. events from news. In Proceedings of the 23rd International Conference on World Wide Web, WWW We compared results of automatic source detection with the ’14 Companion, pages 107–110, 2014. annotations, and as expected found that articles have higher [10] http://xling.ijs.si. cosine similarity to releases when they are directly based on 16 DATA ANALYTICS IN AQUACULTURE Joao Pita Costa and Matjaž Rihtar Artificial Intelligence Laboratory Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia joao.pitacosta@ijs.si, matjaz.rihtar@ijs.si ABSTRACT intuitive and inferential. The major factors to consider in the design and purchase of process control and artificial intelligence The specific challenges in aquaculture today reveal needs and software are functionality/intuitiveness, compatibility, flexibility, problems that must be addressed appropriately and in sync with upgrade path, hardware requirements and cost. Of these, the most recent optimization methods. It is now the time to bring intuitiveness and compatibility are the most important. The the techniques of aquaculture to a new level of development and software must be intuitive to the user or they will not use the understanding. In that, one must consider the state of the art system. Regarding compatibility, the manufacturer should be methods of statistics and data mining that permit a deeper insight congruent with open architecture designs so that the chosen into the aquaculture reality through the collected datasets, either software is interchangeable with other software products. from daily data or from sampling to sampling data. This must also be tuned to the expert knowledge of the fish farmers, their procedures and technology in use today. 
In this paper we review the state of the art of data analytics methodology in aquaculture, the data available deriving from the procedures characteristic to this business, and propose mathematical models that permit a deeper insight on the data. We also address the data unknowns and strategies developed that will contribute to the success of the business, leading to discover valuable information from the data that can be made usable, relevant and actionable. Figure 1. Dynamical plots developed for the project aquaSmart, available through a public interface where the fish farmers can upload their data and Categories and Subject Descriptors do a preliminary analysis and visualization. E.3 Data Structures; I.2 Artificial Intelligence; I.6 Simulation and Modelling The models presented in this paper were developed in the context of the EU project aquaSmart [1]. This project aims enhancing the General Terms innovation capacity within the aquaculture sector, by helping companies to transform captured data into knowledge and use this Algorithms, Data Science, Aquaculture knowledge to dramatically improve performance. In particular, the tools constructed in that context (illustrated in Figure 1) serve Keywords the aquafarmers to evaluate feed performance, considering Aquaculture, data analytics and visualization, important factors such as the water temperature and average fish weight, but also underlying factors such as the oxygen level. 1. INTRODUCTION 2. UNIQUE CHALLENGES Modern research and commercial aquaculture operations have begun to adopt new technologies, including computer control It is well known that the production in aquaculture has specific systems. Aquafarmers realize that by controlling the features and objectives associated with it. When talking about the environmental conditions and system inputs (e.g. water, oxygen, adaptation of existing technology, the features important to the temperature, feed rate and stocking density), physiological rates of production in aquaculture come from weather prediction. These cultured species and final process outputs (e.g. ammonia, pH and are the oxygen levels and water temperature, which are very growth) can be regulated [2]. These are exactly the kinds of specific to this activity. The tasks in fish farming carry several practical measurements that will allow commercial aquaculture uncertainties – often expressed by measurements or even facilities to optimize their efficiency by reducing labor and utility evaluations – that permit further optimization [9]. A classic costs. Anticipated benefits for aquaculture process control and example is the aim for a better control on the food loss and food artificial intelligence systems are: increased process efficiency; quality. A contribution of data mining in this context would be of reduced energy and water losses; reduced labor costs; reduced interest to the aquafarming industry, saving or relocating stress and disease; improved accounting; improved understanding resources. of the process. An important variable that remains undetermined during the The technologies and implementation of the technologies complete production pipeline is the exact number of fish. A necessary for the development of computer intelligent margin of up to 10% of number of fries is added to the initial management systems come in a wide variety [8] and enhanced production at time t=0 due to uncertainty of number of deaths in commercial aquaculture production [3]. Today’s artificial the transport. 
That means that we already have a maximum of intelligence (AI) systems offer the aquaculturist a proven 10% more fish than our estimations (assuming that no fries die methodology for implementing management systems that are both during transport or adaptation at t=0). Other than that we can only 17 have less fish than we estimated due to the lost fish because of appropriate adjustments; (4) Life To Date (LTD) cumulative data unknown reasons. This is already an open problem at the level of that is calculated from the time when the fish enters the net as a the bounds for total amount of harvested fish and the description fry to the date of data collection, and will last until the date of the of best-case scenario and worst-case scenario. This represents a harvest. big lack of knowledge about production. In fact, the unknown number of fish until the end of the production is important for the The identification data in input (1) is rather unspecific, as we amount of food given and, consequently, for the resources spent. cannot at this date in time identify the fish one by one as it is done in other animal farming such as cows and pigs. The data in this Feed composition has also a large impact on the growth of input category is distinguished between the group of production animals, particularly marine fish. Quantitative dynamic models indicating localization - Unit - and the individual production exist to predict the growth and body composition of marine fish series of fish - Batch. There is no further distinction in the for a given feed composition over a timespan of several months identification. Batch has to go with Unit. Aquafarmers may have [7]. The model takes into consideration the effects of different batches in one unit or fish from one batch in many units. environmental factors, particularly temperature, on growth, and it The daily data in input (2) is recorded by the aquafarmers on a incorporates detailed kinetics describing the main metabolic daily basis. These data columns follow the development of the processes (protein, lipid, and central metabolism) known to play fish since day one when it enters as a fry. The data inputted major roles in growth and body composition. That showed that mostly follows one batch of fish from the beginning till the end of multiscale models in biology can yield reasonable and useful the production. One input data can have several units but, for results. The model predictions are reliable over several timescales purposes of the algorithms used, we consider only the time spent and in the presence of strong temperature fluctuations, which are in one unit. For some of the algorithms used, the data is split this crucial factors for modeling marine organism growth. way (some data tables don’t have values in the column ‘harvest’) with clear input/output within one unit. 3. UNKNOWNS IN THE DATA The sampling data in input (3) serves the aquafarmer to It is curious that the underlying problems with the data unknowns improve/fix his/her initial Feed Conversion Ratio (FCR) model in aquaculture represent a problem of large dimensions for the with real data. This includes features that can be learned by a industry of aquaculture, in which the production is specific set of data. Those features will later be important for the straightforward. In fact, it is not known at any time in production, algorithms. They often correspond to columns with potential the exact number of fish in production, and therefore it is not effect on the end result. 
Also, they can influence the production possible to calculate with exactness the amount of food needed to (e.g. ‘feeder’). The software will adapt to data and will try to do support an appropriate growth. Furthermore, there are many the analysis and prediction from the available data. Note that the conditionings in the progress of the production that must be taken input will also include data columns unknown to the system and into account and are hard to measure with the existing and optional to the aquafarmer. We cannot predict the relevance of the available technology. In that, it is important to describe some of data on those columns (neither their nature) but will consider them the features of the data including an assessment on its quality and in the overall global analytics. measures to overcome obstacles to the analysis. The input and output variables of the dataset are classified as: numerical and categorical. Numerical variables can be: continuous measured quantities expressed as a float (e.g. ‘av. weight’); discrete quantities expressed as an integer (e.g. ‘number of fish’). Categorical variables can be: regular categorical data including non-ordered classes (e.g. species Bream/Bass); or ordinal classes that can be ordered in levels (e.g. estimations poor/fair/good). From the variables that can be measured it is important to distinguish between: (i) variables that do not change over time, often identifying population attributes (e.g. identifications such as ‘year’ or ‘hatchery’); (ii) variables that can change over time but do not change within a sampling period (e.g. ‘batch’); (iii) variables that change daily, taken into account when samplings occur (e.g. ‘average weight’). Table 1. Classification of values according to time dependence. Figure 2. The proposed data cleaning process for aquaculture data, change in time? direct calculated derived including the update of the metadata in the system and user interaction. yes water temp. FCR, SFR av. weight The daily data, the sampling data and the LTD data in inputs 2, 3 no identification av. weight at t=0 hatchery and 4 fall into three categories: (i) Direct values, that correspond to the direct observation of the aquafarmers on either variables Essentially we have four types of input data according to the values including small errors measured in the field (e.g. sampling impact they assure: (1) identification data that permits the fish measures such as average weight) or precise values provided by farmer to manage the production and correctly identify the fish; external sources (e.g. water temperature or oxygen level); (ii) (2) Daily data that is provided by the fish farmers resulting from Calculated values, that are dependent of a number of other their everyday data input (e.g. ‘date’, ‘av. wt.’, ‘actual feed’, etc.); observed values (e.g. LTD values calculated from the daily data); (3) Sampling data, collected at predetermined points of the fish (iii) Derived values – values deriving from previously available growth timeline, to confirm the model values and make the 18 data (e.g. FCR calculated from the table, given average weight numbers accordingly. Also specifying the influence of sexual and water temperature). maturity and the lack of oxygen, which are done by hand/intuition, have features to take in consideration by the math The original data provided by the aquafarmers has variances/holes model. 
The FCR models in this paper consider only temperature and is not precise because it is not measured automatically but and average weight. instead entered by human hand (with some exceptions such as ‘temperature’). Sometimes it is not entered for 1 or 2 days due to Each aquaculture entity draws an appropriate FCR table to that the bad weather, which complicates the access to the batch of fish. Higher temperature leads to lower energy spent and measurements and to the units themselves (sometimes this adds faster growth, and consequently to a lower FCR. As the fish gets up to 4 days without entries). Sometimes this is due to intentional bigger, he needs more food to increase his biomass in percentage, fasting to readjust features and in that case the data measurements and thus the FCR grows higher with the increase of the average stay the same as the ones in the previous fields, just before fasting weight. The quality of the food and the size of the pellet size are takes place. The major discrepancies should be pushed to the user not considered at this point. At high temperatures (above 30 as a compromise. If the data is missing up to a certain threshold, degrees in the case of bream and bass) low oxygen leads to low the data will be sent back to the user in order to be inputted once conversion to biomass. This is one of the hidden variables in the again after appropriate corrections. The options for the missing model, which should be considered separately at a later stage. One data problem are to consider it as an error and report it to the user of the possibilities would be to penalize the FCR tables for the requesting the missing data, or consider the average from the lack of oxygen. The other variable is the high reproduction of the missing data in the sense of interpolation on a fixed mesh grid. fish in low temperatures and high average weight, which highly affects the growth of the fish. Recall that the Economic FCR is the 4. DATA ANALYTICS IN AQUACULTURE real FCR index following from the quotient between food given to the fish and the fish biomass. When the temperature is too high or Mathematical modeling aims to describe the different aspects of too low we should ignore the data that is filled in with zeros and the real world, their interaction, and their dynamics through considered empirical data. mathematics. It constitutes the third pillar of science and In the following we present the plots of the models for the three engineering, achieving the fulfillment of the two more traditional fish farms in AquaSmart. It includes 3 fish farms. disciplines, which are theoretical analysis and experimentation [4]. Nowadays, mathematical modeling has a key role also in aquaculture. In the following section we will present an overview of that. Growth and reproductive modeling of wild and captive species is essential to understand how much of food resources an organism must consume, and how changes to the resources in an ecosystem alter the population sizes [6]. The FCR is an important performance indicator to estimate the growth of the fish. It is widely used by the aquaculture fish farmers in pair with the Specific Feeding Ratio (SFR). Its Figure 3. Company A: Real data (on the left) and FCR model (on the importance follows from the fact that 70% of the production costs right) for the bream production. in aquaculture are assigned to the food given to the fish during growth. Some of it will fall through the net and some will be spared. 
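As a small worked example of the quantities discussed here, the snippet below computes FCR and FCE (defined further on) for one feeding period and looks up a farm-specific FCR table indexed by water temperature and average weight bands. All numbers, band edges and names are hypothetical and are not taken from the aquaSmart data.

# dry weight of feed given and wet biomass gained over one period (hypothetical)
feed_given_kg = 120.0
biomass_gain_kg = 75.0

fcr = feed_given_kg / biomass_gain_kg      # feed conversion ratio, unitless
fce = 1.0 / fcr * 100.0                    # feed conversion efficiency, in percent
print("FCR = %.2f, FCE = %.1f%%" % (fcr, fce))

# a hypothetical farm-specific FCR table indexed by water temperature band and
# average weight band (higher temperature -> lower FCR, heavier fish -> higher FCR)
fcr_table = {
    ("15-20C", "0-100g"):   1.1,
    ("15-20C", "100-300g"): 1.4,
    ("20-25C", "0-100g"):   0.9,
    ("20-25C", "100-300g"): 1.2,
}

def temperature_band(temp_c):
    return "15-20C" if temp_c < 20.0 else "20-25C"

def weight_band(avg_weight_g):
    return "0-100g" if avg_weight_g < 100.0 else "100-300g"

expected_fcr = fcr_table[(temperature_band(22.5), weight_band(180.0))]
print("expected FCR from the table:", expected_fcr)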
The optimization of the feeding of the fish can carry great benefits to the economic development of the fish farms. Specifically, the FCR permits the aquafarmer to determine how efficiently a fish is converting feed into new tissue, defined as growth [10]. Recall that the FCR is a ratio that does not have any units provided by the formula: FCR = dry weight of feed consumed/wet weight of gain Figure 4. Company B: Real data (on the left) and FCR model (on the right) for the bream production. while the feed conversion efficiency (FCE) is expressed as a percentage as follows: FCE = 1/FCR × 100 There seems to be some controversy among aquatic animal nutritionists as to which is the proper parameter to measure, but in aquaSmart we used FCR (exposing here FCE for completion). Moreover, the FCR and FCE are based on dry weight of feed and fish gain, as the water in dry pelleted feed is not considered to be significant. A typical feed pellet contains about 10% moisture that Figure 5. Company C: Real data (on the left) and FCR model (on the will only slightly improve the FCR and FCE. right) for the bream production. The FCR table allows the fish farmer to assess the amount of food The model (on the right) produced based on the sample data (on to give to the fish according to their average weight and the the left) serves as a base of comparison with the historical data temperature of the water. Each farm has its own FCR table. This provided by a particular fish farm. Thus, with the new real data is an opportunity to create our own table/model by tweaking the getting in our system, the fish farmer can compare it with the 19 model and make an evaluation on the progress of the production. Moreover, the statistical analysis of the results permit a clearer These models complement and confirm the expert knowledge: the visualization of the important features in the data that can boost high values on the right correspond to high fish reproduction in the production and optimize the processes related to it. That will cold water temperatures and high average weight values. On the enable classification and forecast based on the analytics of the other hand, high temperatures represent low levels of oxygen available data. which request higher feeding rate to maintain and increase the growth rate. In that, future work includes the production of guidelines validated by end-users in order to facilitate the application of The big number of peaks in the real data, plotted on the left, further advanced learning methods in aquaculture. correspond to the real values. Typically the input data can be seen within a grid. The following images show the grid view of both ACKNOWLEDGMENTS the real data (on the left) and the FCR model (on the right) for the company C. The authors would like to acknowledge that his work was funded by the project AquaSmart, co-funded by the European Commission with the agreement number 644715 (H2020 Programme). We would also like to thank to Primož Škraba, Luka Stopar, Dunja Mladenić, Pedro Gomes de Oliveira, Ruben Costa, Gerasimos Antzoulatos, Nir Tzohari, John McLaughlin and Gary McManus. Figure 6. The grid view of both the real data (on the left) and the FCR REFERENCES model (on the right) for the company C. We then use least squares method to interpolate the missing [1] aquaSmart Consortium, The aquaSmart Project [Online]. values including all non-peak values as those interpolated values. URL: www.aquasmartdata.eu. [accessed in 22.8.2016]. 
It does so by approximate the solution of overdetermined systems. [2] Bhujel, R. C. (2011). Statistics for Aquaculture. John Wiley The average weight must be represented using specific values that & Sons. are important in the fish production decision making, and eventually distinct from fish farm to fish farm. Thus we consider a [3] Beveridge M (2004); Cage Aquaculture; Third Edition, second interpolation to produce a final FCR table that is consistent Oxford, UK. with the systems in use by the fish farms. The nearest neighbours [4] N. R. Draper and H. Smith, Applied regression analysis, 3th algorithm is used here to find the values outside the area [5]. That edition. New York: Wiley, 1998. permits us to consider the complete table of measurements in line [5] Han, J., Kamber, M., & Pei, J. (2011). Data Mining: with the sample data available and the missing values calculated Concepts and Techniques. Elsevier. for the area inside the region. [6] Dobson, A. J., & Barnett, A. (2008). An Introduction to 5. CONCLUSIONS Generalized Linear Models, Third Edition. CRC Press. The challenges of aquaculture for data analytics are very specific [7] Bar, N. S., & Radde, N. (2009). Long-term prediction of fish in the field and must be addressed with the appropriate growth under varying ambient temperature using a multiscale methodology and technology, in tune with the expertise of the fish dynamic model. BMC Systems Biology, 3(1). farmers. The uncertainty of measures, such as the number of fish [8] Lee P.G. 2000. Process control and artificial intelligence until the time of harvest, derives in variances that do not permit a software for aquaculture. Aquacultural Engineering, 23, 13- complete accuracy of some of the calculations. This is particularly 36. important to some of the available tools to monitor the business, such as the feed conversion rate tables in use by the fish farmers [9] Rizzo, G., and Spagnolo, M. 1996. A Model for the Optimal to optimize the production costs. Management of Sea Bass Dicentrarchus Labrax Aquaculture. Mar. Resour. Econ. 11: 267–286. The mathematical models developed in the aquaSmart project and [10] Stickney, R. R. (2007). Aquaculture: An introduction. (C. discussed in this paper aim to contribute to the improvement of Publishing, Ed.). CABI Publishing. the aquaculture procedures, providing a deeper insight on the information retained in the collected data, using state-of-the-art methods of data mining in line with the expert knowledge of the field transferred to the metadata in the data store. 20 Modeling Probability of Default and Credit Limits Zala Herga Jan Rupnik Jožef Stefan Institute Jožef Stefan Institute Jamova 39 Jamova 39 Ljubljana, Slovenija Ljubljana, Slovenija zala.herga@ijs.si jan.rupnik@ijs.si Primoz Škraba Blaž Fortuna Jožef Stefan Institute Jožef Stefan Institute Jamova 39 Jamova 39 Ljubljana, Slovenija Ljubljana, Slovenija primoz.skraba@ijs.si blaz.fortuna@ijs.si ABSTRACT understand. For credit limits, we implemented a linear pro- Creditors carry the risk of their clients not meeting their gramming based approach [5] which provides portfolio opti- debt obligations. In the literature, these events are often mization with risk aversion constraints. referred to as default events. These can be modeled for each company through a probability of default (PD). Measures This paper presents the workflow and the methodology that can be taken to limit the default risk: in this paper we fo- we employed to build a portfolio credit allocation model. 
cused on credit limit. Firstly, we predict PD of a company Due to privacy concerns it does not include any expiremental using a logistic regression model, based on publicly available results and does not discuss any concrete results or aspects financial data. Secondly, we effectively find an optimal port- of real data. folio under risk aversion constraints and show how variation of inputs affects the results. The paper is organized as follows. Section 2 provides an overview of related work. In Section 3 the data that is used in modeling is described. Section 4 first describes the ap- Categories and Subject Descriptors proach and computation of the PD model and then presents Mathematics of computing [Mathematical optimization]: the results. Section 5 provides a short theoretical introduc- [Linear programming, Convex optimization]; Computing method- tion to portfolio optimization and then presents our compu-ologies [Machine learning]: [Supervised learning by re- tation and results. Section 6 concludes the paper. gression] 2. RELATED WORK Keywords The Altman Z-score [1] is a widely used credit-scoring model. PD model, logit, credit limit model, portfolio optimization, It is a linear combination of five commonly used financial in- linear programming, risk management dicators and it predicts company’s degree of PD. Both [6] and [8] argue that Altman Z-score and distance-to-default 1. INTRODUCTION ([7]) are not appropriate to use in the context of small busi- Payment defaults represent a key default risk (also credit nesses. The authors in [6] predict PD using delinquency risk) to creditors. Creditors can limit their risk by either data on French small businesses. They propose a scoring insuring their claims or taking preventative measures before model with an accuracy ratio based solely on information extending a credit. Standard tools to measure default risk about the past payment behavior of corporations. Similarly, include different kinds of credit ratings. [8] forecasts distress in European SME portfolios. They es- timate the PD using a multi-period logit model. They found Our goal was to create a model that predicts a company’s that the larger the SMEs, the less vulnerable they are to the probability of default (PD) and provides credit limit sugges- macroeconomic situation. They also show that SMEs across tions based on the computed PD. One of our constraints Europe are sensitive to the same firm-specific factors. was that the underlying PD model be simple and easy to [2] examine the accuracy of a default forecasting model based on Merton’s bond pricing model [7] and show that it does not produce a sufficient statistic for the PD. [11] compare the predictive accuracy of PD among six data mining meth- ods. They also present a novel “sorting smoothing method” for estimating the real PD. Using a simple linear regression, they show that artificial neural networks produce the best forecasting model. [3] used the Merton model to show that, on contrary to what theory suggests, the difference in re- turns between high and low PD stock is negative and that 21 returns almost monotonically decrease as the PD increases. bins 0.8 However, they found a positive relationship systematic de- pGood fault risk exposure and returns. 0.6 pBad 0.4 On the problem of portfolio optimization, [9] showed that 0.2 Conditional Value at Risk minimization with a minimum 0 expected return can be computed using linear programming 1 2 3 4 5 techniques. 
[5] built on this idea and showed that alterna- WOE 1 tively, one can maximize returns while not allowing large 0.5 risks. 0 3. DATA -0.5 The dataset that we used covers several thousand companies -1 1 1.5 2 2.5 3 3.5 4 4.5 5 from several European countries. Data for each company consists of two parts: financial data and trading data. Figure 1: Bins and WOE on simulated data. The greater Financial data corresponds to publicly available data - bal- the value of the simulated data the greater the evidence that ance sheets and income statements of a company. a company is ’good’. Trading data consists of private information on trades be- tween our data provider and his clients. It contains monthly tus of a company. We transformed each financial indicator data about the sum of trades, outstanding debts, disputed vector into a feature vector by binning and assigning Weight claims and delayed payments. The data is available for years of Evidence (WOE) [10] to it. The idea is as follows: create between January 2010 and June 2016. n bins in range from min to max indicator value and assign each company to the corresponding bin. Then count the 4. PD MODEL number of ’bad’ (defaulted) and ’good’ companies in each All creditors carry the risk of their clients not paying the bin. Then assign WOE to a bin as bills. We can be quite certain that some clients will not P(company = good) pay, however, we have difficulties identifying those clients. log . P(company = bad) Hence, our first task was to compute the probability of de- fault for each client company. The end model should also Since WOE scores will be used as inputs to a linear model, be simple and easy (intuitive) for domain experts to under- they should be a monotonous function over bins, meaning, stand. the higher the financial indicator value the better the com- pany is (if WOE is increasing) or the higher the indicator the Default can be defined in multiple ways, considering the worse the company is (if WOE is decreasing). In Figure 1 available data and how strict we want our model to be. Do we show an example of binning and WOE transformation we consider it a default if the client is only one day late on simulated data. on payments? Or do we let them be 30, 45, 60 days late before taking action? Are we going to consider a client de- Thus, we obtained 40 out of 45 features. Another feature faulted if he owes 10e? What if the client didn’t pay one was the size of the company (based on income) and four of bill but has been paying all the bills after? We are required them were country features (dummy variables - one for each to make a judgment call and choose a definition that meets country). the creditor’s needs best. As described above, we mapped “raw” features into WOE As soon as a client defaulted we removed it from the dataset. features: Meaning, each client can only default once and after that we assume there was no more trading done with them. (x1, x2, ...) → (woe(x1), woe(x2)..) PD of a company is then predicted through logistic distri- There were some companies that were already defaulted at bution function: the beginning of the time-span of our dataset. We removed these companies from the dataset since they provide no use- 1 F (x) = ful information. We also filtered out clients with low sales 1 + exp−(β0+β1·woe(x1)+β2·woe(x2)+...+βn·woe(xn)) since their financial data can be very unreliable and their where βi denotes linear regression coefficients. impact on our model is big compared to the trading volume that they generate. 
4.1 Computation Due to the constraint that the model should be simple and Since we have the response variable observations on monthly easy to understand we chose to model the PD with logistic level, we interpolated features to obtain the same frequency regression. of explanatory variables. We also took into account offset of the financial and trading data: financial data is only made From the available financial data we calculated 45 financial available sometime around June each year for the previous indicators for each company that cover aspects of solvency, year (e.g. in June 2016 we only know financial statuses of liquidity, debt, profitability and operative effectiveness sta- 2015). 22 Missed trading volume vs. disputed claims but how to set the limit? If the limit is to high, the client 25 might not be able to pay the bills, but if the limit is too low, profit is lost on trading volume. The model that we created 20 is inspired by [5], is based on PD calculation presented in X: 6.456 Y: 15.51 first part of this paper and takes into account the level of 15 risk that creditor is willing to take. 10 Let us introduce some standard financial risk-related terms. disputed claims Value at Risk (VaR) is an upper percentile of loss distribu- tion. Probability level is denoted as α. E.g. 95% − V aR of 5 1, 000, 000e means that there is a 0.05 probability that loss will exceed 1, 000, 000e. Conditional Value at Risk (CVaR) 0 is the conditional expected loss under the condition that it 0 50 100 150 200 250 missed trading volume exceeds VaR. CVaR at α = 95% level is the expected loss in the 5% worst scenarios. ω denotes the maximum allowed Figure 2: Missed trading volume vs. disputed claims on CVaR of the portfolio at level α. simulated data. The info box marks 6%-level: if trading was stopped with the worst 6% companies, approx. 7.5 could be We will denote loss associated with the portfolio x and ran- saved on disputed claims and 6.5 · margin profit would be dom vector y (with density p(y)) as f (x, y); CV aR will be lost on trading volume. The figure is included for illustrative noted as φα(x), which is given by purposes and is not based on real data for privacy concerns. Z φα(x) = (1 − α)−1 f (x, y)p(y)dy. f (x,y)>V aRα We then filtered out features that: It has been established [9] that φα(x) can be computed by • had high percentage of missing values (We kept thresh- minimizing the following function: old as a parameter. in this paper, threshold of 0.7 was Z used.) Fα(x, ζ) = ζ + (1 − α)−1 max{f (x, y) − ζ, 0}p(y)dy, y∈ n R • were highly correlated (In this paper, threshold of 0.9 was used). and that the value ζ which attains the minimum is equal to V aRα. After that, 27 features were left in our feature set. Finding credit allocations x ∈ X that maximize the expected profit under CV aR is equivalent to the following optimiza- We used Matlab for computation. The data was trained tion problem [5]: on 42 months and tested on 19 months. Two models were trained: standard logistic model and stepwise logistic model. min − R(x) x∈X,ζ∈R In stepwise regression, explanatory variables are added to a (1) subject to F model by an automatic procedure [4]. Regression coefficients α(x, ζ ) ≤ ω are estimated by maximum likelihood estimation. where R(x) is the expected profit and the set X is given by a set of box constraints (lower and upper bounds on each 4.2 Results component of x). Test data consisted of 20% of the data (not used in training). There were 16 features chosen in stepwise model. 
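The feature filtering step mentioned above, which drops features with too many missing values (threshold 0.7) and one feature from every highly correlated pair (threshold 0.9), could look roughly like the sketch below; the data frame is hypothetical and the exact reading of the 0.7 threshold (allowed share of missing values) is an assumption.

import numpy as np
import pandas as pd

# hypothetical financial-indicator features for five companies
features = pd.DataFrame({
    "solvency":  [0.2, 0.9, np.nan, 0.1, 0.5],
    "liquidity": [1.1, np.nan, np.nan, np.nan, np.nan],
    "debt":      [0.5, 0.6, 0.7, 0.8, 0.9],
    "debt_copy": [0.5, 0.61, 0.7, 0.81, 0.9],   # nearly identical to "debt"
})

# drop features whose share of missing values exceeds 0.7
missing_share = features.isna().mean()
kept = features.loc[:, missing_share <= 0.7]

# drop one feature from every pair with absolute correlation above 0.9
corr = kept.corr().abs()
to_drop = set()
columns = list(kept.columns)
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        if corr.iloc[i, j] > 0.9 and columns[j] not in to_drop:
            to_drop.add(columns[j])

selected = kept.drop(columns=sorted(to_drop))
print(list(selected.columns))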
Both of 5.1 Computation the models have 0 p-values and return very similar results By using the PDs from the first part of the paper, we can in predicted values. simulate the default events and compute several random sce- narios for our portfolio. Since each company is assigned We evaluated models by comparing the amount of disputed a probability of default we can generate random scenarios claims (true negatives) to the amount of missed trading (where certain companies default) over the full portfolio by volume (false positives) given ceased trading with compa- sampling from independent (with different weight) Bernoulli nies with PD exceeding some threshold (Figure 2). Note, random variables. that expenses based on missed trading volume cannot be directly compared to disputed claims; disputed claims are a By generating a set of sample scenario vectors y direct expense, whereas missed trading volume number con- 1, . . . , yJ with their corresponding probabilities π sist largely of expenses (that a company in that case did 1, . . . , πJ we can ap- proximate F (x, y) by a finite sum: not have). Hence, one needs to multiply the trading volume with company’s (average) margin to obtain actual opportu- J X nity costs. ˜ Fα(x, ζ) = ζ + (1 − α)−1 πj max{f (x, y) − ζ, 0} j=1 5. CREDIT LIMITS MODEL Using Naturally, a question arises once we identify risky clients: how to handle them? Client should have set a credit limit, zj ≥ f (x, yj ) − ζ, zj ≥ 0, j = 1, . . . , J, ζ ∈ R 23 PD vs. amount of credit PD vs. amount of credit the constraints in (1) can be reduced to a system of linear constraints: J X ζ + (1 − α)−1 πj zj ≤ ω (2) j=1 PD PD f (x, yj ) − ζ − zj ≤ 0, zj ≥ 0, j = 1, . . . , J, ζ ∈ R (3) In our case, the expected profit is 0 0.2 0.4 0.6 0.8 1 1.2 0 0.2 0.4 0.6 0.8 1 1.2 R(x) = (1 − pd) · x · margin − pd · x. Approved credit / max credit Approved credit / max credit (a) (b) As for PD modeling, we used Matlab for computation. We PD vs. amount of credit PD vs. amount of credit combined left-side part of constraints from (2) and (3) in a matrix A; the right-hand side was combined in a vector b. We also added additional constraints on x, that are specific to our problem: we set an upper and lower bound (ub and lb correspondingly) to the credit limit. PD PD We then have to solve linear program: Ax ≤ b min −R(x) such that 0 0.2 0.4 0.6 0.8 1 1.2 0 0.2 0.4 0.6 0.8 1 x lb ≤ x ≤ ub Approved credit / max credit Approved credit / max credit (c) (d) 5.2 Results Our method provides an optimal portfolio based on α, ω, Figure 3: Each circle in these scatter-plots represents one margin, PDs, credit limit upper- and lower bounds. By op- company. The x-axis represents the relative amount of ap- timal portfolio we mean monthly credit limit for each com- proved credit. Y-axis represents predicted PD of a company. pany, which takes value between the provided credit limit In 3a default values are used, while Figure (b) is based on bounds. In Figure 3 we present some scatter-plots based on reducing ω by a factor of 10, figure (c) is based on increasing results; the default values used for these graphs are α = 0.95, the margin by a factor of 10 and figure (d) is based on lifting lb = 0, ub = max trading volume (based on historical data) the lower bound to 10% of the upper bound. The scales of and margin = 0.01. Most companies get either zero or max- PDs are omitted on the graphs due to privacy concerns. imum credit approved; there are only some companies the get part of the max credit approved. 
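To make the scenario-based formulation above concrete, the following sketch sets up and solves the CVaR-constrained credit-limit problem as a linear program with NumPy/SciPy instead of the authors' Matlab code. The PDs, margin, α, ω, bounds and number of scenarios are hypothetical, and the loss in a scenario is simplified to the credit extended to the companies that default in it.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

pd_vec = np.array([0.002, 0.005, 0.03])     # predicted PDs of three clients
margin = 0.01                               # assumed profit margin on trading volume
alpha, omega = 0.95, 50.0                   # CVaR level and maximum allowed CVaR
lb, ub = 0.0, 1000.0                        # credit limit bounds per client

n, J = len(pd_vec), 500
scenarios = rng.random((J, n)) < pd_vec     # Bernoulli default indicators y_j
pi = np.full(J, 1.0 / J)                    # scenario probabilities

# decision vector: [x_1..x_n, zeta, z_1..z_J];
# maximize R(x) = sum_i ((1 - pd_i) * margin - pd_i) * x_i, i.e. minimize -R(x)
c = np.concatenate([-((1 - pd_vec) * margin - pd_vec), [0.0], np.zeros(J)])

# CVaR constraint: zeta + (1 - alpha)^-1 * sum_j pi_j * z_j <= omega
row_cvar = np.concatenate([np.zeros(n), [1.0], pi / (1 - alpha)])

# scenario constraints: loss_j(x) - zeta - z_j <= 0, with the loss taken as the
# credit extended to the companies that default in scenario j
rows_scen = np.hstack([scenarios.astype(float), -np.ones((J, 1)), -np.eye(J)])

A_ub = np.vstack([row_cvar, rows_scen])
b_ub = np.concatenate([[omega], np.zeros(J)])
bounds = [(lb, ub)] * n + [(None, None)] + [(0, None)] * J

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("credit limits:", np.round(result.x[:n], 2))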
In 3b we decreased ω by a factor of 10. This is a lot stricter constraint and Default Risk: The Good, The Bad, and The Anomaly consequently there are more companies that get zero credit (March 2015), 2015. approved and many companies that get only a part of max [4] R. R. Hocking. A biometrics invited paper. the credit approved. In 3c we increased margin from 0.01 to 0.1. analysis and selection of variables in linear regression. In practice this means, that the credit-giver is making profit Biometrics, 32(1):1–49, 1976. on trading volume. In 3d, we moved credit lower bound [5] P. Krokhmal, J. Palmquist, and S. Uryasev. Portfolio from zero to 0.1 · ub. This makes the PD threshold stricter. optimization with conditional value-at-risk objective and constraints. Journal of risk, 4:43–68, 2002. 6. CONCLUSION [6] A. Marouani. Predicting default probability using We presented a logit model based on weight of evidence fea- delinquency: The case of french small businesses. tures to predict a company’s probability of default. Standard Available at SSRN 2395803, 2014. and stepwise methods were used to train the data. Both [7] R. C. Merton. On the pricing of corporate debt: The methods provide similar results. risk structure of interest rates. The Journal of finance, 29(2):449–470, 1974. In second part of the paper we introduce an efficient portfolio [8] D. Michala, T. Grammatikos, S. F. Filipe, et al. optimization technique that was used to determine credit Forecasting distress in european sme portfolios. EIF limits for creditor’s clients. We presented the results and Research & Market Analysis Working Paper, 17, 2013. showed how variation of inputs impacts the results. [9] R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 7. REFERENCES 2000. [1] E. I. Altman. Financial ratios, discriminant analysis [10] D. Sharma. Evidence in favor of weight of evidence and the prediction of corporate bankruptcy. The and binning transformations for predictive modeling. journal of finance, 23(4):589–609, 1968. available at: http://ssrn.com/abstract=1925510, [2] S. T. Bharath and T. Shumway. Forecasting default September 2011. with the kmv-merton model. In AFA 2006 Boston [11] I.-C. Yeh and C.-h. Lien. The comparisons of data Meetings Paper, 2004. mining techniques for the predictive accuracy of [3] S. Ferreira Filipe, T. Grammatikos, and D. Michala. probability of default of credit card clients. Expert Pricing default risk: The good, the bad, and the Systems with Applications, 36(2):2473–2480, 2009. anomaly. Theoharry and Michala, Dimitra, Pricing 24 Big Data Analysis Combining Website Visit Logs with User Segments and Website Content Matic Kladnik Blaž Fortuna Pat Moore Jožef Stefan Institute Jožef Stefan Institute Bloomberg LP Jamova cesta 39 Jamova cesta 39 New York, USA Ljubljana, Slovenia Ljubljana, Slovenia pmoore26@bloomberg.net matic.kladnik@ijs.si blaz.fortuna@ijs.si ABSTRACT 2.2 User Segment Data This paper provides three use-cases of combining website visit logs Web ecosystem allows for deeper annotation of users based on their with user segment data, website content and stock volume data. The behavior around the web. For example, visit to a car dealership use-cases provide concrete examples showcasing how to derive website can be noted by a third-party data provider, which has an insights into behavior of users on the website. 
We also present how agreement and some tracking mechanism installed on the to efficiently derive required data using MapReduce technique, and dealership’s website. Such user segment data is further aggregated implement the use-cases using QMiner data analytics platform. and distributed by data providers such as Krux. Connecting visitors to our website with external database of user segments can provide Categories and Subject Descriptors a much richer understanding of our audience and, as we will see in the first use-case, allows for some interesting analysis. Example of Information Systems → World Wide Web → Web Mining user segments covered include estimated household income, General Terms gender, estimated home net worth and others. Algorithms, Management 2.3 Website Content Keywords Content of the pages on our website provide additional source of Web log analysis, Big data, MapReduce, Hadoop, Hive, QMiner data we can use in the analysis when joined with visit logs and user segment data. Content can be represented on different levels of 1. INTRODUCTION granularity: words, entities, topics, etc. Visitors on modern websites leave behind large amounts of traces In our use-cases we collect the content by crawling the URLs in form of a web server log, query logs, pixel tracking logs, etc. mentioned in the visit logs and extracting their content by removing Website owners analyze these traces to better understand how boilerplate. Each page is processed using a standard NLP pipeline visitors navigate their website, identify their interests, and optimize [1], providing us with list of topics, named entities and tickers advertising opportunities. Most typical use-cases are covered by (stocks from companies). end-user tools such as Google Analytics, ComScore, Omniture, etc. More sophisticated use-cases, which require custom or ad-hoc 2.4 Stock Volume Data processing of the data, are covered by big data frameworks such as Our last use-case combines visit logs with market data. We use MapReduce and its implementations such as Apache Hadoop. price and trading volume data provided by Kibot. In this paper we However, standard tools provide little or no support for analysis will be using data in hourly intervals, meaning that trade volume that requires combining visit logs with content presented on the values are accumulated by hours. website. In this paper we present three use-cases which combine big data frameworks and text mining approaches with the goal of 3. PROCESSING deeper understanding of visitors and their habits. We process input data using MapReduce parallelization paradigm. MapReduce processes the parts of data separately in the map phase Paper is organized as follows. First we provide a quick overview of and later joins the partial results in the reduce phase. This allows us the input data, and outline how we prepare it for the use-cases using to easily split processing into multiple computing units that can be MapReduce framework. We conclude by outlining three use-cases executed in parallel. In this paper we use Apache Hadoop [2] as applied to data from a large news website: comparing user MapReduce implementation and Apache Hive [3], which allows us segments, identifying user segments by example article and to execute SQL-like queries on Hadoop. correlating company news visit logs with stock volume. 3.1 Aggregating Visit Logs and Segment Data 2. INPUT DATA First we address the task of joining visit logs with user segment data 2.1 Website Visit Logs using a cookie shared by both datasets. 
Join can be executed using the query presented in Figure 1. The query joins three tables: a table Visit logs provide log of page views on a specific website. Each with the visit logs, a table linking user IDs from visit logs and page view is described by the time, user ID (stored as a first-party segment IDs from segment data, and a table providing segment cookie), page URL, referral URL, user agent, etc. Such information description. is typically collected using a pixel tracking mechanism and is commonly used by all standard web analytics tools. In use-cases We tested the query on one week of visit logs on several cluster we use ComScore as the data provider, compositions and the results are presented in Table 1. We can see how the performance of the cluster improves when adding additional instances. However, when going from 4 to 6 instances, we can see a smaller deduction in time taken. That is due to the last 25 reduce process taking similar amount of time with any number of As we can see in Table 1, Table 2 and Table 3, adding additional instances. task instances to the cluster can greatly affect the performance. However, at some point performance is improved less effectively Table 1. Performance analysis on segment query when adding additional task instances. This depends on the amount Instances Duration Improvement of data, number of files the data is stored in and types of task 2 3104 seconds -- instances. The last reduce task usually takes some time to complete 4 1763 seconds x1.76 and is processed on a single instance, which is why it’s time taken 6 1427 seconds x2.18 to process cannot be improved. INSERT INTO TABLE visitors_segids Table 2. Performance on visit logs data in external table SELECT seg_userid, segid Instances Duration Improvement FROM visit_logs t1, segment_data t2 2 713 seconds -- WHERE t1.seg_userid = t2.seg_userid; 4 446 seconds x1.60 6 344 seconds X2.07 INSERT INTO TABLE visitors_segvals Table 3. Performance on visit logs data in internal table SELECT seg_userid, segid, segment_title Instances Time taken Improvement FROM visitors_segids t1, sikdd_segments t2 2 275 seconds -- 4 265 seconds x1.04 WHERE t1.segid = t2.segid; 6 234 seconds x1.18 Figure 1. The query creates mapping between user IDs and 4. USE CASES segment names by joining visit logs and data segments. In this section we present three use-cases that combine visit logs, segment data, content and stock volume data. Use-cases were implemented in QMiner [4], a data analytics platform for INSERT INTO TABLE user processing large-scale real-time streams containing structured and SELECT userid, collect_set(url) unstructured data. FROM sikdd_visit_logs 4.1 Comparing User Segments GROUP BY userid; The goal of this task is to compare behavior of different user segments on the website. Segments are provided by User Segment Figure 2. Aggregating visited URLs by users IDs. data, which we joined with visit logs and website content. The use- case is prototyped as a web app using QMiner and Node.JS. 3.2 Aggregating Visit Logs and Content Figure 3 shows the interface where the user can select which segments of users should be compared. Query is specified as a We now address the task of aggregating visited pages by user. We collection of segments defining a subset of website visitors. The achieve this by simply grouping visit logs around user ID as show second group can be either a complement of the first group, or can in Figure 2 with the following example query. also be defined through another list of segments. 
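For readers without a Hadoop cluster at hand, the same two aggregation steps – joining the visit logs with the segment data on the shared cookie and collecting the set of visited URLs per user – can be sketched in pandas. The toy rows below reuse the table and column names from the queries above but are otherwise invented.

import pandas as pd

visit_logs = pd.DataFrame({
    "seg_userid": ["u1", "u1", "u2"],
    "url": ["/news/article-a", "/quote/AAPL", "/news/article-b"],
})
segment_data = pd.DataFrame({"seg_userid": ["u1", "u2"], "segid": [101, 202]})
sikdd_segments = pd.DataFrame({"segid": [101, 202],
                               "segment_title": ["engineer", "administrator"]})

# join visit logs with segment data on the shared user ID (first-party cookie),
# then attach human-readable segment titles
visitors_segids = visit_logs.merge(segment_data, on="seg_userid")
visitors_segvals = visitors_segids.merge(sikdd_segments, on="segid")

# collect the set of visited URLs per user (the collect_set step)
urls_per_user = visit_logs.groupby("seg_userid")["url"].apply(set).reset_index()

print(visitors_segvals)
print(urls_per_user)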
Results of the query are inserted into a previously created table. We In the example we compare users identified as engineers versus are calling the collect_set function, which returns an array of users identified as administrators. We can see results for this query targeted column values when grouping the results. This way we get in Figure 4. Report shows some overall statistics and the odds a column with distinct user ID values and all connected URLs. In rations that are significant for the first group. The output can be Table 2 we compare performance on this query when for several directly transformed into instructions for an ad server. cluster compositions and can see linear improvement as we increase the number of instances. In Table 3 we compare this with the performance of the cluster on internal Hive table, as opposed to ad-hoc table pulled from some external source (e.g. Amazon S3). As we can see, the performance can be boosted by reading data from an internal table. When using an internal table, the data is stored locally on the instance. External tables are useful for reading data from an external source (e.g. AWS S3 service), which has to support the HDFS (Hadoop File System). External tables can still be stored on the local HDFS of the instance. Copying data from an external source to an internal table comes in handy when a certain table is used frequently, otherwise the difference in time taken is overturned by the amount of time needed to copy data from an external data source to an internal table. There is also a question of resources availability as moving data to the local instance can fill a lot of the cluster’s available storage space. In some cases, the data is simply too large to be moved to the cluster’s storage. Figure 3. Interface for selecting which segments we want to compare. E.g. engineer vs. administrator 26 trading happens during US trading hours, when also website traffic spikes. If we compare article visits with the trading volume, we can see a clear discrepancy with articles generating more traffic over first few days and the stock having higher trade volume over last few days. Figure 4. Segment query results 4.2 Analyzing Interests for Topics The goal of this task is to understand who is interested in specific topics. Topic can be either selected from a predefined list of categories (e.g. Advertisement), or defined via a document. For example, we can search for articles on Product Innovation, which is not included in the predefined list, by using Wikipedia page on product innovation as a query. Result of such as query is a subset of pages from the website that fit the given topic. We can then join this by using mapping from Section 3.2 to obtain list of visitors that read this content, and, by using mapping from Section 3.1, also their segments. Note that these joins can be executed on one month of data in QMiner in real- near time. Result is several reports. First, we can generate same kind of report as shown in Figure 4, by contrasting readers of identified pages with the rest of the website’s visitors. Second, we can aggregate visited pages and their content in order to identify top topics, people and keywords, as presented in Figure 5 and Figure 6. Figure 5. Results for querying on Advertisement input text 4.3 Comparing traffic and stock volume data As the last use-case we will check for correlation between the stock volume data and a combination of visits logs and content pages. All examples will be focused on AAPL (Apple Inc.) 
4.2 Analyzing Interests for Topics
The goal of this task is to understand who is interested in specific topics. A topic can either be selected from a predefined list of categories (e.g. Advertisement) or defined via a document. For example, we can search for articles on Product Innovation, which is not included in the predefined list, by using the Wikipedia page on product innovation as a query. The result of such a query is a subset of pages from the website that fit the given topic. We can then join this, by using the mapping from Section 3.2, to obtain the list of visitors that read this content and, by using the mapping from Section 3.1, also their segments. Note that these joins can be executed on one month of data in QMiner in near real-time.

The result is several reports. First, we can generate the same kind of report as shown in Figure 4, by contrasting readers of the identified pages with the rest of the website's visitors. Second, we can aggregate the visited pages and their content in order to identify top topics, people and keywords, as presented in Figure 5 and Figure 6.

Figure 5. Results for querying on the Advertisement input text.

Figure 6. Word cloud with top keywords from pages discussing the topic of Advertisement.

4.3 Comparing Traffic and Stock Volume Data
As the last use-case we check for correlation between the stock volume data and a combination of visit logs and content pages. All examples are focused on AAPL (Apple Inc.) and we look at the data aggregated by hour. We start by creating two time-series from the visit logs: (a) the number of visits to the AAPL quote page, and (b) the number of visits to articles related to Apple. The first time-series we can obtain directly from the visit logs. For the second, we first have to identify significant mentions of the entity Apple in the news articles and check their visit statistics. We compare these with the trade volume of the AAPL stock. The intuition is that the more news there is about a specific company, the more trading there will be in its stock.

First we compare quote page visits with the trading volume. As we can see in Figure 7, the stock volume values fluctuate in similar patterns and time intervals as the number of requests. Most of the trends can be explained by daily patterns, where most of the trading happens during US trading hours, when website traffic also spikes. If we compare article visits with the trading volume (Figure 8), we can see a clear discrepancy, with the articles generating more traffic over the first few days and the stock having a higher trade volume over the last few days.

Figure 7. AAPL quote requests and volume.

Figure 8. AAPL site visits and volume.

We can also check the correlation between the trade volumes and the visit data. Figure 9 shows the correlation between quote requests and stock volume. The x-axis represents the offset of the request time with respect to the volume values time, measured in hours. For example, when the offset is -1, we compare quote request data at hour (h - 1) with stock volume data at hour h. The largest correlation between the number of quote requests and the stock volume values is within the same hour (no offset); in this case the correlation coefficient is significant, with a value of 0.62. If we take the number of quote requests one hour after the volume values, we still get a significant correlation coefficient of 0.46, and an even higher 0.54 if we take quote requests from one hour prior to the volume values. The correlation coefficients fall further the larger the offset we take between the stock volume values and the quote request values. In any case we can observe that the correlation between the two series is significantly high.

Figure 9. AAPL requests and stock volume correlation.

The correlation coefficients in Figure 10 tell us that there is some slight correlation between the number of visits to pages that mention Apple Inc. (AAPL) and the volume of the AAPL stock on stock exchanges when taking visits prior to the volume movement into account. Similarly to Figure 9, the x-axis in Figure 10 represents the offset of the site visit time with respect to the time in the stock volume data. There is a small jump in correlation when taking site visit values that are offset by 2 hours into the past compared to the stock volume time, although even at that point the correlation coefficient is only 0.13. There is a negative correlation when taking site visit values from the hours following the time of the stock market volume values. We can conclude that the correlation between quote requests and stock volume data is much stronger than the correlation between site visits and stock volume data. However, we can also observe that we only get a positive correlation between site visits and stock volume when taking the site visits that occurred prior to the stock volume into account.

Figure 10. AAPL site visits and stock volume correlation.
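The offset correlations in Figures 9 and 10 can be reproduced with a few lines of pandas; the sketch below uses synthetic hourly series, since it is only meant to illustrate the computation, not the original QMiner implementation.

import numpy as np
import pandas as pd

# Hypothetical hourly counts; in the use-case they come from the visit
# logs (quote-page requests or article visits) and from AAPL trade volume.
np.random.seed(0)
visits = pd.Series(np.random.poisson(50, 24 * 7))
volume = pd.Series(np.random.poisson(1000, 24 * 7))

# Offset -1 pairs visits at hour (h - 1) with volume at hour h,
# mirroring the x-axis of Figures 9 and 10.
for offset in range(-3, 4):
    print(offset, round(visits.shift(-offset).corr(volume), 2))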
5. REFERENCES
[1] Štajner, Tadej, et al. A service oriented framework for natural language text enrichment. Informatica (Ljubljana), vol. 34, no. 3, 2010, pp. 307-313.
[2] Apache Hadoop: http://hadoop.apache.org
[3] Apache Hive: https://hive.apache.org
[4] Fortuna, Blaz, et al. QMiner – Data Analytics Platform for Processing Streams of Structured and Unstructured Data. Software Engineering for Machine Learning Workshop, Neural Information Processing Systems, 2014.

Visual and Statistical Analysis of VideoLectures.NET

Erik Novak, Artificial Intelligence Laboratory, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, +386 31 272 332, erik.novak@ijs.si
Inna Novalija, Artificial Intelligence Laboratory, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, +386 1 4773144, inna.koval@ijs.si

ABSTRACT
This paper presents learning analytics tools for visual and statistical analysis of data from a portal of video lectures. While learning analytics methods traditionally deal with the measurement, collection and analysis of data about learners with the aim of improving the learning process, our solutions are targeted at viewers of VideoLectures.NET. The novel VideoLectures Learning Analytics Dashboard and Lecture Landscape tools allow observing, searching and analyzing viewer behavior and, at the same time, efficiently present the information from the portal to the viewers.

General Terms
Learning Analytics, Measurement, Performance, Design.

Keywords
visualization, analytics, data mining, VideoLectures.NET

1. INTRODUCTION
VideoLectures.NET is a free and open access educational video lectures repository. There are over 20,000 lectures by distinguished scholars and scientists at the most important and prominent events, such as conferences, summer schools, workshops and science promotional events. Lectures on the VideoLectures.NET portal are categorized into various categories, such as Arts, Biology, Business, Computer Science, Social Sciences, Technology etc.

In this paper we present tools for visual and statistical analysis of VideoLectures.NET portal data – the VideoLectures Learning Analytics Dashboard and the Lecture Landscape. While the goal of the VideoLectures Learning Analytics Dashboard is to aggregate, harmonize and analyze event data using various data analysis techniques, the Lecture Landscape visualizes the information about videos, categories and authors in an efficient and user-friendly way.

2. RELATED WORK
Thomas & Cook [14] state that visual analysis should be employed as a means to reveal patterns and trends within data, to develop more intuitive perception and to help with in-depth analysis. Conde et al. [8] present a learning analytics dashboard and its application in real-world case studies. A conceptual framework is developed by Bakharia et al. [7]. In our work we adopt both approaches to learning analytics – visual and statistical analysis, which allows a deeper and more intuitive understanding of the obtained results.
3. LEARNING ANALYTICS FOR VIDEOLECTURES
The VideoLectures Learning Analytics Dashboard [1] is a tool developed for the analysis of viewer behavior and for detecting which lectures are interesting for the users. We present the results of the preliminary work and the functionalities of the dashboard.

Data Processing. The data considered in the analysis is a set of log files from the VideoLectures portal that contain raw events on the portal from September 2012 until December 2015. We have processed 11.3 GB of log files that included the ID, timestamp, session, log entry, lecture and other information such as event type, IP address, location (if present) etc.

With the processed raw log files, we established the main event types that appeared in the raw logs: view (the user accessed the lecture webpage), download (the user downloaded a presentation, video etc. from the lecture webpage) and search (the user performed a search at the portal).

In addition to the raw log files, we have also collected the log files dedicated to the behavior of a particular user while watching a particular lecture, which we call ranges log files. These record the actions of the user while watching a video, such as moving forward or backward in the player, skipping a video section etc.

Analysis of Log Files. In order to analyze user behavior on the VideoLectures.NET portal, we have utilized and developed a set of data analysis techniques. These were used in the development of the VideoLectures Learning Analytics Dashboard, which contains both statistical analysis and visual exploration features. The analysis has been performed from four perspectives: the aggregated perspective of all lectures, the perspective of a single lecture, the aggregated perspective of all viewers and the perspective of a single viewer.

In addition, we have developed a set of metrics for viewers and lectures that provide us with an insight into the user behaviour on the VideoLectures.NET portal. The lecture metrics measure the number of views and viewers the lecture has, the average and standard deviation of the time the lecture was watched (in minutes and as a percentage) and the average and standard deviation of the moves going forwards and backwards through the lecture video (in minutes and as a percentage). The viewer metrics measure the number of lecture views the viewer made, the average and standard deviation of the time spent watching (in minutes and as a percentage) and the average and standard deviation of the moves going forwards and backwards through the lecture video (in minutes and as a percentage).
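A minimal sketch of how such lecture metrics can be computed is given below; the column names and the flat one-row-per-session layout are assumptions made for illustration, not the actual ranges log format.

import pandas as pd

# Hypothetical view sessions: one row per (lecture, viewer) session,
# with the watched time in minutes.
views = pd.DataFrame({
    "lecture": ["L1", "L1", "L2", "L1"],
    "viewer": ["v1", "v2", "v1", "v1"],
    "watched_min": [12.0, 30.5, 7.2, 4.0],
})

# Per-lecture metrics: number of views, distinct viewers,
# mean and standard deviation of watch time.
lecture_metrics = views.groupby("lecture").agg(
    views=("viewer", "size"),
    viewers=("viewer", "nunique"),
    watch_mean=("watched_min", "mean"),
    watch_std=("watched_min", "std"),
)
print(lecture_metrics)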
First we analyzed a set of the 1000 most popular (by views) videos relevant to the Data Science category. Those videos produced 7,055,427 views and 3,020,090 downloads in the period from September 2012 until December 2015. The number of visitors for these videos was 1,045,860. Moreover, there were 866,912 searches at the portal. We have then analyzed all of the log files, for which the statistics can be found on the Learning Analytics Dashboard.

3.1 VideoLectures Learning Analytics Dashboard
Our interest is not only to make a statistical analysis of viewer behavior but also to have visual exploration features that enable us to see the traffic and activities on the VideoLectures portal. For this we created graphs that show the number of views, downloads and searches the viewers make. An example can be seen in Figure 1, which shows the view trends.

Figure 1. VideoLectures views trend. It shows the total number of lecture views from September 2012 until December 2015.

Statistics for a particular lecture are also available. On the per lecture tab one can search for the desired lecture and get its title, description, the measures returned by the lecture metrics and a graph showing the lecture's activity. Figure 2 shows the lecture information and measures of the "Deep Learning in Natural Language Processing" lecture.

Figure 2. Lecture information for the "Deep Learning in Natural Language Processing" lecture. It displays the basic lecture information, measures and a graph of its activity.

The overall statistics of the viewers can also be found under the all viewers tab. It shows the distribution of the viewers across countries (see Figure 3) and other statistics measured with the viewer metrics.

Figure 3. Viewer distribution across countries. Most viewers come from the United States.

To implement the VideoLectures Learning Analytics Dashboard, we used QMiner [12] for processing the data, Node.js [3] for creating the server and the dashboard, and Highcharts [4] for the graphical implementation and dynamics support. Using the interactive VideoLectures Learning Analytics Dashboard, it is possible to check which lectures are of broad interest at the VideoLectures.NET portal, how the users of the portal behave through time, where the users come from and how they react to specific videos.

4. VIDEOLECTURES EXPLORER
Our interest is not only to analyze the viewer behavior but also to see the similarities between lectures. For this we developed the VideoLectures Explorer [2], a tool for exploring the lectures published on VideoLectures.NET. The tool enables the user to search through the lectures and find similarities between them, e.g., to find lectures of a specific category or presenter of interest. Here we explain the method used for visualizing lecture similarities and present the tool's functionalities.

4.1 Data Acquisition
The database used for the visualization and the basic statistical analysis contains data on all lectures found on VideoLectures.NET. We have acquired data for 23,224 lectures, keynotes, interviews, events etc. For each lecture the database contains the lecture's title and description, the name of the presenter and his affiliation with city and country, the lecture's publication date, video duration, its parent event, the number of views and the scientific categories the lecture belongs to. The scientific categories have a hierarchical structure decided by the VideoLectures development team, which we use in the visualization process. The database was constructed using the VideoLectures API [5].

4.2 Methodology
Our objective is to draw the lectures into a two-dimensional vector space to allow plotting on the computer screen, while the similarities between the lectures are maintained. The method we used has been described in [11]. Here is a quick summary: using the bag-of-words [13] representation, we represent the lectures as vectors in a high-dimensional vector space, where each dimension corresponds to one word from the vocabulary. These are then used to construct the term-document matrix. Using Latent Semantic Indexing [10] we merge the dimensions associated with terms that have similar meaning and get the most suitable set of words to describe the corresponding lectures. After that we use Multidimensional Scaling [9] to reduce the dimensionality of the original multidimensional vectors and map them onto two dimensions where the distances between the vectors are preserved as well as possible. Once we have the two-dimensional coordinates we can draw the landscape using the preferred visualization library.

The features used for representing the lecture similarities are the lecture's title, description, categories and parent event. We used the algorithm described above to calculate the lecture coordinates and draw the landscape using the d3.js visualization library [6].

Figure 4. The landscape of machine learning lectures, created using the "Machine Learning" keyword. Lectures that are more similar are brought closer together.
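The pipeline of Section 4.2 can be approximated with scikit-learn; the snippet below is an illustrative sketch on toy documents (the authors used QMiner), with TF-IDF standing in for the plain bag-of-words weighting.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS

# Toy "lectures" (title + description text); the real features also
# include the categories and the parent event.
docs = [
    "deep learning for natural language processing",
    "convolutional neural networks for image recognition",
    "bayesian methods in machine learning",
    "introduction to art history",
]

# Term-document matrix (bag-of-words with TF-IDF weighting).
X = TfidfVectorizer().fit_transform(docs)

# Latent Semantic Indexing: merge dimensions of semantically related terms.
X_lsi = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

# Multidimensional Scaling down to two dimensions for plotting,
# preserving pairwise distances as well as possible.
coords = MDS(n_components=2, random_state=0).fit_transform(X_lsi)
print(coords)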
4.3 Explorer Functionality
Each lecture is presented by a point whose size maps to the number of views. Similarity between lectures is mapped to the distance between points; more similar lectures are brought closer together. Hovering over a point brings up a tooltip containing information about the lecture: its title and description, the name of the presenter and his affiliation, the language in which the lecture was presented, which scientific categories it belongs to, its duration, when it was published and the number of views since the last database update. Any data attribute for which values are not available is omitted from the tooltip. Landmarks show areas populated with lectures that are mostly of the same category. The user can zoom in and out of the landscape to get a more detailed look at the lectures. An example of the landscape can be seen in Figure 4, which shows the machine learning lecture landscape.

Clicking on a point shows the lecture's information: its title and presenter, whether the lecture is public and enabled, and a link to the lecture video located on VideoLectures.NET. Clicking on the lecture link opens the corresponding lecture video on the portal.

When the landscape is generated, the dashboard also shows an additional information window (see Figure 5). It has two parts: the first part is the query information, containing the names of the presenters, organizations and categories, the minimum and maximum number of views, the organization location and the language the user used to query the data. The second part contains the basic statistics about the queried data: the number of lectures in the queried data, the total number of lecture views and the scientific categories with the number of their occurrences in the queried data. Clicking on a category in the dashboard automatically queries the data using the category name, and the information window is updated along with the landscape view.

Figure 5. The additional information window created using the "Machine Learning" keyword. It contains the query keywords and the overall statistics about the queried lectures.

5. CONCLUSION AND FUTURE WORK
In this paper we presented learning analytics tools for visual and statistical analysis of data from the VideoLectures.NET portal. While learning analytics methods traditionally deal with the measurement, collection and analysis of data about learners with the aim of improving the learning process, our solutions are targeted at viewers of VideoLectures. The novel VideoLectures Learning Analytics Dashboard and Explorer tools allow observing, searching and analyzing viewer behavior and, at the same time, efficiently present the information from the portal to the viewers. Future work will include the development of more efficient learning analytics suggestions for the VideoLectures.NET portal and providing recommendations on how to increase viewer engagement with the portal.

6. ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the EDSA EU H2020 project (Contract no: H2020-ICT-643937).
7. REFERENCES
[1] Videolectures learning analytics | dashboard, http://learninganalytics.videolectures.net, accessed: 2016-09-09.
[2] Videolectures learning analytics | landscape, http://explore.videolectures.net, accessed: 2016-09-09.
[3] Node.js, https://nodejs.org/en/, accessed: 2016-09-09.
[4] Interactive javascript chart for your website | highcharts, http://www.highcharts.com/, accessed: 2016-09-09.
[5] Swagger ui, http://videolectures.net/site/api/docs/, accessed: 2016-09-09.
[6] D3.js – data-driven documents, http://d3js.org/, accessed: 2016-09-09.
[7] A. Bakharia, L. Corrin, P. de Barba, G. Kennedy, D. Gašević, R. Mulder, D. Williams, S. Dawson and L. Lockyer. A conceptual framework linking learning design with learning analytics. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge, ACM, pages 329-338, 2016.
[8] M. Á. Conde, F. J. García-Peñalvo, D. A. Gómez-Aguilar and R. Therón. Exploring Software Engineering Subjects by Using Visual Learning Analytics Techniques. IEEE Revista Iberoamericana de Tecnologias del Aprendizaje, 10(4), pages 242-252, 2015.
[9] T. F. Cox and M. A. Cox. Multidimensional Scaling. CRC Press, New York, 2000.
[10] S. T. Dumais. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1):188-230, January 2004.
[11] B. Fortuna, D. Mladenić and M. Grobelnik. Visualization of temporal semantic spaces. In Semantic Knowledge Management, pages 155-169, Springer, 2009.
[12] B. Fortuna, J. Rupnik, C. Fortuna, M. Grobelnik, V. Jovanoski, M. Karlovcec, B. Kazic, K. Kenda, G. Leban, J. Novljan, M. Papler, L. Rei, B. Sovdat, L. Stopar and A. Muhic. QMiner – Data analytics platform for processing streams of structured and unstructured data. Software Engineering for Machine Learning Workshop, Neural Information Processing Systems, 2014.
[13] G. Salton. Developments in automatic text retrieval. Science, 253(5023):974-979, August 1991.
[14] J. J. Thomas and K. A. Cook. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE CS Press, Los Alamitos, 2005.

Spatio-temporal clustering methods

Matej Senožetnik, Luka Bradeško, Blaž Kažič, Dunja Mladenić, Tine Šubic
Jožef Stefan Institute and Jožef Stefan International Postgraduate School
Jamova cesta 39, 1000 Ljubljana, Slovenia

ABSTRACT
Tracking a person, an animal or a vehicle generates a vast amount of spatio-temporal data that has to be processed and analyzed differently from the ordinary data generally used in knowledge discovery. This paper presents existing spatio-temporal clustering algorithms suitable for such data and compares their running time, noise sensitivity, quality of results and the ability to discover clusters according to non-spatial, spatial and temporal values of the objects.

Keywords
clustering, spatio-temporal data, density-based algorithms

1. INTRODUCTION
Due to the emerging field of ICT and the rapid development of sensor technologies, a lot of spatio-temporal data has been collected in the past few years. By processing and enriching raw spatio-temporal data we aim at extracting semantic information, which is a basic requirement for the comprehension and later usage of this data.
Normally, the very first step of this process is clustering raw GPS coordinates into more distinct groups of points which already carry some semantics, such as whether the points belong to a trajectory or to a stationary point (i.e., a stay point).

This paper aims to investigate different methods used for clustering spatio-temporal data generated by mobile phones, collected as timestamped GPS coordinates of the phone's location. By clustering the collected coordinates, we obtain so-called stay points (also referred to as points of interest or stop points [15]), which are the points in space where a moving object has stayed within a certain distance threshold for a longer period of time [16]. When the stay points are calculated, we can process the data further to calculate the most frequently visited locations (i.e., frequent locations), which is the fundamental building block for further advanced analytics use-cases, such as next location prediction [6].

In general, clustering methods are separated into the following categories:

• Partitioning methods - classify data into k groups or partitions,
• Hierarchical methods - hierarchically decompose given datasets,
• Density-based methods - are based on cluster density, where clusters stop growing when the neighbourhood density stops exceeding a given threshold,
• Model-based methods - definition copied from [1]: "In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method locates the clusters by clustering the density function. It reflects spatial distribution of the data points."

The most appropriate methods for clustering spatio-temporal data are density-based methods, as they regard clusters as dense regions of objects in a data space that are separated by regions of low density [7]. Thus we will focus on density-based methods and their application to clustering GPS coordinates.

2. DENSITY-BASED ALGORITHMS
Table 1 shows some of the more common density-based algorithms used for the detection of stay points. For a successful detection of stay points, it is important that the clustering algorithm uses temporal information alongside the bare spatial data. As indicated in Table 1, most of the algorithms do make use of it, except for DBSCAN. The quality of an algorithm also depends on its noise sensitivity (the less sensitive the algorithm is, the better), as cellphones' GPS coordinates are frequently noisy due to connection glitches (for instance, when moving through forests, staying inside buildings or in bad weather conditions). Some algorithms, such as DBSCAN, ST-DBSCAN and OPTICS, return only clusters, while CB-SMoT, SMoT and SPD return stay points (also referred to as stops) or paths (also referred to as moves).

DBSCAN has an overall average running time of O(n log n), and its worst-case run time complexity is O(n²). The running time depends on the parameter choice and the version of the implementation. OPTICS has a similar time complexity but is 1.6 seconds slower than DBSCAN. ST-DBSCAN has the same running time as DBSCAN.

Table 1: The most common density-based algorithms used for the detection of stay points [14].
Algorithm     Spatio-temporal   Noise sensitivity   Returns stay points/path
DBSCAN        no                no                  no
ST-DBSCAN     yes               no                  no
SMoT          yes               yes                 yes
CB-SMoT       yes               no                  yes
SPD           yes               yes                 yes
OPTICS        yes               no                  no

2.1 Density Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN [5] is a density-based clustering algorithm which identifies arbitrarily-shaped clusters and detects noise in a given dataset. The algorithm starts with the first point in the dataset and detects all neighboring points within a given distance. If the total number of these neighboring points exceeds a certain threshold, all of them are treated as part of a new cluster. The algorithm then iteratively collects the neighboring points within a given distance of the core points. The process is repeated until all of the points have been processed. DBSCAN's advantages are that it robustly detects outliers, only needs two parameters (Eps and MinPts), is appropriate for large datasets, and the data input order does not interfere with the results [12].
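As an illustration of how DBSCAN behaves on raw GPS fixes, the sketch below uses scikit-learn's implementation with the haversine metric; the coordinates and the 100 m radius are made-up values, not data from the paper.

import numpy as np
from sklearn.cluster import DBSCAN

# Toy GPS fixes (latitude, longitude) in degrees.
points_deg = np.array([
    [46.0500, 14.4680], [46.0501, 14.4682], [46.0499, 14.4679],  # stay A
    [46.0569, 14.5058], [46.0570, 14.5057], [46.0568, 14.5059],  # stay B
    [46.2000, 14.6000],                                          # lone fix
])

# With metric="haversine" the inputs must be in radians and eps is an
# angular distance: radius in metres divided by the Earth radius.
eps_m = 100.0
db = DBSCAN(eps=eps_m / 6_371_000.0, min_samples=3, metric="haversine")
labels = db.fit_predict(np.radians(points_deg))

# Label -1 marks noise; other labels are cluster ids (candidate stay areas).
print(labels)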
Numerous research studies have extended DBSCAN. One example is GDBSCAN [13], a generalization of the original DBSCAN that can cluster point objects as well as spatially extended objects. Another extended algorithm is DJ-Cluster [17], used for discovering personal gazetteers; it attaches semantic information to clusters and requires a list of points of interest. A further extension is ST-DBSCAN, which is described below.

ST-DBSCAN [4] is another algorithm based on DBSCAN. It makes use of DBSCAN's ability to discover clusters of various shapes, while improving some of the weak points of the original algorithm. It adds temporal data to the clustering results, identifies noisy objects when the input data has varying densities, and more accurately differentiates adjacent clusters.

2.2 Stops and Moves of Trajectory (SMoT)
The SMoT [2] algorithm divides the data into two sets, called moves and stops. The easiest way to understand how SMoT works is by reviewing the example shown in Figure 1. There are three potential stop candidates with geometries Rc1, Rc2 and Rc3, each with information about its minimum time duration. From the figure we can observe that the point p0 is not inside any of these geometries; therefore it is a candidate for the move dataset. The next few points are inside the first stop candidate (Rc1) and also exceed the minimum time duration which is specified for every geometry. In candidate Rc2, the point duration does not exceed the minimum threshold; therefore Rc2 is not a stop.

Figure 1: Example of trajectory points with three possible candidate stops [2].

2.3 Clustering-Based Stops and Moves of Trajectory (CB-SMoT)
The CB-SMoT [11] algorithm is an alternative to the SMoT algorithm and addresses one of its main drawbacks - the incapability to detect stay points that are not predefined by the user. It uses clustering methods to automatically detect stay points. The idea behind this method is that when we move around points of interest (such as museums, monuments, night-clubs, etc.), we move slower than when we are traveling from one place to another. In the first step, the potential stops (the slower parts of a trajectory) are identified using a variation of the DBSCAN algorithm which considers one-dimensional trajectories and their speed. In the second step, the algorithm identifies the locations of the potential stops (clusters) found in the first step. The authors report [11] that their algorithm discovers fewer incorrect stops compared to the SMoT algorithm.

2.4 Stay Point Detection (SPD)
The SPD [9] algorithm works by detecting whether the observed entity has spent more than 30 minutes within a radius of 100 meters. If this happens, the region is detected as a stay point. Both the time and the distance threshold parameters are adjustable and are chosen depending on the use-case and the accuracy of the raw data.

The main advantage of SPD is that it does not need predefined structures. It is not computationally demanding, but it is sensitive to the accuracy of the data usually generated by a GPS receiver; for different GPS accuracies it returns slightly different positions for the same location. It is also sensitive to noise, but this can be partially reduced by adjusting the parameters of the algorithm.
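A simplified sketch of the SPD thresholding idea is given below; it illustrates the 100 m / 30 min rule described above, not the exact published procedure [9], and all helper names are made up.

from math import radians, sin, cos, asin, sqrt

def haversine_m(p, q):
    # Great-circle distance in metres between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def stay_points(track, dist_m=100.0, min_stay_s=30 * 60):
    # track: list of (timestamp_s, lat, lon) tuples sorted by time.
    stays, i = [], 0
    while i < len(track):
        j = i + 1
        while j < len(track) and haversine_m(track[i][1:], track[j][1:]) <= dist_m:
            j += 1
        if track[j - 1][0] - track[i][0] >= min_stay_s:
            pts = track[i:j]
            stays.append((sum(p[1] for p in pts) / len(pts),   # mean latitude
                          sum(p[2] for p in pts) / len(pts)))  # mean longitude
        i = j
    return stays

# Two fixes 40 minutes apart within 100 m yield one stay point.
print(stay_points([(0, 46.0500, 14.4680), (2400, 46.0501, 14.4681)]))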
Figure 2: The OPTICS result is called a reachability plot [10].

2.5 Ordering Points to Identify the Clustering Structure (OPTICS)
OPTICS [3] is an algorithm used for finding density-based clusters in spatio-temporal data. Though it works in a similar way to DBSCAN, OPTICS improves on DBSCAN's biggest weakness, the failure to detect clusters when the density of the data varies. In Figure 2 we can observe that the OPTICS algorithm generates an easily visualized ordering of points, which can be used to extract partitions and hierarchical clusters [10]. Zheng et al. [16] used OPTICS for clustering stay points. In their article, they present the idea that, through the use of a statistically significant number of user-generated GPS trajectories, the correlation between personal geographical locations implicit in a person's location history enables the provision of valuable services, such as personalized recommendations and targeted sales promotions.

2.6 Other algorithms
There exist many other approaches to dealing with spatial and spatio-temporal data. Some of these are developed as extensions (for example, DB-SMoT [12]) of well-known algorithms, while others take an entirely new approach (TRACLUS [8]). It is important to note that other data can also be used alongside coordinates and timestamps.

3. DISCUSSION
The goal of this discussion is to find an algorithm capable of clustering a user's raw historical location data, as tracked by a mobile phone. Referring back to Table 1, we want an algorithm which has the following properties:

• is able to cluster spatio-temporal data;
• is noise insensitive;
• returns stay points and paths.

As we already stated at the beginning of this paper, besides spatial data, temporal information is one of the most important additional pieces of information for stay point detection algorithms; it enables better stay point detection performance and empowers additional time-related capabilities, such as detecting the time spent at each stay point. By using an algorithm that clusters data only by spatial information, we lose part of the useful information (in our case, the time spent at a location), as well as the order of stay points on the timeline, which makes us unable to do further analysis, such as future location prediction and plotting frequency graphs. Due to this shortcoming, and also due to the varying cluster densities, algorithms such as DBSCAN, or its improved version DJ-Cluster, are not appropriate for our use case.

The authors of [14] point out that the problem of the DB-SMoT algorithm is that the quantile function requires a priori knowledge of the proportion between the points inside potential stops and the total points in the dataset. This proportion varies among datasets, since users sometimes spend a whole day inside a stop (i.e., the proportion is one), while on other occasions they might be visiting more than ten stops (i.e., the proportion is much smaller than one). Due to this, we need an online algorithm that can function without prior knowledge of the stay point to path ratio. The SMoT algorithm uses predefined regions, which can be a problem if we want to detect stay points outside of those zones (such as outdoors). The SPD, SMoT and CB-SMoT algorithms are all sensitive to noise, but have the advantage of returning stay points ordered by timestamps. With other algorithms, such as DBSCAN and OPTICS, we need to find the right sequence ourselves, independently of the algorithm. Given our requirements, we propose CB-SMoT, SMoT and SPD as the most suitable for clustering spatio-temporal data collected from a mobile phone.

An efficient algorithm must be insensitive to noise. Currently, GPS has reception problems in narrow valleys, forests or inside buildings, so it is important to use different data sources (for example, Wi-Fi networks or activity recognition) for accurate stay point detection. Modern mobile phones nowadays also provide rudimentary activity recognition, which we can use to compare our own results against.
If the activity recognition system is not detecting movement (the person is standing still) and our collected GPS coordinates are consistently far apart, we may be facing a problem with GPS accuracy. In such cases, additional data sources should be considered for a more accurate measurement.

To the best of our knowledge, the SPD algorithm is currently considered the best solution for the detection of stay points, but it suffers from detecting false stay points and paths. This problem can be alleviated by running multiple iterations of the algorithm on the resulting dataset. This is already outside of the scope of this paper and will be described in a separate paper which is currently under preparation.

4. ACKNOWLEDGMENT
This work was supported by the Slovenian Research Agency and the ICT program of the EC under project OPTIMUM (H2020-MG-636160).

5. REFERENCES
[1] Clustering analysis. http://www.tutorialspoint.com/data_mining/dm_cluster_analysis.htm.
[2] Luis Otavio Alvares, Vania Bogorny, Bart Kuijpers, Jose Antonio Fernandes de Macedo, Bart Moelans, and Alejandro Vaisman. A model for enriching trajectories with semantic geographical information. In Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems, GIS '07, pages 22:1-22:8, New York, NY, USA, 2007. ACM.
[3] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, pages 49-60, 1999.
[4] Derya Birant and Alp Kut. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 60(1):208-221, 2007.
[5] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Pages 226-231. AAAI Press, 1996.
[6] Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. Next place prediction using mobility Markov chains. In Proceedings of the First Workshop on Measurement, Privacy, and Mobility, page 3. ACM, 2012.
[7] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17:107-145, 2001.
[8] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. Trajectory clustering: A partition-and-group framework. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 593-604, New York, NY, USA, 2007. ACM.
[9] Quannan Li, Yu Zheng, Xing Xie, Yukun Chen, Wenyu Liu, and Wei-Ying Ma. Mining user similarity based on location history. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '08, pages 34:1-34:10, New York, NY, USA, 2008. ACM.
[10] Chris Mueller. Data Preparation. 2005.
[11] Andrey Tietbohl Palma, Vania Bogorny, Bart Kuijpers, and Luis Otavio Alvares. A clustering-based approach for discovering interesting places in trajectories. In Proceedings of the 2008 ACM Symposium on Applied Computing, SAC '08, pages 863-868, New York, NY, USA, 2008. ACM.
[12] Parya Pasha Zadeh Monajjemi. A clustering-based approach for enriching trajectories with semantic information using VGI sources. 2013.
[13] Jörg Sander, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2):169-194, June 1998.
[14] Khoa A. Tran, Sean J. Barbeau, and Miguel A. Labrador. Automatic identification of points of interest in global navigation satellite system data: A spatial temporal approach. Pages 33-42, 2013.
[15] Y. Zheng and X. Xie. Learning location correlation from GPS trajectories. In 2010 Eleventh International Conference on Mobile Data Management, pages 27-32, May 2010.
[16] Yu Zheng. Trajectory data mining: An overview. ACM Transactions on Intelligent Systems and Technology, 6(3):1-41, 2015.
[17] Changqing Zhou, Dan Frankowski, Pamela Ludford, Shashi Shekhar, and Loren Terveen. Discovering personal gazetteers: An interactive clustering approach. In Proceedings of the 12th Annual ACM International Workshop on Geographic Information Systems, GIS '04, pages 266-273, New York, NY, USA, 2004. ACM.
Indeks avtorjev / Author index

Berčič Katja 9
Bradeško Luka 5, 33
Cattaneo Gabriella 9
Cerle Gaber 9
Choloniewski Jan 13
Fortuna Blaž 21, 25
Fuart Flavio 9
Herga Zala 21
Karlovčec Mario 9
Kažič Blaž 33
Kladnik Matic 25
Leban Gregor 13
Maček Sebastijan 13
Mladenić Dunja 33
Moore Pat 25
Novak Erik 29
Novalija Inna 29
Pita Costa Joao 17
Rehar Aljoša 13
Rihtar Matjaž 17
Rupnik Jan 21
Senožetnik Matej 5, 33
Škraba Primož 21
Šubic Tine 33
Urbančič Jasna 5

Konferenca / Conference: Izkopavanje znanja in podatkovna skladišča (SiKDD) / Data Mining and Data Warehouses (SiKDD)
Uredila / Edited by: Dunja Mladenić, Marko Grobelnik