Zbornik 19. mednarodne multikonference INFORMACIJSKA DRUŽBA - IS 2016 Zvezek D Proceedings of the 19th International Multiconference INFORMATION SOCIETY - IS 2016 Volume D Izkopavanje znanja in podatkovna skladišča (SiKDD) Data Mining and Data Warehouses (SiKDD) Uredila / Edited by Dunja Mladenić, Marko Grobelnik http://is.ijs.si 10. oktober 2016 / 10 October 2016 Ljubljana, Slovenia Zbornik 19. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2016 Zvezek D Proceedings of the 19th International Multiconference INFORMATION SOCIETY – IS 2016 Volume D Izkopavanje znanja in podatkovna skladišča (SiKDD) Data Mining and Data Warehouses (Sikdd) Uredila / Edited by Dunja Mladenić, Marko Grobelnik 10. oktober 2016 / 10 October 2016 Ljubljana, Slovenia Urednika: Dunja Mladenić Laboratorij za umetno inteligenco Institut »Jožef Stefan«, Ljubljana Marko Grobelnik Laboratorij za umetno inteligenco Institut »Jožef Stefan«, Ljubljana Založnik: Institut »Jožef Stefan«, Ljubljana Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak Oblikovanje naslovnice: Vesna Lasič Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety Ljubljana, oktober 2016 CIP - Kataložni zapis o publikaciji Narodna in univerzitetna knjižnica, Ljubljana 004.8(082)(0.034.2) MEDNARODNA multikonferenca Informacijska družba (19 ; 2016 ; Ljubljana) Izkopavanje znanja in podatkovna skladišča (SiKDD) [Elektronski vir] : zbornik 19. mednarodne multikonference Informacijska družba - IS 2016, 10. oktober 2016, [Ljubljana, Slovenija] : zvezek D = Data mining and data warehouses (SiKDD) : proceedings of the 19th International Multiconference Information Society - IS 2016, 10 October 2016, Ljubljana, Slovenia : volume D / uredila, edited by Dunja Mladenić, Marko Grobelnik. - El. zbornik. - Ljubljana : Institut Jožef Stefan, 2016 Način dostopa (URL): http://library.ijs.si/Stacks/Proceedings/InformationSociety/2016/IS2016_Volume_D% 20-%20SiKDD.pdf ISBN 978-961-264-100-9 (pdf) 1. Gl. stv. nasl. 2. Vzp. stv. nasl. 3. Mladenić, Dunja 29872935 PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2016 Multikonferenca Informacijska družba (http://is.ijs.si) je z devetnajsto zaporedno prireditvijo osrednji srednjeevropski dogodek na področju informacijske družbe, računalništva in informatike. Letošnja prireditev je ponovno na več lokacijah, osrednji dogodki pa so na Institutu »Jožef Stefan«. Informacijska družba, znanje in umetna inteligenca so spet na razpotju tako same zase kot glede vpliva na človeški razvoj. Se bo eksponentna rast elektronike po Moorovem zakonu nadaljevala ali stagnirala? Bo umetna inteligenca nadaljevala svoj neverjetni razvoj in premagovala ljudi na čedalje več področjih in s tem omogočila razcvet civilizacije, ali pa bo eksponentna rast prebivalstva zlasti v Afriki povzročila zadušitev rasti? Čedalje več pokazateljev kaže v oba ekstrema – da prehajamo v naslednje civilizacijsko obdobje, hkrati pa so planetarni konflikti sodobne družbe čedalje težje obvladljivi. Letos smo v multikonferenco povezali dvanajst odličnih neodvisnih konferenc. Predstavljenih bo okoli 200 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica, ki se ponaša z 39-letno tradicijo odlične znanstvene revije. 
Naslednje leto bo torej konferenca praznovala 20 let in revija 40 let, kar je za področje informacijske družbe častitljiv dosežek. Multikonferenco Informacijska družba 2016 sestavljajo naslednje samostojne konference: • 25-letnica prve internetne povezave v Sloveniji • Slovenska konferenca o umetni inteligenci • Kognitivna znanost • Izkopavanje znanja in podatkovna skladišča • Sodelovanje, programska oprema in storitve v informacijski družbi • Vzgoja in izobraževanje v informacijski družbi • Delavnica »EM-zdravje« • Delavnica »E-heritage« • Tretja študentska računalniška konferenca • Računalništvo in informatika: včeraj za jutri • Interakcija človek-računalnik v informacijski družbi • Uporabno teoretično računalništvo (MATCOS 2016). Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in inštitucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju. V 2016 bomo četrtič podelili nagrado za življenjske dosežke v čast Donalda Michija in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe bo prejel prof. dr. Tomaž Pisanski. Priznanje za dosežek leta bo pripadlo prof. dr. Blažu Zupanu. Že šestič podeljujemo nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono je dobilo ponovno padanje Slovenije na lestvicah informacijske družbe, jagodo pa informacijska podpora Pediatrične klinike. Čestitke nagrajencem! Bojan Orel, predsednik programskega odbora Matjaž Gams, predsednik organizacijskega odbora i FOREWORD - INFORMATION SOCIETY 2016 In its 19th year, the Information Society Multiconference (http://is.ijs.si) remains one of the leading conferences in Central Europe devoted to information society, computer science and informatics. In 2016 it is organized at various locations, with the main events at the Jožef Stefan Institute. The pace of progress of information society, knowledge and artificial intelligence is speeding up, but it seems we are again at a turning point. Will the progress of electronics continue according to the Moore’s law or will it start stagnating? Will AI continue to outperform humans at more and more activities and in this way enable the predicted unseen human progress, or will the growth of human population in particular in Africa cause global decline? Both extremes seem more and more likely – fantastic human progress and planetary decline caused by humans destroying our environment and each other. The Multiconference is running in parallel sessions with 200 presentations of scientific papers at twelve conferences, round tables, workshops and award ceremonies. Selected papers will be published in the Informatica journal, which has 39 years of tradition of excellent research publication. Next year, the conference will celebrate 20 years and the journal 40 years – a remarkable achievement. 
The Information Society 2016 Multiconference consists of the following conferences:
• 25th Anniversary of First Internet Connection in Slovenia
• Slovenian Conference on Artificial Intelligence
• Cognitive Science
• Data Mining and Data Warehouses
• Collaboration, Software and Services in Information Society
• Education in Information Society
• Workshop Electronic and Mobile Health
• Workshop »E-heritage«
• 3rd Student Computer Science Research Conference
• Computer Science and Informatics: Yesterday for Tomorrow
• Human-Computer Interaction in Information Society
• Middle-European Conference on Applied Theoretical Computer Science (MATCOS 2016).
The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e. the Slovenian chapter of the ACM, SLAIS, DKZ and the second national engineering academy, the Slovenian Engineering Academy. In the name of the conference organizers we thank all the societies and institutions, and particularly all the participants for their valuable contributions and their interest in this event, and the reviewers for their thorough reviews. For the fourth year, the award for life-long outstanding contributions will be presented in memory of Donald Michie and Alan Turing. The Michie-Turing award will be given to Prof. Tomaž Pisanski for his life-long outstanding contribution to the development and promotion of the information society in our country. In addition, an award for current achievements will be given to Prof. Blaž Zupan. The information lemon goes to another fall in the Slovenian international ratings on information society, while the information strawberry is awarded for the information system at the Pediatric Clinic. Congratulations!
Bojan Orel, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair
KONFERENČNI ODBORI / CONFERENCE COMMITTEES
International Programme Committee: Vladimir Bajic (South Africa), Heiner Benking (Germany), Se Woo Cheon (South Korea), Howie Firth (UK), Olga Fomichova (Russia), Vladimir Fomichov (Russia), Vesna Hljuz Dobric (Croatia), Alfred Inselberg (Israel), Jay Liebowitz (USA), Huan Liu (Singapore), Henz Martin (Germany), Marcin Paprzycki (USA), Karl Pribram (USA), Claude Sammut (Australia), Jiri Wiedermann (Czech Republic), Xindong Wu (USA), Yiming Ye (USA), Ning Zhong (USA), Wray Buntine (Australia), Bezalel Gavish (USA), Gal A. Kaminka (Israel), Mike Bain (Australia), Michela Milano (Italy), Derong Liu (Chicago, USA), Toby Walsh (Australia)
Organizing Committee: Matjaž Gams (chair), Mitja Luštrek, Lana Zemljak, Vesna Koricki, Mitja Lasič, Robert Blatnik, Aleš Tavčar, Blaž Mahnič, Jure Šorn, Mario Konecki
Programme Committee: Bojan Orel (chair), Nikolaj Zimic (co-chair), Franc Solina (co-chair), Viljan Mahnič (co-chair), Cene Bavec (co-chair), Tomaž Kalin (co-chair), Jozsef Györkös (co-chair), Tadej Bajd, Jaroslav Berce, Mojca Bernik, Marko Bohanec, Ivan Bratko, Andrej Brodnik, Dušan Caf, Saša Divjak, Tomaž Erjavec, Bogdan Filipič, Andrej Gams, Matjaž Gams, Marko Grobelnik, Nikola Guid, Marjan Heričko, Borka Jerman Blažič Džonova, Gorazd Kandus, Urban Kordeš, Marjan Krisper, Andrej Kuščer, Jadran Lenarčič, Borut Likar, Janez Malačič, Olga Markič, Dunja Mladenić, Franc Novak, Vladislav Rajkovič, Grega Repovš, Ivan Rozman, Niko Schlamberger, Stanko Strmčnik, Jurij Šilc, Jurij Tasič, Denis Trček, Andrej Ule, Tanja Urbančič, Boštjan Vilfan, Baldomir Zajc, Blaž Zupan, Boris Žemva, Leon Žlajpah
KAZALO / TABLE OF CONTENTS
Izkopavanje znanja in podatkovna skladišča (SiKDD) / Data Mining and Data Warehouses (SiKDD) ................
1 PREDGOVOR / FOREWORD ................................................................................................................................. 3 PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ..................................................................................... 4 Near Real-Time Transportation Mode Detection Based on Accelerometer Readings / Urbančič Jasna, Bradeško Luka, Senožetnik Matej ...................................................................................................................... 5 Application of Advanced Analytical Techniques in Support of Socio-Economic Impact Assessment of Innovation Funding Programmes / Berčič Katja, Cattaneo Gabriel a, Karlovčec Mario, Fuart Flavio, Cerle Gaber ........................................................................................................................................................ 9 Information Flow between News Articles: Slovene Media Case Study / Choloniewski Jan, Leban Gregor, Maček Sebastijan, Rehar Aljoša ...................................................................................................................... 13 Data Analytics in Aquaculture / Pita Costa Joao, Rihtar Matjaž ........................................................................... 17 Modeling Probability of Default and Credit Limits / Herga Zala, Rupnik Jan, Škraba Primož, Fortuna Blaž ....... 21 Big Data Analysis Combining Website Visit Logs with User Segments and Website Content / Kladnik Matic, Fortuna Blaž, Moore Pat ........................................................................................................................ 25 Visual and Statistical Analysis of VideoLectures.NET / Novak Erik, Novalija Inna .............................................. 29 Spatio-Temporal Clustering Methods / Senožetnik Matej, Bradeško Luka, Kažič Blaž, Mladenić Dunja, Šubic Tine ......................................................................................................................................................... 33 Indeks avtorjev / Author index ................................................................................................................................ 37 v vi Zbornik 19. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2016 Zvezek D Proceedings of the 19th International Multiconference INFORMATION SOCIETY – IS 2016 Volume D Izkopavanje znanja in podatkovna skladišča (SiKDD) Data Mining and Data Warehouses (Sikdd) Uredila / Edited by Dunja Mladenić, Marko Grobelnik 10. oktober 2016 / 10 October 2016 Ljubljana, Slovenia 1 2 PREDGOVOR Tehnologije, ki se ukvarjajo s podatki so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bil več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskem procesiranju ampak tudi analitskim vpogledom v podatke – pojavilo se je t.i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line-Analytical-Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore in na vizualen način preverja in išče izstopajoče situacije. Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz. 
z drugimi besedami to, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve. FOREWORD Data driven technologies have significantly progressed after mid 90’s. The first phases were mainly focused on storing and efficiently accessing the data, resulted in the development of industry tools for managing large databases, related standards, supporting querying languages, etc. After the initial period, when the data storage was not a primary problem anymore, the development progressed towards analytical functionalities on how to extract added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with OnLine-Analytical-Processing entered as a usual part of a company’s information system portfolio, requiring from the user to set well defined questions about the aggregated views to the data. Data Mining is a technology developed after year 2000, offering automatic data analysis trying to obtain new discoveries from the existing data and enabling a user new insights in the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis, Data, Text and Multimedia Mining, Semantic Technologies, Link Detection and Link Analysis, Social Network Analysis, Data Warehouses. Dunja Mladenić, Marko Grobelnik 3 PROGRAMSKI ODBOR / PROGRAMME COMMITTEE Dunja Mladenić Marko Grobelnik 4 Near Real-Time Transportation Mode Detection Based on Accelerometer Readings Jasna Urbančič, Luka Bradeško, Matej Senožetnik Jožef Stefan Institute and Jožef Stefan International Postgraduate School Jamova cesta 39 1000 Ljubljana, Slovenia ABSTRACT public transportation with buses and trains can be a good This paper describes a method for automatic transportation alternative to private vehicles. mode detection based on smartphone sensors. Our approach is designed to work in real-time as it only requires 5s of sen- As the main smartphone APIs already support fine-grained sor readings for the detection. Because we used accelerom- classification of non-motorized forms of transportation [2, eter instead of GPS signal it uses less battery power and is 3], we focused on distinguishing means of motorized trans- therefore more user and phone-friendly. For the mode detec- port, specifically cars and buses, as the majority of passen- tion we use multiple support vector machine models which ger traffic in Ljubljana represent cars and buses. Trains and enable us distinguishing between multiple modes (bus, train, motorbikes are not that common, whereas subway and tram car). Before the classification, raw measurements are pre- infrastructures do not exist. Our goal is to recognize each processed in order to cancel out the constant acceleration mode of transportation in near real time while mobile phone that is caused by the force of gravity. The results of the users are traveling. paper are promising and are based on the collected training data from approximately 20 hours of driving on trains and 2. RELATED WORK public buses in Ljubljana. 
Ever since smartphones appeared and gained accessibility there has been a lot of research activity for their usage in Keywords user activity recognition and transportation mode detection. activity recognition, support vector machine classification, While the first attempts to recognize user activity were done accelerometer before smartphones, the real effort in that direction started with the development of mobile phones having built-in sen- 1. INTRODUCTION sors [7]. Besides GPS sensors, also GSM triangulation and Nowadays most smartphones have built-in sensors that mea- local area wireless technology (Wi-Fi) can be employed for sure motion, acceleration, orientation, and various other en- the purpose of transportation mode detection. However its vironmental conditions with quite high precision and sam- accuracy is relatively low compared to GPS, therefore we pling frequency. This can be used with great success in deem these out of scope of this paper[8]. everyday challenges, for example tracking and routing ap- plications. It has been proven that smartphone sensors are Latest state of the art research is focused on transporta- useful in monitoring three-dimensional device movement and tion mode detection based on GPS signal and/or accelerom- positioning [1], and also user’s activity detection, which is eter data. Approaches that rely solely on GPS trajectories also in the domain of this paper. require GPS signal of high-quality, whereas phones GPS receiver is generally severely shielded during daily activi- Mobile operating system developers are aware of such ap- ties [10]. This may occur during travel underground, inside plications, therefore their APIs include activity recognition stations, or even when user is not sufficiently close to a win- packages. However, they detect only a few modes - still, dow when traveling in a vehicle [5], and results in loss of walking, running, cycling, and in vehicle. Such coarse-grained positioning information. Another known issue when using classification is not enough for tracking and routing pur- GPS signal on mobile device is high power consumption [5], poses, specially in use-cases for urban environments, where which is especially not pleasing in the case of longer com- mutes. Both of these two issues suggest that the accelerom- eter sensor is more appropriate for activity detection. Another advantage of using accelerometer data over GPS signal is that it does not require additional external data. Many researches using GPS data used external data, such as GIS data on bus stations, bus routes and railway lines [9]. 5 Figure 1: Amplitudes of raw accelerometer data for Figure 2: Amplitudes of preprocessed accelerometer different means of motorized transportation. data for different means of motorized transporta- tion. 3. TRANSPORTATION MODE DETECTION For the purpose of collecting accelerometer data we extended The gravity estimation algorithm works as follows: for a the GPS tracking mobile application with the accelerometer chosen sampling interval (in our case 1s), obtain an esti- measuring ability. The phone sensor measures acceleration mate of the gravity component on each axis by averaging forces in m/s2 for all of the three physical axes (x, y, z). The all measurements in the interval on that axis [6]. After ob- sampling rate is 100Hz (1 measurement every 10 ms). 
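For illustration, the gravity-estimation procedure described here can be written in a few lines. The paper does not publish its preprocessing code, so the snippet below is only a minimal NumPy sketch of the per-axis averaging and subtraction (the function name and defaults are illustrative, not the authors'):

```python
import numpy as np

def remove_gravity(samples, fs=100, gravity_window_s=1.0):
    """Subtract a per-axis gravity estimate from raw accelerometer readings.

    samples: array of shape (n, 3) with x, y, z readings in m/s^2, sampled at
    fs Hz (100 Hz in the paper). Following the description above (and Mizell
    [6]), gravity on each axis is approximated by the mean of that axis over
    each 1 s interval and subtracted from the readings in that interval,
    leaving only the dynamic acceleration component.
    """
    samples = np.asarray(samples, dtype=float)
    win = max(1, int(fs * gravity_window_s))
    dynamic = np.empty_like(samples)
    for start in range(0, len(samples), win):
        chunk = samples[start:start + win]
        gravity_estimate = chunk.mean(axis=0)      # per-axis constant component
        dynamic[start:start + win] = chunk - gravity_estimate
    return dynamic
```

Applied to a recording, such a step yields the dynamic acceleration whose amplitudes are shown in Figure 2.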
To taining the estimates, we subtracted the gravity component increase the diversity of the training data-set, measurements from all of the entries on corresponding axis in given time were acquired in multiple ways: interval. Through this we obtained only the dynamic ac- celeration component of the signal. Amplitudes of dynamic accelerometer signal are shown in Figure 2. • Person is collecting data while traveling by the car and stops the collection at the destination. 3.2 Classification • Person is collecting data while traveling by the bus and After preprocessing the accelerometer readings we extracted stops the collection on exit. features for the classification process. We used mean, vari- ance, skewness, 5th, and 95th percentile of acceleration data • Person is traveling by the train and is collecting data on all three axes. We also split the acceleration into posi- until the arrival to the final destination tive part, which indicates that the velocity of movement in • Person is collecting data while driving a motorbike and that direction is increasing, and negative part, which indi- stops the collection at the destination. cates that the velocity is decreasing, and calculated the same statistics on these two parts. We collected approximately 20 hours of travelling measure- We used support vector machine (SVM) classifier as it was ments, with the travel modes distributed as follows: car − previously successfully used in similar work [8]. The imple- 57%, bus−32%, train−11%, motorbike−0.1%. Amplitudes mentation was SVM classifier (SVC) from QMiner package. of the raw accelerometer data are shown in Figure 1. QMiner is an open source analytics platform for performing large scale data analysis written in C++ and exposed via a 3.1 Preprocessing Javascript API [4]. In order to reduce the computation time and to have faster response time for real-time classification, we split the recorded First we focused on the binary classification of car and bus accelerometer signal into smaller pieces that do not exceed transportation versus the rest. We trained binary classi- 5s timewise. This enables us to work on chunks of record fiers for each of the labels in one against the rest manner. that span only through 5s or less. This additionally pre- That means that examples labled with this particular label serves battery life, saves space, and reduces usage of mobile represented positive examples, whereas all other examples, data. regardless of class represented negative examples. However, we also did one against one classification for each pair of Acceleration measurements are correlated with the orien- labels. That means that examples of one class were marked tation of the phone in 3D space, as gravity is measured to- as positive, examples of another class were marked as nega- gether with the dynamic acceleration caused by phone move- tive, and the rest of the learning set was filtered out. Later, ments. Thus we have to be able to separate the constant we extended this to support multi-class classification. For acceleration caused by the gravity and the dynamic part of multi-class classification we used binary models and com- the acceleration. bined their predictions based on the distance between the 6 Table 1: Table of all extracted features. 
Acceleration data Features X axis Mean (Total, Acceleration, Deceleration) Y axis Standard deviation (Total, Acceleration, Deceleration) Z axis Skewness (Total, Acceleration, Deceleration) Amplitude 5th percentile (Total, Acceleration, Deceleration) 95th percentile (Total, Acceleration, Deceleration) Total number of features 60 Table 2: Classification accuracy, precision, recall, Table 3: Confusion matrix for classification for 3 and F1 score for binary classifiers of different trans- classes with 10-fold cross-validation. portation modes as results of 10-fold cross valida- True\ Pred. Car Bus Train UC tion. Car 0.818 0.012 0.008 0.162 Accuracy Precision Recall F1 score Bus 0.198 0.219 0.072 0.511 Car 0.855 0.852 0.910 0.880 Train 0.118 0.042 0.344 0.496 Bus 0.720 0.620 0.694 0.655 Train 0.876 0.726 0.671 0.697 Table 4: Classification accuracy, precision, recall, and F1 score for classification with 3 classifiers with separating hyper plane and the test sample. 10-fold cross-validation. Accuracy Precision Recall F1 score 4. EVALUATION Car 0.823 0.826 0.818 0.827 Evaluation section is divided into two parts. In the first one Bus 0.725 0.844 0.219 0.347 are presented the results of experiments with one against the Train 0.859 0.683 0.181 0.286 rest classification, whereas in the second part we discuss the Average 0.803 0.784 0.401 0.535 results of one against one classification. In both parts we considered two different scenarios. In the first one we tested if our approach can recognize a specific transportation mode plan behind UC is, that we ask the application providing from all the others (one-vs-all) or if SVC can distinguish accelerometer data for new sample, which can help us re- between two specific modes of transportation (one-vs-one). classify into the proper class. The results in Table 3 and In the second, we used previously obtained binary models Table 4 are not surprising as the majority of cars is classified to classify to three(3) different classes. Our main focus was as cars and also most of misclassified cars are instances that recognizing traveling by cars and buses as these represent belong to either none or more than one class. In contrast majority of the passenger traffic in Ljubljana and therefore to cars, proportions of correctly classified buses and trains the majority of our training data. are smaller than the proportions of UC for these two classes, which shows that our approach to combining predictions for We measured performance of the models with classification multiple classifiers might not be the best. accuracy, precision and recall. Furthermore, we estimated a harmonic mean of precision and recall with the F1 score. 4.2 One against one We used 10-fold cross validation to tune the parameters of We did similarly for one-vs-one binary classification. Results each binary model. of this are shown in Table 5, which shows that cars are very well distinguishable from trains and vice versa. Buses 4.1 One against all are less distinguishable from cars and trains, however the Binary classification (one-vs-all) was done for each of the accuracy and F1 score of all the classifications are still above three classes (car, bus and train). Classification accuracy, 0.8. precision and recall for the most suitable values of parame- ters are listed in Table 2. We used these six binary classifiers for multi-class classifica- tion. Confusion matrix and accuracy, precision, recall and We got the best results (accuracy, precision and recall) for F1 score are listed in Tables 6 and 7. 
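To make the feature set of Table 1 and the one-against-the-rest voting scheme concrete, the following is a minimal sketch. The paper's classifier is the SVC from the QMiner platform [4]; scikit-learn's LinearSVC is used here purely as an illustrative stand-in, and the function names are ours. The UC (unable to classify) rule follows the policy described in the text: a window positive for none of the binary models, or for more than one, is left unclassified.

```python
import numpy as np
from scipy.stats import skew
from sklearn.svm import LinearSVC

MODES = ["car", "bus", "train"]

def window_features(window):
    """The 60 features of Table 1 for one preprocessed 5 s window of shape (n, 3)."""
    window = np.asarray(window, dtype=float)
    amplitude = np.linalg.norm(window, axis=1)
    signals = [window[:, 0], window[:, 1], window[:, 2], amplitude]

    def stats(x):
        # mean, standard deviation, skewness, 5th and 95th percentile
        if x.size == 0:   # e.g. the "deceleration" part of the non-negative amplitude
            return [0.0] * 5
        return [x.mean(), x.std(), skew(x), np.percentile(x, 5), np.percentile(x, 95)]

    feats = []
    for s in signals:
        for part in (s, s[s > 0], s[s < 0]):   # total, acceleration, deceleration
            feats.extend(stats(part))
    return np.array(feats)                     # 4 signals x 3 parts x 5 stats = 60

def train_one_vs_rest(X, y):
    """One binary SVM per mode, trained one against the rest (X: features, y: labels)."""
    y = np.asarray(y)
    return {m: LinearSVC(C=1.0).fit(X, (y == m).astype(int)) for m in MODES}

def classify(models, x):
    """Combine binary decisions; none or more than one positive vote gives UC."""
    votes = [m for m in MODES if models[m].predict(x.reshape(1, -1))[0] == 1]
    return votes[0] if len(votes) == 1 else "UC"
```

The confusion matrices discussed next summarize how such combined predictions behave under 10-fold cross-validation.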
Tables show that clas- car travel detection. There was some drop in the perfor- sification accuracy, precision, recall and F1 score are higher mance of bus travel detection, and even bigger drop for the than in case of one against all classification. Confusion ma- train detection. We assume that the main cause for perfor- mance drop is smaller training data set for bus and train. We plan to resolve this with additional data-set collection Table 5: Accuracy / F1 score for one-vs-one bi- as part of the future work. nary classification. Rows represent positive exam- ples, whereas columns are negative examples. For the multi-class classification we used binary models from Car Bus Train Table 2. We mapped results into 4 classes (car, bus, train and UC - unable to classify). If according to the binary Car 0.848/0.889 0.943/0.962 classification, an instance belongs to none of the classes or Bus 0.856/0.832 0.858/0.896 more than one, we label it as UC (unable to classify). The Train 0.934/0.883 0.815/0.760 7 [4] B. Fortuna, J. Rupnik, J. Brank, C. Fortuna, Table 6: Confusion matrix for classification for 3 V. Jovanoski, and M. Karlovcec. Qminer: Data classes with 10-fold cross-validation. analytics platform for processing streams of structured True\ Pred. Car Bus Train and unstructured data. In Software Engineering for Car 0.893 0.088 0.019 Machine Learning Workshop, Neural Information Bus 0.282 0.573 0.145 Processing Systems, 2014. Train 0.167 0.162 0.671 [5] S. Hemminki, P. Nurmi, and S. Tarkoma. Accelerometer-based transportation mode detection on smartphones. In Proceedings of the 11th ACM Table 7: Classification accuracy, precision, recall, Conference on Embedded Networked Sensor Systems, and F1 score with 10-fold cross-validation. Accuracy Precision Recall F1 score page 13. ACM, 2013. Car 0.825 0.785 0.893 0.835 [6] D. Mizell. Using gravity to estimate accelerometer Bus 0.787 0.724 0.573 0.640 orientation. In Proc. 7th IEEE Int. Symposium on Train 0.886 0.672 0.671 0.671 Wearable Computers (ISWC 2003), page 252. Average 0.832 0.727 0.712 0.716 Citeseer, 2003. [7] S. Reddy, M. Mun, J. Burke, D. Estrin, M. Hansen, and M. Srivastava. Using mobile phones to determine transportation modes. ACM Transactions on Sensor trix shows that nearly 90% of cars is classified correctly, Networks (TOSN), 6(2):13, 2010. whereas for buses and trains the percentage of correctly clas- sified instances drops to 57% and 67% respectively. It also [8] M. A. Shafique and E. Hato. Use of acceleration data shows that trains are equally likely to be miss-classified as for transportation mode prediction. Transportation, cars and the buses. In contrast to trains, buses are more 42(1):163–188, 2015. often miss-classified as cars than trains. In comparison with [9] L. Stenneth, O. Wolfson, P. S. Yu, and B. Xu. one-vs-all multi-class classification, values of recall and F1 Transportation mode detection using mobile phones score are much higher, whereas accuracy and precision for and gis information. In Proceedings of the 19th ACM these two approaches are comparable. SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 54–63. ACM, 2011. 5. CONCLUSION [10] P. Widhalm, P. Nitsche, and N. Brändie. Transport In the presented work we showed that it is possible to detect mode detection with realistic smartphone sensor data. transportation mode using support vector machine, with In Pattern Recognition (ICPR), 2012 21st short readings of accelerometer signal. 
This proves that International Conference on, pages 573–576. IEEE, near real-time activity detection with fine-grained motorized 2012. transportation modality is possible. Nonetheless, there are still some issues for the future work. First one is regarding unbalanced data set as we will have to collect more data, especially for train and motorbike de- tection. Secondly, our task will be improving binary clas- sification regarding buses as this class was most often mis- classified. However, this might be also caused by our policy for combining multiple binary classification results, which is also something we will have to work on. 6. ACKNOWLEDGMENTS This work was supported by the Slovenian Research Agency and the ICT program of the EC under project OPTIMUM (H2020-MG-636160). 7. REFERENCES [1] Sensors overview. https://developer.android.com/ guide/topics/sensors/sensors_overview.html, 2016. [Online; accessed 25-August-2016]. [2] ActivityRecognition. https://developers.google. com/android/reference/com/google/android/gms/ location/ActivityRecognition, 2016. [Online; accessed 25-August-2016]. [3] CMMotionActivity. https://developer.apple.com/ library/ios/documentation/CoreMotion/ Reference/CMMotionActivity_class/index.html#// apple_ref/occ/cl/CMMotionActivity, 2016. [Online; accessed 25-August-2016]. 8 Application of Advanced Analytical Techniques in Support of Socio-economic Impact Assessment of Innovation Funding Programmes Katja Berčič Gabriella Cattaneo Mario Karlovčec Jožef Stefan Institute IDC Italia Jožef Stefan Institute Jamova cesta 39 Viale Monza 14 Jamova cesta 39 1000 Ljubljana 20127 Milano 1000 Ljubljana Slovenia Italy Slovenia katja.bercic@ijs.si gcattaneo@idc.com mario.karlovcec@ijs.si Flavio Fuart Gaber Cerle Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39 Jamova cesta 39 1000 Ljubljana 1000 Ljubljana Slovenia Slovenia flavio.fuart@ijs.si gaber.cerle@ijs.si ABSTRACT Keywords This paper gives an introduction to the European Programme impact assessment, network visualisation, statistical meth- for Internet Innovation and the needs for advanced analyt- ods ical techniques in assessment of the socio-economic impact of funded projects. A technical overview and architecture of 1. INTRODUCTION developed IT tools follows. Broadly, two set of tools were The Future Internet Public-Private Partnership (FI-PPP) explored. (1) An on-line assessment environment that pro- is the European programme for Internet innovation. Phase vides programme managers and companies with feedback Three of the FI-PPP funding is targeting more that 1000 about their potential socio-economic impact. For this set of entrepreneurs, start-ups or SMEs in an attempt to multi- tools the emphasis is on visualisation techniques, therefore ply the uptake and impact of the technologies developed in a detailed rationale and scientific background for those is previous phases1. Under that framework the Future Inter- provided with usage examples. (2) Statistical modules for net Impact Assurance (FI-MPACT)2 Support Action was identification of funding approaches that, potentially, pro- funded in order to collect and assess the qualitative and vide best results (”best practices”) have been developed and quantitative evidence of the potential socio-economic impact applied to available data. 
Through one use-case we have of the programme by measuring and projecting market sec- shown that the proposed IT system is useful in measur- tor economic potential, stakeholder take-up and technolog- ing the impact of innovation funding programmes, detecting ical impact of Phase III SME Accelerator projects to 2020. good management practices and giving support to funded Accelerator projects aim at investing in the strongest start- projects. The whole system has been published on a public ups (”subgrantees”) across Europe and create an ecosystem repository with an open source license, giving opportunity where entrepreneurs and business incubators meet. to re-use, customize and integrate the reporting system into other impact assessment environments. Based on Key Performance Indicators (KPI) elaborated in the Impact Assessment Guidebook [3], a set of Impact As- Categories and Subject Descriptors sessment tools were developed to collect empirical data H.4.0 [Information Systems Applications]: General— from subgrantees and other interested initiatives. In this reporting, statistical analysis, visualisation, networks context, developed IT tools provide an automated assess- ment system, driven by a set of KPIs, allowing Accelera- General Terms tor projects, other start-ups and entrepreneurs to measure Algorithms the potential impact compared to industry standards and the global community of FI-PPP projects. By benchmark- ing their progress in relation to different business processes, they can identify areas where improvements are needed and measure progress. Furthermore, a set of scripts for statistical data process- ing were combined in Accelerator Benchmarking tools 1https://www.fi-ppp.eu/ 2http://cordis.europa.eu/project/rcn/191426 en.html 9 to support the identification of accelerators best practices. tool is implemented as a interactive scatter plot that posi- This quantitative analysis was used to support, guide and tions projects in two-dimensional space defined with per- strengthen the expert judgement of qualitative indicators of formance indicators. This enables comparison of projects best performing accelerators. with different performance levels against different pairs of indicators. Performance indicators supported by the tool In this paper we present the developed IT tools [2], their are: innovation, market focus, feasibility, market needs and usage in practice, some results and possible use of these mattermark growth. Figure 1 gives an example of perfor- tools in future. mance graph where the selected project is compared with other high performance projects, based on market focus and innovation variables. 2. IMPACT ASSESSMENT TOOL The Impact Assessment tool3 was focused around funded subgrantees to facilitate mapping of this portfolio, to con- 2.2 Multi-attribute Based Similarity Tool tribute to the overall impact assessment of the FI-PPP Phase In order to provide both the sub-grantees and accelerators 3 and assist in forecasting the potential impact of this inter- better insight in project’s performance, a tool that shows vention up to 2020. The Self-Assessment tool is open to all the selected project’s position in relation to other projects interested parties and respondents can undertake the survey was created. In this tool, similarities among projects are at different stages to measure their progress. determined using a wide range of attributes that describe projects. 
This gives users a chance to form their own judge- The Impact/Self-Assessment tool provides a start-up sanity ment about “good” or “bad” similarities. The tool is imple- check, by calculating KPIs for Innovation, Market Focus, mented using several well-known data and network analysis Feasibility and Market Needs. Instant feedback is provided methods, and visualized as a network graph. The visual- by benchmarking respondent’s scores with average scores of ization gives sub grantees and their mentors/reviewers in- his/her peers, or a group of most successful peers. sight into how they compete with other projects and ideas, identify possible similarities, find opportunity windows or search for possible partners. The tool is implemented as an 2.1 Key Performance Indicators Benchmark- open API using QMiner [5] data analytics platform for pro- ing cessing large-scale real-time streams containing structured and unstructured data. The approach used for implement- ing multi-attribute based similarity tool consists of following steps: (1) Importing data and feature extraction; (2) Com- puting the main similarity graph; (3) Generating custom graph for selected project; and (4) Visualization. In the first step, project data is imported into the system and features are extracted from the data. Prior to importing, data is transformed into JSON file format. The file contains a configuration part that determines which attributes of the data will be used as features in further analysis. This makes the system flexible for feature-set changes. The feature-set consists of numerical, categorical and textual data obtained from questionnaires. Feature extractor for textual data ap- plies English stop-words removal in the tokenization phase and normalizes the word frequencies using TF-IDF weight- ing. The main similarity graph shows project similarities based on multiple attributes in form of a two dimensional net- work graph. This graph is the basis for constructing other custom graphs where a particular selected project is in the Figure 1: Performance graph based on market fo- focus. The graph is constructed in two steps. First, the cus and innovation variables for a selected project multidimensional vector representation of projects obtained represented with a large violet circle. Different in the feature extraction phase is transformed into two di- shapes correspond to performance levels: rhom- mensional representation by using Multidimensional scaling bus for Hight Performance Initiatives, squares and (MDS) [7] method. A similar approach of dimensionality crosses for a small selection promising initiatives reduction was used in [4] for visualization of a text docu- that received further funding for dissemination ac- ment corpus. In the second step of graph construction, the tivities. two dimensional projections of projects are connected using Delaunay triangulation [9]. This dynamic benchmarking against other respondents gives entrepreneurs and their mentors a tool to monitor progress A custom graph is such a graph where one project is in and identify areas where additional support is required. The the main focus. This can be a project selected from the list of already imported projects. It can also be generated 3https://github.com/JozefStefanInstitute/fi-impact for a new project that is in the preparation phase, to see 10 egorical indicators, CDF diagrams for practice scores and score histograms). All the diagrams are produced in two versions - with all sub-grantees and per-accelerator. 
Results are grouped into three main categories, as described below. The diagram in Figure 3 shows the structure of the tool. The degree of connectivity tables are a result of analysis of FI-PPP project partnerships. They contain counts of dif- ferent types of connections. Every pair of projects can have more than one connection, where each such connection be- tween two projects in the three FI-PPP phases represents a partner that appeared in both. Correlation analysis is performed on all numerical acceler- ator properties and all numerical subgrantee performance indicators by correlating each possible pair. The results in- clude a heatmap of correlations, a table of correlations and a corresponding table with sizes of samples on which the correlations were computed. Figure 2: Multi-attribute based similarity graph for an arbitrary project. The selected project is repre- Best practices identification produces notched box plots for sented with a black dot in centre of the figure. categorical accelerator properties, CDF plots for binary ac- celerator properties, correlation outputs for numerical ac- celerator properties (heatmap and tables) and histograms where is it positioned in relation to other projects. A cus- for performance indicators. These can then provide insight tom graph is generated by first extracting features from the into which practices coincided with better subgrantee per- selected project using the same feature extractors that were formance. used for computing the main similarity graph. In second step, distance between each of the imported project and the We define a good practice as an activity performed by one or selected project is computed (either cosine or euclidean dis- more of the Accelerators’ consortia according to their accel- tance can be used). Based on user selected threshold, an eration plans, which based on objective evidence, is shown edge is created between more similar projects, while less sim- to have contributed to the good performance of subgrantees. ilar projects are excluded from the graph. Finally, projects Here, good performance of the subgrantees is primarily their selected for the custom graph are connected based on the market success (measured in terms of positive dynamics main similarity graph. of revenue growth and customer growth); their ability to convince potential investors and collect additional funding The final step of the similarity tool is graph visualization. (“traction”); if they are not yet on the market, their market The graph is visualized using SigmaJS4 library and a plug- readiness (measured by FI-IMPACT’s KPIs scores). in that implements ForceAtlas2 [6] algorithm. The lay- outing algorithm mimics the physical attraction and repul- The full report about the usage of developed tools, com- sion forces, placing well connected and similar project closer bined with qualitative data based on face-to-face interviews on the two dimensional plane. An example of a similarity with Accelerators coordinators and interpretation of results graph for a project is given in Figure 2. is given in [1]. Some insights were drawn from the results computed by the benchmarking tool, while some of the con- 3. ACCELERATOR BENCHMARKING/BEST clusions formulated with other methods were confirmed by it [1]. 
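The pairwise correlation analysis itself is conceptually simple. The actual FI-IMPACT modules are written in R [8]; purely for illustration, a pandas sketch of the same computation is given below (the "accelerator" key column and the two-table layout are assumptions about the data, not the project's actual schema, and column names are assumed distinct across the two tables):

```python
import pandas as pd

def correlation_report(accelerators: pd.DataFrame, subgrantees: pd.DataFrame) -> dict:
    """Correlate every numerical accelerator property with every numerical
    subgrantee performance indicator, as in the benchmarking tool.

    Both data frames are expected to share an 'accelerator' key column; all
    other numeric columns are treated as properties / indicators. Returns the
    correlation table plus the sample sizes each value is based on.
    """
    merged = subgrantees.merge(accelerators, on="accelerator", how="left")
    props = accelerators.drop(columns=["accelerator"]).select_dtypes("number").columns
    kpis = subgrantees.drop(columns=["accelerator"]).select_dtypes("number").columns

    corr = pd.DataFrame(index=props, columns=kpis, dtype=float)
    n = pd.DataFrame(index=props, columns=kpis, dtype=float)
    for p in props:
        for k in kpis:
            pair = merged[[p, k]].dropna()          # pairwise-complete observations
            n.loc[p, k] = len(pair)
            corr.loc[p, k] = pair[p].corr(pair[k])  # Pearson correlation
    return {"correlations": corr, "sample_sizes": n}
```

Outputs of this kind (correlation and sample-size tables, rendered as heatmaps) are what the findings reported below were read from.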
For example the geographical scope of the partnerships PRACTICES REPORTING TOOLS did not seem to make a major difference, while the presence Accelerator Benchmarking/Best Practices Reporting Tools of professional accelerators had a strong positive influence are written in R [8] programming language and software on most subgrantee performance indicators. environment for statistical computing and graphics. In FI- IMPACT, several R modules5 have been developed to auto- There were several results computed in the benchmarking matically produce statistical reports about Accelerator Bench- tool that did not provide any meaningful insights, but did marking and Best Practices based on FI-IMPACT databases. confirm that the data exhibits some expected properties. For example, unsurprisingly, the benchmarking tool showed The databases are imported and merged into an R data a strong positive correlation between investment per sub- frame. The tool then produces correlation heatmaps, prac- grantee and KPIs measuring feasibility and innovation. tice scores with accompanying mean difference t-test results and FI PPP phase projects connectivity information. Most 4. CONCLUSIONS of the results are produced in final-form CSV files. The tool also outputs several sets of diagrams (boxplots for all cat- We have shown how a set of IT tools and computational methods can be used to assess the socio-economic impact 4http://sigmajs.org/ of a big investment programme, such as FI-PPP European 5https://github.com/JozefStefanInstitute/FI-Impact-R- programme for Internet innovation. On one hand, the im- analysis plementation followed a strict methodology used for bench- 11 Figure 3: Benchmarking tool inputs and outputs marking through questionnaires and calculation of respec- [accessed 12 September 2016]. tive KPIs, on the other hand, more exploratory techniques [2] FI-IMPACT Project. Fi-impact online assessment were used for statistical analysis and visualisation. environment, 2016. Deliverable D4.3 Available from http://www.fi-impact.eu/page/docdownload/479/ While it is true that many of the results of the statistical [accessed 12 September 2016]. analysis implemented in the Accelerator benchmarking tools [3] FI-IMPACT Project. Impact assessment guidebook, v2, did not show statistically significant new results, it is also 2016. Deliverable D2.1.v2 Available from true that the results displayed most of the expected correla- http://www.fi-impact.eu/page/docdownload/474/ tions. The statistical analysis explains little of the variations [accessed 12 September 2016]. in performance of the subgrantees. In our opinion, the main [4] B. Fortuna, M. Grobelnik, and D. Mladenic. weakness of the statistical correlation analysis was the very Visualization of text document corpus. Informatica limited available data on actual market performance. How- (Slovenia), 29(4):497–504, 2005. ever, weak signals, combined with with the results of the [5] B. Fortuna, J. Rupnik, J. Brank, C. Fortuna, qualitative interviews, point to the positive role of profes- V. Jovanoski, M. Karlovcec, B. Kazic, K. Kenda, sional accelerators within consortia, and positive impacts of G. Leban, D. Mladenic, A. Muhic, B. Novak, E. Novak, practices such as workshops, matchmaking and providing J. Novljan, M. Papler, L. Rei, B. Sovdat, and L. Stopar. gateways to further funding. Qminer: Data analytics platform for processing streams of structured and unstructured data. 
In Software Impact Assessment and Similarity Graph benchmarking tools Engineering for Machine Learning Workshop, Neural have been deployed and made available to over 700 sub- Information Processing Systems, 2014. grantees (projects), their mentors an coordinators. The [6] M. Jacomy, T. Venturini, S. Heymann, and M. Bastian. tool was deemed very useful for long-term monitoring of Forceatlas2, a continuous graph layout algorithm for sub-grantees performance. In this respect, the FI-IMPACT handy network visualization designed for the gephi project consortium is drawing plans to assure long-term sus- software. PLoS ONE, 9(6):1–12, 06 2014. tainability of the deployed IT systems. [7] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. 5. ACKNOWLEDGEMENTS Psychometrika, 29(1):1–27, 1964. This work was supported by the the ICT Programme of the [8] R Core Team. R: A Language and Environment for EC under FI-IMPACT (FP7-ICT-FI-632840). Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. 6. REFERENCES [9] P. Su and R. L. S. Drysdale. A comparison of [1] FI-IMPACT Project. Analysis of accelerators’ good sequential delaunay triangulation algorithms. In practices annex 8.5 to deliverable d2.4 update of Proceedings of the Eleventh Annual Symposium on impact assesment and forecast, 2016. Annex 8.5 of D2.4 Computational Geometry, SCG ’95, pages 61–70, New Available from York, NY, USA, 1995. ACM. http://www.fi-impact.eu/page/docdownload/477/ 12 Information flow between news articles: Slovene media case study Jan Chołoniewski Gregor Leban Sebastijan Maček, Center of Excellence for Artificial Intelligence Aljoša Rehar Complex Systems Research, Laboratory, Slovenska Tiskovna Agencija, Faculty of Physics, Jožef Stefan Institute, Tivolska 48, 1000 Ljubljana, Warsaw University of Jamova 39, 1000 Ljubljana, Slovenia Technology, Slovenia {sm,ar}@sta.si Koszykowa 75, PL-00662, gregor.leban@ijs.si Warsaw, Poland choloniewski@if.pw.edu.pl ABSTRACT fully applied to determine a relation between two articles. We present results of a study on usage of text similarity mea- Possible relations that we want to determine are (a) there sures based on co-occurrence of words and phrases to classify is no relation, (b) they share a common source, or (c) one is a relation between a pair of news articles (i.e. no relation, based on the other one. To find the most efficient way to do both based on a common source, one based on the other). that, we calculated cosine similarity of ”bag of n-grams” rep- For each Slovenian article written in Slovene and published resentations of articles from Slovene media published on one online on 27th June 2016, we found the most similar release day with releases from Slovenska Tiskovna Agencija (STA; from the Slovenian Press Agency (STA) database to obtain Slovenian Press Agency) to preselect the most similar re- a list of candidate article-source pairs. Four experts from lease to each article, asked experts to annotate the candi- STA were asked to score the pairs, and their annotations date pairs, and compared results for different thresholds and were used to train classifiers and evaluate their accuracy. n-grams with the annotations. The rest of the paper is structured as follows: in Section 2 1. 
INTRODUCTION we highlight a process of obtaining experimental data (can- Propagating, exchanging, organizing and processing infor- didate source release matching, expert annotation study), mation are important parts of human social interactions on in Section 3 we describe applied methods and benchmark both micro- [1] and macro-level [2]. After years of local and parameters, then in Section 4 we present results of classi- nationwide scale, newspapers, press agencies and news out- fication study in two simplified cases; Section 5 contains a lets started to operate at global level using Internet. An easy discussion and possible improvements, and Section 6 sums and open access to their releases is desirable for news con- up the research. sumers and can be tracked e.g. with website traffic statistics or in social media. Article reuse by other publishers (autho- 2. DATA rized or not) is however not that straight-forward. Data for the study consist of randomly selected 469 articles out of 895 published on 27th June 2016 from 62 Slovene Combining natural language processing methods with data online news outlets as tracked by the EventRegistry [9]. gathered in the Internet allows to quantify and measure so- For each article, we have found the most similar one in the cial information processing phenomena [3, 4, 5, 6]. The ad- STA releases database in terms of cosine similarity of two- vancements in NLP might serve all types of text-based media , three- and four-word phrases ({2,3,4}-grams) occurrence (in particular online news outlets) to provide tools to track vectors with TF-IDF weighting (see section 3 for details). A spreading of their texts. histogram of obtained similarities is presented in Figure 1. About 75% (354 out of 469) of candidate pairs obtained a A tool that automatically finds articles based on a given ar- cosine similarity below 0.1, and 10% (47 out of 469) over 0.9. ticle might be useful for news outlets and press agencies to track usage of their releases and to find cases of plagiarism The pairs were scored by four experts from STA (A1, A2, or unauthorized use. Moreover, it might be applied to large A3, A4). They were asked to mark each pair with one of scale news spreading studies [3]. A software-assisted plagia- the following scores: rism detection is a well-known problem in an information retrieval field [7], and using text similarity-based methods is one of the most popular approaches [8]. To the best of our • NF – the proper source release has not been found knowledge, the following paper is the first published study despite it is present in the STA database, of plagiarism detection in Slovene media supported by pro- • NC – the proper source release has not been found and fessional press agency workers. it is not present in the STA database, The aim of the presented work is to check if text comparison • DS – the proper source release has been found (al- methods based on co-occurrence of phrases can be success- though it might be one of many sources of the article), 13 (2,4)-gram cosine similarity histogram 103 A1 A2 A3 A4 A1 100% 87% 86% 87% A2 87% 100% 89% 89% A3 86% 89% 100% 83% 102 A4 87% 89% 83% 100% (a) Raw annotations count 101 A1 A2 A3 A4 A1 100% 88% 86% 88% A2 88% 100% 91% 89% A3 86% 91% 100% 84% 100 A4 88% 89% 84% 100% 0.0 0.2 0.4 0.6 0.8 1.0 cosine similarity (b) Simplified A – DS and IDS merged Figure 1: A histogram of {2,3,4}-gram cosine similarities of A1 A2 A3 A4 candidate pairs with a logarithmic Y-axis. 
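Before turning to the evaluation, the candidate-matching step described in Sections 2 and 3 can be sketched as follows. The study does not publish its code, and details such as the Laplace smoothing differ from scikit-learn's defaults, so this is only an illustrative Python approximation (the function name and output format are ours); the default threshold is the one later reported for {2,3,4}-grams under simplification B:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_candidate_sources(articles, sta_releases, background_corpus, threshold=0.27):
    """For each article, return the most similar STA release and whether the
    similarity clears the decision threshold.

    articles, sta_releases, background_corpus: lists of raw document strings.
    The vectorizer only mirrors the paper's setup approximately: word
    {2,3,4}-grams, TF-IDF weights fitted on a background corpus, and terms
    occurring in more than 25% of documents discarded (max_df=0.25).
    """
    vectorizer = TfidfVectorizer(ngram_range=(2, 4), max_df=0.25)
    vectorizer.fit(background_corpus)

    article_vecs = vectorizer.transform(articles)
    release_vecs = vectorizer.transform(sta_releases)
    sims = cosine_similarity(article_vecs, release_vecs)   # shape: (articles, releases)

    results = []
    for row in sims:
        best = int(row.argmax())
        results.append({"release_index": best,
                        "similarity": float(row[best]),
                        "has_source": bool(row[best] >= threshold)})
    return results
```

The measures used to score such a classification are defined next.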
A1 100% 96% 99% 95% A2 96% 100% 99% 96% A3 99% 99% 100% 95% • IDS – the article and the proposed source release are A4 95% 96% 95% 100% both based on the same third party source. (c) Simplified B – IDS and NC merged Table 2: Agreement among annotators. In cases where the source was not found (NF), the annota- tors provided a link to the proper source release. In Table 1, we present basic statistics of the annotations 3. METHODS given by experts. We considered two methods of simplifying Articles and releases were mapped to ”bag of n-grams” rep- the annotations. The first one (A), merges DS and IDS resentations. Additionally, n-gram counts were transformed marks to discriminate between two classes – a given pair using term frequency-inverted document frequency (TF-IDF) contains pieces of the same information or is unrelated. The weighting trained on a corpus of 5,000 randomly selected second one (B), merges IDS with NC – the algorithm’s task Slovene articles stored in the EventRegistry published dur- is to check if one text is directly based on the other one. ing two weeks preceding the analyzed day. Terms which occurred in more than 25% of documents were discarded. person total NF NC DS IDS Laplace smoothing with α = 1 was applied to include terms which were not present in the corpus. A1 469 3 315 98 53 A2 469 2 358 97 12 For n = 1, ..., 5, weighed term vectors of Slovene articles A3 95 0 61 23 11 from 27th June 2016 were compared with all vectors of STA A4 95 0 70 20 5 releases published between 20th and 27th June 2016 to find candidate source releases. For each n, we tested classifiers Table 1: Basic statistics of raw candidate release-article with a threshold from 0.00 to 1.00 with steps of 0.01 to find pairs annotations by the STA experts. total – a number the threshold for which the method achieves the highest F1- of annotated pairs; NF – source not detected despite the score for A and B simplifications separately. The releases source release is in the STA archive; NC – no source release were compared with the list created using preselected pairs in the STA archive; DS – one article is a direct source of and experts’ comments. A source release for a given article the other; IDS – both documents based on the same third and a given threshold was considered as correctly found, source article. when it matched the one annotated by human and cosine similarity score was above the threshold. A given article was considered correctly classified if a source release was In Table 2, percentages of agreement among annotators are correctly found or if the article was correctly marked as not being presented for (a) raw annotations, (b) simplification having a source release in the STA database. A and (c) simplification B (see above). Parameters used to score the classification were accuracy, re- The annotators were sometimes non-unanimous when both call, precision, and F1-score. We used following definitions: articles in a pair had a common source (compare Table 2a and 2b, mean agreement = 87%). They were more consis- T P + T N tent when a release was a source of a given article (compare accuracy = Table 2a and 2c, mean agreement = 96%). T P + T N + F P + F N Additionally, because of score inconsistencies, the final list T P has been prepared after discussing problematic cases. 
The parameters used to score the classification were accuracy, recall, precision, and F1-score. We used the following definitions:

accuracy = (TP + TN) / (TP + TN + FP + FN)

recall = TP / (TP + FN)

precision = TP / (TP + FP)

F1 = 2 · (recall · precision) / (recall + precision)

where TP is the number of articles with a correctly found source, TN the number of articles correctly marked as not having a source in the STA database, FP the number of articles incorrectly marked as having a source in the database, and FN the number of articles incorrectly marked as not having a source in the database. Cases when articles had an incorrectly found source were counted separately as errors.

Each annotator could have scored each article-source pair differently, so the mean values and standard deviations of the parameters were calculated by considering the lists of annotations separately.

4. RESULTS
For each value of n, we found the threshold which maximized the mean F1-score over all annotators. Results are shown in Table 3a for simplification A and in Table 3b for simplification B.

Table 3: Thresholds resulting in the best F1 for different ns. acc – mean accuracy, σacc – standard deviation of accuracy, F1 – mean F1-score, σF1 – standard deviation of F1-score, errors – mean number of incorrectly found sources.

(a) Simplified A – direct and indirect relations merged
n        threshold  acc   σacc  F1    σF1   errors
1        0.29       0.90  0.02  0.83  0.04  18
2        0.09       0.91  0.02  0.84  0.03  4
{2,3,4}  0.06       0.91  0.01  0.84  0.02  3
3        0.05       0.91  0.01  0.84  0.02  4
4        0.04       0.90  0.02  0.83  0.02  3
5        0.03       0.90  0.02  0.83  0.02  6

(b) Simplified B – indirect relations and lack of relation merged
n        threshold  acc   σacc  F1    σF1   errors
1        0.56       0.95  0.01  0.88  0.03  4
2        0.46       0.96  0.01  0.90  0.04  1
{2,3,4}  0.27       0.96  0.01  0.90  0.03  1
3        0.25       0.96  0.01  0.90  0.03  1
4        0.22       0.96  0.01  0.90  0.03  1
5        0.13       0.95  0.01  0.89  0.02  2

Figure 2: (top) A histogram of n-gram cosine similarities and (bottom) the fraction of each score in each similarity bin (see Section 2 for abbreviation expansions) for n = 1.

For the final list, the fraction of each score in each cosine similarity bin for n = 1 and n = 3 is shown in the bottom panels of Figures 2 and 3, respectively. The cosine similarities are not dramatically more separated in any of the cases, but using n = 1 leads to a significantly higher number of errors, and using n = 5 to a slight increase in the number of errors.

5. DISCUSSION
For most values of n, over 85% of candidate pairs had extreme cosine similarity values (below 0.1 or over 0.9). Two articles with cosine similarity equal to 1 are duplicates, while articles with cosine similarity equal to 0 are completely unrelated. Similarities between those values are not that clear to interpret. Obtaining more pairs with intermediate values would make the results for boundary cases more reliable. After closer examination, very similar pairs which were marked as unrelated turned out to be annotators' mistakes. On the other hand, in the opposite cases (pairs with low cosine similarity but marked as related) the analyzed articles were rewritten; using lemmatization might be sufficient to identify them as similar.

The results for simplification A are satisfying when compared to the agreement among annotators. There was no significant difference between specific n > 1, but for n = 1 there were as many as 18 errors.
The results for the simplifi- cation B are comparable with an agreement among annota- Using different ns did not cause significant changes of ac- tors (see Table 2) and the classifiers could not find a correct curacies and F1-scores of classifiers in both simplified cases source only in one case in which article was mainly based on but n > 1 allows to correctly find more sources than n = 1. some other release and only partially on the detected one. In most n = 1 errors, the algorithm pointed at some more Again, there is very little difference among different n-grams general release about a given topic. which might suggest that in most cases articles use similar phrasing as the source release and the method is efficient. We considered three types of relations between text pairs – lack of relation, common source, and direct sourcing (one In Figures 2 and 3, we show a histogram of cosine similarities based on the other). For the first and the last types of and a stacked bar plot showing fraction of each score in the relation, it was usually possible to distinguish between them 15 3-gram cosine similarity histogram 400 them, and can be detected with about 96% accuracy. A discrimination between not related and related pairs was 350 possible with a 90% accuracy. 300 250 The results might be useful for a broader use although a par- 200 tial supervision in boundary cases would be required. We count 150 suspect that lemmatization, proper quotations filtering and discarding proper nouns might result in achieving higher ac- 100 curacies. Using cross-lingual similarity measures would be 50 another interesting modification. 0.0 0 0.2 0.4 0.6 0.8 1.0 cosine similarity 7. ACKNOWLEDGMENTS fractions of scores We thank Aleš Pečnik for a technical support, and Tjaša for a given similarity (n = 3) Doljak, Naum Dretnik and Sabina Zonta from STA for an- 1.0 NC notations. This research has received funding as RENOIR NF Project from the European Union’s Horizon 2020 research 0.8 IDS and innovation programme under the Marie Sk lodowska- DS Curie grant agreement No. 691152. JCh has been also sup- 0.6 ported by Ministry of Science and Higher Education (Poland), grant No. 34/H2020/2016. 0.4 fraction 0.2 8. REFERENCES [1] M.N. Bechtoldt, C.K.W. De Dreu, B.A. Nijstad, and H.-S. Choi. Motivated information processing, social 0.0 0.0 0.2 0.4 0.6 0.8 1.0 cosine similarity tuning, and group creativity. J. Pers. Soc. Psychol., 99(4):622–637, 2010. [2] O. Oh, M. Agrawal, and H.R. Rao. Community Figure 3: (top) A histogram of cosine similarities and (bot- intelligence and social media services: A rumor tom) a fraction of each score in each similarity bin (see Sec- theoretic analysis of tweets during social crises. MIS tion 2 for abbreviation expansions) for n = 3. Quart., 37(2):407–426, 2013. [3] K. Lerman. Social information processing in news but in the proposed way it was not possible to accurately aggregation. IEEE Internet Comput., 11(6):16–28, identify when two articles had a common source. 2007. [4] V. Niculae, C. Suen, J. Zhang, The method we used has not created a completely clear sep- C. Danescu-Niculescu-Mizil, and J. Leskovec. aration between considered relation types. In the further QUOTUS: The structure of political media coverage work the approach could be improved with lemmatization, as revealed by quoting patterns. In Proceedings of the mapping to WordNet synsets, discarding proper nouns, or a 24th International Conference on World Wide Web, proper treatment of quotations. 
WWW ’15 Companion, pages 798–808, 2015. [5] A. Chmiel, J. Sienkiewicz, M. Thelwall, G. Paltoglou, It is also important to take into account that the experts K. Buckley, A. Kappas, and J.A. Ho lyst. Collective were able to discriminate pairs because of their domain- emotions online and their influence on community life. specific knowledge. Nevertheless, even highly trained in- PLoS ONE, 6(7), 2011. dividuals scored some pairs differently. In many cases, there [6] J. Cho loniewski, J. Sienkiewicz, J. Ho lyst, and can be more then one source release of an article or an article M. Thelwall. The role of emotional variables in the might be based only partially on a given release. classification and prediction of collective social dynamics. Acta. Phys. Pol. A, 127(3):A21–A28, 2015. An important future work will include use of cross-lingual [7] A Parker and JO Hamblen. Computer algorithms for techniques (e.g. [10]) to compute similarities and detect pla- plagiarism detection. IEEE T. Educ., 32(2):94–99, giarism in news articles in different languages. 1989. [8] D. Metzler, Y. Bernstein, W.B. Croft, A. Moffat, and 6. CONCLUSIONS J. Zobel. Similarity measures for tracking information We have presented a case study of estimating usage of STA flow. In Proceedings of International Conference on releases by Slovene news outlets. We applied ”bag on n- Information and Knowledge Management, pages grams” representations of articles and releases with TF-IDF 517–524, 2005. weighting, and compared them pairwise using cosine simi- [9] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko larity. Detected candidate ”article-source release” pairs were Grobelnik. Event registry: Learning about world annotated by experts. events from news. In Proceedings of the 23rd International Conference on World Wide Web, WWW We compared results of automatic source detection with the ’14 Companion, pages 107–110, 2014. annotations, and as expected found that articles have higher [10] http://xling.ijs.si. cosine similarity to releases when they are directly based on 16 DATA ANALYTICS IN AQUACULTURE Joao Pita Costa and Matjaž Rihtar Artificial Intelligence Laboratory Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia joao.pitacosta@ijs.si, matjaz.rihtar@ijs.si ABSTRACT intuitive and inferential. The major factors to consider in the design and purchase of process control and artificial intelligence The specific challenges in aquaculture today reveal needs and software are functionality/intuitiveness, compatibility, flexibility, problems that must be addressed appropriately and in sync with upgrade path, hardware requirements and cost. Of these, the most recent optimization methods. It is now the time to bring intuitiveness and compatibility are the most important. The the techniques of aquaculture to a new level of development and software must be intuitive to the user or they will not use the understanding. In that, one must consider the state of the art system. Regarding compatibility, the manufacturer should be methods of statistics and data mining that permit a deeper insight congruent with open architecture designs so that the chosen into the aquaculture reality through the collected datasets, either software is interchangeable with other software products. from daily data or from sampling to sampling data. This must also be tuned to the expert knowledge of the fish farmers, their procedures and technology in use today. 
In this paper we review the state of the art of data analytics methodology in aquaculture, the data available deriving from the procedures characteristic to this business, and propose mathematical models that permit a deeper insight on the data. We also address the data unknowns and strategies developed that will contribute to the success of the business, leading to discover valuable information from the data that can be made usable, relevant and actionable. Figure 1. Dynamical plots developed for the project aquaSmart, available through a public interface where the fish farmers can upload their data and Categories and Subject Descriptors do a preliminary analysis and visualization. E.3 Data Structures; I.2 Artificial Intelligence; I.6 Simulation and Modelling The models presented in this paper were developed in the context of the EU project aquaSmart [1]. This project aims enhancing the General Terms innovation capacity within the aquaculture sector, by helping companies to transform captured data into knowledge and use this Algorithms, Data Science, Aquaculture knowledge to dramatically improve performance. In particular, the tools constructed in that context (illustrated in Figure 1) serve Keywords the aquafarmers to evaluate feed performance, considering Aquaculture, data analytics and visualization, important factors such as the water temperature and average fish weight, but also underlying factors such as the oxygen level. 1. INTRODUCTION 2. UNIQUE CHALLENGES Modern research and commercial aquaculture operations have begun to adopt new technologies, including computer control It is well known that the production in aquaculture has specific systems. Aquafarmers realize that by controlling the features and objectives associated with it. When talking about the environmental conditions and system inputs (e.g. water, oxygen, adaptation of existing technology, the features important to the temperature, feed rate and stocking density), physiological rates of production in aquaculture come from weather prediction. These cultured species and final process outputs (e.g. ammonia, pH and are the oxygen levels and water temperature, which are very growth) can be regulated [2]. These are exactly the kinds of specific to this activity. The tasks in fish farming carry several practical measurements that will allow commercial aquaculture uncertainties – often expressed by measurements or even facilities to optimize their efficiency by reducing labor and utility evaluations – that permit further optimization [9]. A classic costs. Anticipated benefits for aquaculture process control and example is the aim for a better control on the food loss and food artificial intelligence systems are: increased process efficiency; quality. A contribution of data mining in this context would be of reduced energy and water losses; reduced labor costs; reduced interest to the aquafarming industry, saving or relocating stress and disease; improved accounting; improved understanding resources. of the process. An important variable that remains undetermined during the The technologies and implementation of the technologies complete production pipeline is the exact number of fish. A necessary for the development of computer intelligent margin of up to 10% of number of fries is added to the initial management systems come in a wide variety [8] and enhanced production at time t=0 due to uncertainty of number of deaths in commercial aquaculture production [3]. Today’s artificial the transport. 
That means that we already have a maximum of intelligence (AI) systems offer the aquaculturist a proven 10% more fish than our estimations (assuming that no fries die methodology for implementing management systems that are both during transport or adaptation at t=0). Other than that we can only 17 have less fish than we estimated due to the lost fish because of appropriate adjustments; (4) Life To Date (LTD) cumulative data unknown reasons. This is already an open problem at the level of that is calculated from the time when the fish enters the net as a the bounds for total amount of harvested fish and the description fry to the date of data collection, and will last until the date of the of best-case scenario and worst-case scenario. This represents a harvest. big lack of knowledge about production. In fact, the unknown number of fish until the end of the production is important for the The identification data in input (1) is rather unspecific, as we amount of food given and, consequently, for the resources spent. cannot at this date in time identify the fish one by one as it is done in other animal farming such as cows and pigs. The data in this Feed composition has also a large impact on the growth of input category is distinguished between the group of production animals, particularly marine fish. Quantitative dynamic models indicating localization - Unit - and the individual production exist to predict the growth and body composition of marine fish series of fish - Batch. There is no further distinction in the for a given feed composition over a timespan of several months identification. Batch has to go with Unit. Aquafarmers may have [7]. The model takes into consideration the effects of different batches in one unit or fish from one batch in many units. environmental factors, particularly temperature, on growth, and it The daily data in input (2) is recorded by the aquafarmers on a incorporates detailed kinetics describing the main metabolic daily basis. These data columns follow the development of the processes (protein, lipid, and central metabolism) known to play fish since day one when it enters as a fry. The data inputted major roles in growth and body composition. That showed that mostly follows one batch of fish from the beginning till the end of multiscale models in biology can yield reasonable and useful the production. One input data can have several units but, for results. The model predictions are reliable over several timescales purposes of the algorithms used, we consider only the time spent and in the presence of strong temperature fluctuations, which are in one unit. For some of the algorithms used, the data is split this crucial factors for modeling marine organism growth. way (some data tables don’t have values in the column ‘harvest’) with clear input/output within one unit. 3. UNKNOWNS IN THE DATA The sampling data in input (3) serves the aquafarmer to It is curious that the underlying problems with the data unknowns improve/fix his/her initial Feed Conversion Ratio (FCR) model in aquaculture represent a problem of large dimensions for the with real data. This includes features that can be learned by a industry of aquaculture, in which the production is specific set of data. Those features will later be important for the straightforward. In fact, it is not known at any time in production, algorithms. They often correspond to columns with potential the exact number of fish in production, and therefore it is not effect on the end result. 
Also, they can influence the production possible to calculate with exactness the amount of food needed to (e.g. ‘feeder’). The software will adapt to data and will try to do support an appropriate growth. Furthermore, there are many the analysis and prediction from the available data. Note that the conditionings in the progress of the production that must be taken input will also include data columns unknown to the system and into account and are hard to measure with the existing and optional to the aquafarmer. We cannot predict the relevance of the available technology. In that, it is important to describe some of data on those columns (neither their nature) but will consider them the features of the data including an assessment on its quality and in the overall global analytics. measures to overcome obstacles to the analysis. The input and output variables of the dataset are classified as: numerical and categorical. Numerical variables can be: continuous measured quantities expressed as a float (e.g. ‘av. weight’); discrete quantities expressed as an integer (e.g. ‘number of fish’). Categorical variables can be: regular categorical data including non-ordered classes (e.g. species Bream/Bass); or ordinal classes that can be ordered in levels (e.g. estimations poor/fair/good). From the variables that can be measured it is important to distinguish between: (i) variables that do not change over time, often identifying population attributes (e.g. identifications such as ‘year’ or ‘hatchery’); (ii) variables that can change over time but do not change within a sampling period (e.g. ‘batch’); (iii) variables that change daily, taken into account when samplings occur (e.g. ‘average weight’). Table 1. Classification of values according to time dependence. Figure 2. The proposed data cleaning process for aquaculture data, change in time? direct calculated derived including the update of the metadata in the system and user interaction. yes water temp. FCR, SFR av. weight The daily data, the sampling data and the LTD data in inputs 2, 3 no identification av. weight at t=0 hatchery and 4 fall into three categories: (i) Direct values, that correspond to the direct observation of the aquafarmers on either variables Essentially we have four types of input data according to the values including small errors measured in the field (e.g. sampling impact they assure: (1) identification data that permits the fish measures such as average weight) or precise values provided by farmer to manage the production and correctly identify the fish; external sources (e.g. water temperature or oxygen level); (ii) (2) Daily data that is provided by the fish farmers resulting from Calculated values, that are dependent of a number of other their everyday data input (e.g. ‘date’, ‘av. wt.’, ‘actual feed’, etc.); observed values (e.g. LTD values calculated from the daily data); (3) Sampling data, collected at predetermined points of the fish (iii) Derived values – values deriving from previously available growth timeline, to confirm the model values and make the 18 data (e.g. FCR calculated from the table, given average weight numbers accordingly. Also specifying the influence of sexual and water temperature). maturity and the lack of oxygen, which are done by hand/intuition, have features to take in consideration by the math The original data provided by the aquafarmers has variances/holes model. 
The FCR models in this paper consider only temperature and is not precise because it is not measured automatically but and average weight. instead entered by human hand (with some exceptions such as ‘temperature’). Sometimes it is not entered for 1 or 2 days due to Each aquaculture entity draws an appropriate FCR table to that the bad weather, which complicates the access to the batch of fish. Higher temperature leads to lower energy spent and measurements and to the units themselves (sometimes this adds faster growth, and consequently to a lower FCR. As the fish gets up to 4 days without entries). Sometimes this is due to intentional bigger, he needs more food to increase his biomass in percentage, fasting to readjust features and in that case the data measurements and thus the FCR grows higher with the increase of the average stay the same as the ones in the previous fields, just before fasting weight. The quality of the food and the size of the pellet size are takes place. The major discrepancies should be pushed to the user not considered at this point. At high temperatures (above 30 as a compromise. If the data is missing up to a certain threshold, degrees in the case of bream and bass) low oxygen leads to low the data will be sent back to the user in order to be inputted once conversion to biomass. This is one of the hidden variables in the again after appropriate corrections. The options for the missing model, which should be considered separately at a later stage. One data problem are to consider it as an error and report it to the user of the possibilities would be to penalize the FCR tables for the requesting the missing data, or consider the average from the lack of oxygen. The other variable is the high reproduction of the missing data in the sense of interpolation on a fixed mesh grid. fish in low temperatures and high average weight, which highly affects the growth of the fish. Recall that the Economic FCR is the 4. DATA ANALYTICS IN AQUACULTURE real FCR index following from the quotient between food given to the fish and the fish biomass. When the temperature is too high or Mathematical modeling aims to describe the different aspects of too low we should ignore the data that is filled in with zeros and the real world, their interaction, and their dynamics through considered empirical data. mathematics. It constitutes the third pillar of science and In the following we present the plots of the models for the three engineering, achieving the fulfillment of the two more traditional fish farms in AquaSmart. It includes 3 fish farms. disciplines, which are theoretical analysis and experimentation [4]. Nowadays, mathematical modeling has a key role also in aquaculture. In the following section we will present an overview of that. Growth and reproductive modeling of wild and captive species is essential to understand how much of food resources an organism must consume, and how changes to the resources in an ecosystem alter the population sizes [6]. The FCR is an important performance indicator to estimate the growth of the fish. It is widely used by the aquaculture fish farmers in pair with the Specific Feeding Ratio (SFR). Its Figure 3. Company A: Real data (on the left) and FCR model (on the importance follows from the fact that 70% of the production costs right) for the bream production. in aquaculture are assigned to the food given to the fish during growth. Some of it will fall through the net and some will be spared. 
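As a small worked example of the quantities discussed here, the snippet below computes FCR and FCE (defined further on) for one feeding period and looks up a farm-specific FCR table indexed by water temperature and average weight bands. All numbers, band edges and names are hypothetical and are not taken from the aquaSmart data.

# dry weight of feed given and wet biomass gained over one period (hypothetical)
feed_given_kg = 120.0
biomass_gain_kg = 75.0

fcr = feed_given_kg / biomass_gain_kg      # feed conversion ratio, unitless
fce = 1.0 / fcr * 100.0                    # feed conversion efficiency, in percent
print("FCR = %.2f, FCE = %.1f%%" % (fcr, fce))

# a hypothetical farm-specific FCR table indexed by water temperature band and
# average weight band (higher temperature -> lower FCR, heavier fish -> higher FCR)
fcr_table = {
    ("15-20C", "0-100g"):   1.1,
    ("15-20C", "100-300g"): 1.4,
    ("20-25C", "0-100g"):   0.9,
    ("20-25C", "100-300g"): 1.2,
}

def temperature_band(temp_c):
    return "15-20C" if temp_c < 20.0 else "20-25C"

def weight_band(avg_weight_g):
    return "0-100g" if avg_weight_g < 100.0 else "100-300g"

expected_fcr = fcr_table[(temperature_band(22.5), weight_band(180.0))]
print("expected FCR from the table:", expected_fcr)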
The optimization of the feeding of the fish can carry great benefits to the economic development of the fish farms. Specifically, the FCR permits the aquafarmer to determine how efficiently a fish is converting feed into new tissue, defined as growth [10]. Recall that the FCR is a ratio that does not have any units provided by the formula: FCR = dry weight of feed consumed/wet weight of gain Figure 4. Company B: Real data (on the left) and FCR model (on the right) for the bream production. while the feed conversion efficiency (FCE) is expressed as a percentage as follows: FCE = 1/FCR × 100 There seems to be some controversy among aquatic animal nutritionists as to which is the proper parameter to measure, but in aquaSmart we used FCR (exposing here FCE for completion). Moreover, the FCR and FCE are based on dry weight of feed and fish gain, as the water in dry pelleted feed is not considered to be significant. A typical feed pellet contains about 10% moisture that Figure 5. Company C: Real data (on the left) and FCR model (on the will only slightly improve the FCR and FCE. right) for the bream production. The FCR table allows the fish farmer to assess the amount of food The model (on the right) produced based on the sample data (on to give to the fish according to their average weight and the the left) serves as a base of comparison with the historical data temperature of the water. Each farm has its own FCR table. This provided by a particular fish farm. Thus, with the new real data is an opportunity to create our own table/model by tweaking the getting in our system, the fish farmer can compare it with the 19 model and make an evaluation on the progress of the production. Moreover, the statistical analysis of the results permit a clearer These models complement and confirm the expert knowledge: the visualization of the important features in the data that can boost high values on the right correspond to high fish reproduction in the production and optimize the processes related to it. That will cold water temperatures and high average weight values. On the enable classification and forecast based on the analytics of the other hand, high temperatures represent low levels of oxygen available data. which request higher feeding rate to maintain and increase the growth rate. In that, future work includes the production of guidelines validated by end-users in order to facilitate the application of The big number of peaks in the real data, plotted on the left, further advanced learning methods in aquaculture. correspond to the real values. Typically the input data can be seen within a grid. The following images show the grid view of both ACKNOWLEDGMENTS the real data (on the left) and the FCR model (on the right) for the company C. The authors would like to acknowledge that his work was funded by the project AquaSmart, co-funded by the European Commission with the agreement number 644715 (H2020 Programme). We would also like to thank to Primož Škraba, Luka Stopar, Dunja Mladenić, Pedro Gomes de Oliveira, Ruben Costa, Gerasimos Antzoulatos, Nir Tzohari, John McLaughlin and Gary McManus. Figure 6. The grid view of both the real data (on the left) and the FCR REFERENCES model (on the right) for the company C. We then use least squares method to interpolate the missing [1] aquaSmart Consortium, The aquaSmart Project [Online]. values including all non-peak values as those interpolated values. URL: www.aquasmartdata.eu. [accessed in 22.8.2016]. 
It does so by approximate the solution of overdetermined systems. [2] Bhujel, R. C. (2011). Statistics for Aquaculture. John Wiley The average weight must be represented using specific values that & Sons. are important in the fish production decision making, and eventually distinct from fish farm to fish farm. Thus we consider a [3] Beveridge M (2004); Cage Aquaculture; Third Edition, second interpolation to produce a final FCR table that is consistent Oxford, UK. with the systems in use by the fish farms. The nearest neighbours [4] N. R. Draper and H. Smith, Applied regression analysis, 3th algorithm is used here to find the values outside the area [5]. That edition. New York: Wiley, 1998. permits us to consider the complete table of measurements in line [5] Han, J., Kamber, M., & Pei, J. (2011). Data Mining: with the sample data available and the missing values calculated Concepts and Techniques. Elsevier. for the area inside the region. [6] Dobson, A. J., & Barnett, A. (2008). An Introduction to 5. CONCLUSIONS Generalized Linear Models, Third Edition. CRC Press. The challenges of aquaculture for data analytics are very specific [7] Bar, N. S., & Radde, N. (2009). Long-term prediction of fish in the field and must be addressed with the appropriate growth under varying ambient temperature using a multiscale methodology and technology, in tune with the expertise of the fish dynamic model. BMC Systems Biology, 3(1). farmers. The uncertainty of measures, such as the number of fish [8] Lee P.G. 2000. Process control and artificial intelligence until the time of harvest, derives in variances that do not permit a software for aquaculture. Aquacultural Engineering, 23, 13- complete accuracy of some of the calculations. This is particularly 36. important to some of the available tools to monitor the business, such as the feed conversion rate tables in use by the fish farmers [9] Rizzo, G., and Spagnolo, M. 1996. A Model for the Optimal to optimize the production costs. Management of Sea Bass Dicentrarchus Labrax Aquaculture. Mar. Resour. Econ. 11: 267–286. The mathematical models developed in the aquaSmart project and [10] Stickney, R. R. (2007). Aquaculture: An introduction. (C. discussed in this paper aim to contribute to the improvement of Publishing, Ed.). CABI Publishing. the aquaculture procedures, providing a deeper insight on the information retained in the collected data, using state-of-the-art methods of data mining in line with the expert knowledge of the field transferred to the metadata in the data store. 20 Modeling Probability of Default and Credit Limits Zala Herga Jan Rupnik Jožef Stefan Institute Jožef Stefan Institute Jamova 39 Jamova 39 Ljubljana, Slovenija Ljubljana, Slovenija zala.herga@ijs.si jan.rupnik@ijs.si Primoz Škraba Blaž Fortuna Jožef Stefan Institute Jožef Stefan Institute Jamova 39 Jamova 39 Ljubljana, Slovenija Ljubljana, Slovenija primoz.skraba@ijs.si blaz.fortuna@ijs.si ABSTRACT understand. For credit limits, we implemented a linear pro- Creditors carry the risk of their clients not meeting their gramming based approach [5] which provides portfolio opti- debt obligations. In the literature, these events are often mization with risk aversion constraints. referred to as default events. These can be modeled for each company through a probability of default (PD). Measures This paper presents the workflow and the methodology that can be taken to limit the default risk: in this paper we fo- we employed to build a portfolio credit allocation model. 
cused on credit limit. Firstly, we predict PD of a company Due to privacy concerns it does not include any expiremental using a logistic regression model, based on publicly available results and does not discuss any concrete results or aspects financial data. Secondly, we effectively find an optimal port- of real data. folio under risk aversion constraints and show how variation of inputs affects the results. The paper is organized as follows. Section 2 provides an overview of related work. In Section 3 the data that is used in modeling is described. Section 4 first describes the ap- Categories and Subject Descriptors proach and computation of the PD model and then presents Mathematics of computing [Mathematical optimization]: the results. Section 5 provides a short theoretical introduc- [Linear programming, Convex optimization]; Computing method- tion to portfolio optimization and then presents our compu-ologies [Machine learning]: [Supervised learning by re- tation and results. Section 6 concludes the paper. gression] 2. RELATED WORK Keywords The Altman Z-score [1] is a widely used credit-scoring model. PD model, logit, credit limit model, portfolio optimization, It is a linear combination of five commonly used financial in- linear programming, risk management dicators and it predicts company’s degree of PD. Both [6] and [8] argue that Altman Z-score and distance-to-default 1. INTRODUCTION ([7]) are not appropriate to use in the context of small busi- Payment defaults represent a key default risk (also credit nesses. The authors in [6] predict PD using delinquency risk) to creditors. Creditors can limit their risk by either data on French small businesses. They propose a scoring insuring their claims or taking preventative measures before model with an accuracy ratio based solely on information extending a credit. Standard tools to measure default risk about the past payment behavior of corporations. Similarly, include different kinds of credit ratings. [8] forecasts distress in European SME portfolios. They es- timate the PD using a multi-period logit model. They found Our goal was to create a model that predicts a company’s that the larger the SMEs, the less vulnerable they are to the probability of default (PD) and provides credit limit sugges- macroeconomic situation. They also show that SMEs across tions based on the computed PD. One of our constraints Europe are sensitive to the same firm-specific factors. was that the underlying PD model be simple and easy to [2] examine the accuracy of a default forecasting model based on Merton’s bond pricing model [7] and show that it does not produce a sufficient statistic for the PD. [11] compare the predictive accuracy of PD among six data mining meth- ods. They also present a novel “sorting smoothing method” for estimating the real PD. Using a simple linear regression, they show that artificial neural networks produce the best forecasting model. [3] used the Merton model to show that, on contrary to what theory suggests, the difference in re- turns between high and low PD stock is negative and that 21 returns almost monotonically decrease as the PD increases. bins 0.8 However, they found a positive relationship systematic de- pGood fault risk exposure and returns. 0.6 pBad 0.4 On the problem of portfolio optimization, [9] showed that 0.2 Conditional Value at Risk minimization with a minimum 0 expected return can be computed using linear programming 1 2 3 4 5 techniques. 
[5] built on this idea and showed that alterna- WOE 1 tively, one can maximize returns while not allowing large 0.5 risks. 0 3. DATA -0.5 The dataset that we used covers several thousand companies -1 1 1.5 2 2.5 3 3.5 4 4.5 5 from several European countries. Data for each company consists of two parts: financial data and trading data. Figure 1: Bins and WOE on simulated data. The greater Financial data corresponds to publicly available data - bal- the value of the simulated data the greater the evidence that ance sheets and income statements of a company. a company is ’good’. Trading data consists of private information on trades be- tween our data provider and his clients. It contains monthly tus of a company. We transformed each financial indicator data about the sum of trades, outstanding debts, disputed vector into a feature vector by binning and assigning Weight claims and delayed payments. The data is available for years of Evidence (WOE) [10] to it. The idea is as follows: create between January 2010 and June 2016. n bins in range from min to max indicator value and assign each company to the corresponding bin. Then count the 4. PD MODEL number of ’bad’ (defaulted) and ’good’ companies in each All creditors carry the risk of their clients not paying the bin. Then assign WOE to a bin as bills. We can be quite certain that some clients will not P(company = good) pay, however, we have difficulties identifying those clients. log . P(company = bad) Hence, our first task was to compute the probability of de- fault for each client company. The end model should also Since WOE scores will be used as inputs to a linear model, be simple and easy (intuitive) for domain experts to under- they should be a monotonous function over bins, meaning, stand. the higher the financial indicator value the better the com- pany is (if WOE is increasing) or the higher the indicator the Default can be defined in multiple ways, considering the worse the company is (if WOE is decreasing). In Figure 1 available data and how strict we want our model to be. Do we show an example of binning and WOE transformation we consider it a default if the client is only one day late on simulated data. on payments? Or do we let them be 30, 45, 60 days late before taking action? Are we going to consider a client de- Thus, we obtained 40 out of 45 features. Another feature faulted if he owes 10e? What if the client didn’t pay one was the size of the company (based on income) and four of bill but has been paying all the bills after? We are required them were country features (dummy variables - one for each to make a judgment call and choose a definition that meets country). the creditor’s needs best. As described above, we mapped “raw” features into WOE As soon as a client defaulted we removed it from the dataset. features: Meaning, each client can only default once and after that we assume there was no more trading done with them. (x1, x2, ...) → (woe(x1), woe(x2)..) PD of a company is then predicted through logistic distri- There were some companies that were already defaulted at bution function: the beginning of the time-span of our dataset. We removed these companies from the dataset since they provide no use- 1 F (x) = ful information. We also filtered out clients with low sales 1 + exp−(β0+β1·woe(x1)+β2·woe(x2)+...+βn·woe(xn)) since their financial data can be very unreliable and their where βi denotes linear regression coefficients. impact on our model is big compared to the trading volume that they generate. 
4.1 Computation Due to the constraint that the model should be simple and Since we have the response variable observations on monthly easy to understand we chose to model the PD with logistic level, we interpolated features to obtain the same frequency regression. of explanatory variables. We also took into account offset of the financial and trading data: financial data is only made From the available financial data we calculated 45 financial available sometime around June each year for the previous indicators for each company that cover aspects of solvency, year (e.g. in June 2016 we only know financial statuses of liquidity, debt, profitability and operative effectiveness sta- 2015). 22 Missed trading volume vs. disputed claims but how to set the limit? If the limit is to high, the client 25 might not be able to pay the bills, but if the limit is too low, profit is lost on trading volume. The model that we created 20 is inspired by [5], is based on PD calculation presented in X: 6.456 Y: 15.51 first part of this paper and takes into account the level of 15 risk that creditor is willing to take. 10 Let us introduce some standard financial risk-related terms. disputed claims Value at Risk (VaR) is an upper percentile of loss distribu- tion. Probability level is denoted as α. E.g. 95% − V aR of 5 1, 000, 000e means that there is a 0.05 probability that loss will exceed 1, 000, 000e. Conditional Value at Risk (CVaR) 0 is the conditional expected loss under the condition that it 0 50 100 150 200 250 missed trading volume exceeds VaR. CVaR at α = 95% level is the expected loss in the 5% worst scenarios. ω denotes the maximum allowed Figure 2: Missed trading volume vs. disputed claims on CVaR of the portfolio at level α. simulated data. The info box marks 6%-level: if trading was stopped with the worst 6% companies, approx. 7.5 could be We will denote loss associated with the portfolio x and ran- saved on disputed claims and 6.5 · margin profit would be dom vector y (with density p(y)) as f (x, y); CV aR will be lost on trading volume. The figure is included for illustrative noted as φα(x), which is given by purposes and is not based on real data for privacy concerns. Z φα(x) = (1 − α)−1 f (x, y)p(y)dy. f (x,y)>V aRα We then filtered out features that: It has been established [9] that φα(x) can be computed by • had high percentage of missing values (We kept thresh- minimizing the following function: old as a parameter. in this paper, threshold of 0.7 was Z used.) Fα(x, ζ) = ζ + (1 − α)−1 max{f (x, y) − ζ, 0}p(y)dy, y∈ n R • were highly correlated (In this paper, threshold of 0.9 was used). and that the value ζ which attains the minimum is equal to V aRα. After that, 27 features were left in our feature set. Finding credit allocations x ∈ X that maximize the expected profit under CV aR is equivalent to the following optimiza- We used Matlab for computation. The data was trained tion problem [5]: on 42 months and tested on 19 months. Two models were trained: standard logistic model and stepwise logistic model. min − R(x) x∈X,ζ∈R In stepwise regression, explanatory variables are added to a (1) subject to F model by an automatic procedure [4]. Regression coefficients α(x, ζ ) ≤ ω are estimated by maximum likelihood estimation. where R(x) is the expected profit and the set X is given by a set of box constraints (lower and upper bounds on each 4.2 Results component of x). Test data consisted of 20% of the data (not used in training). There were 16 features chosen in stepwise model. 
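The feature filtering step mentioned above, which drops features with too many missing values (threshold 0.7) and one feature from every highly correlated pair (threshold 0.9), could look roughly like the sketch below; the data frame is hypothetical and the exact reading of the 0.7 threshold (allowed share of missing values) is an assumption.

import numpy as np
import pandas as pd

# hypothetical financial-indicator features for five companies
features = pd.DataFrame({
    "solvency":  [0.2, 0.9, np.nan, 0.1, 0.5],
    "liquidity": [1.1, np.nan, np.nan, np.nan, np.nan],
    "debt":      [0.5, 0.6, 0.7, 0.8, 0.9],
    "debt_copy": [0.5, 0.61, 0.7, 0.81, 0.9],   # nearly identical to "debt"
})

# drop features whose share of missing values exceeds 0.7
missing_share = features.isna().mean()
kept = features.loc[:, missing_share <= 0.7]

# drop one feature from every pair with absolute correlation above 0.9
corr = kept.corr().abs()
to_drop = set()
columns = list(kept.columns)
for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        if corr.iloc[i, j] > 0.9 and columns[j] not in to_drop:
            to_drop.add(columns[j])

selected = kept.drop(columns=sorted(to_drop))
print(list(selected.columns))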
Both of 5.1 Computation the models have 0 p-values and return very similar results By using the PDs from the first part of the paper, we can in predicted values. simulate the default events and compute several random sce- narios for our portfolio. Since each company is assigned We evaluated models by comparing the amount of disputed a probability of default we can generate random scenarios claims (true negatives) to the amount of missed trading (where certain companies default) over the full portfolio by volume (false positives) given ceased trading with compa- sampling from independent (with different weight) Bernoulli nies with PD exceeding some threshold (Figure 2). Note, random variables. that expenses based on missed trading volume cannot be directly compared to disputed claims; disputed claims are a By generating a set of sample scenario vectors y direct expense, whereas missed trading volume number con- 1, . . . , yJ with their corresponding probabilities π sist largely of expenses (that a company in that case did 1, . . . , πJ we can ap- proximate F (x, y) by a finite sum: not have). Hence, one needs to multiply the trading volume with company’s (average) margin to obtain actual opportu- J X nity costs. ˜ Fα(x, ζ) = ζ + (1 − α)−1 πj max{f (x, y) − ζ, 0} j=1 5. CREDIT LIMITS MODEL Using Naturally, a question arises once we identify risky clients: how to handle them? Client should have set a credit limit, zj ≥ f (x, yj ) − ζ, zj ≥ 0, j = 1, . . . , J, ζ ∈ R 23 PD vs. amount of credit PD vs. amount of credit the constraints in (1) can be reduced to a system of linear constraints: J X ζ + (1 − α)−1 πj zj ≤ ω (2) j=1 PD PD f (x, yj ) − ζ − zj ≤ 0, zj ≥ 0, j = 1, . . . , J, ζ ∈ R (3) In our case, the expected profit is 0 0.2 0.4 0.6 0.8 1 1.2 0 0.2 0.4 0.6 0.8 1 1.2 R(x) = (1 − pd) · x · margin − pd · x. Approved credit / max credit Approved credit / max credit (a) (b) As for PD modeling, we used Matlab for computation. We PD vs. amount of credit PD vs. amount of credit combined left-side part of constraints from (2) and (3) in a matrix A; the right-hand side was combined in a vector b. We also added additional constraints on x, that are specific to our problem: we set an upper and lower bound (ub and lb correspondingly) to the credit limit. PD PD We then have to solve linear program: Ax ≤ b min −R(x) such that 0 0.2 0.4 0.6 0.8 1 1.2 0 0.2 0.4 0.6 0.8 1 x lb ≤ x ≤ ub Approved credit / max credit Approved credit / max credit (c) (d) 5.2 Results Our method provides an optimal portfolio based on α, ω, Figure 3: Each circle in these scatter-plots represents one margin, PDs, credit limit upper- and lower bounds. By op- company. The x-axis represents the relative amount of ap- timal portfolio we mean monthly credit limit for each com- proved credit. Y-axis represents predicted PD of a company. pany, which takes value between the provided credit limit In 3a default values are used, while Figure (b) is based on bounds. In Figure 3 we present some scatter-plots based on reducing ω by a factor of 10, figure (c) is based on increasing results; the default values used for these graphs are α = 0.95, the margin by a factor of 10 and figure (d) is based on lifting lb = 0, ub = max trading volume (based on historical data) the lower bound to 10% of the upper bound. The scales of and margin = 0.01. Most companies get either zero or max- PDs are omitted on the graphs due to privacy concerns. imum credit approved; there are only some companies the get part of the max credit approved. 
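To make the scenario-based formulation above concrete, the following sketch sets up and solves the CVaR-constrained credit-limit problem as a linear program with NumPy/SciPy instead of the authors' Matlab code. The PDs, margin, α, ω, bounds and number of scenarios are hypothetical, and the loss in a scenario is simplified to the credit extended to the companies that default in it.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

pd_vec = np.array([0.002, 0.005, 0.03])     # predicted PDs of three clients
margin = 0.01                               # assumed profit margin on trading volume
alpha, omega = 0.95, 50.0                   # CVaR level and maximum allowed CVaR
lb, ub = 0.0, 1000.0                        # credit limit bounds per client

n, J = len(pd_vec), 500
scenarios = rng.random((J, n)) < pd_vec     # Bernoulli default indicators y_j
pi = np.full(J, 1.0 / J)                    # scenario probabilities

# decision vector: [x_1..x_n, zeta, z_1..z_J];
# maximize R(x) = sum_i ((1 - pd_i) * margin - pd_i) * x_i, i.e. minimize -R(x)
c = np.concatenate([-((1 - pd_vec) * margin - pd_vec), [0.0], np.zeros(J)])

# CVaR constraint: zeta + (1 - alpha)^-1 * sum_j pi_j * z_j <= omega
row_cvar = np.concatenate([np.zeros(n), [1.0], pi / (1 - alpha)])

# scenario constraints: loss_j(x) - zeta - z_j <= 0, with the loss taken as the
# credit extended to the companies that default in scenario j
rows_scen = np.hstack([scenarios.astype(float), -np.ones((J, 1)), -np.eye(J)])

A_ub = np.vstack([row_cvar, rows_scen])
b_ub = np.concatenate([[omega], np.zeros(J)])
bounds = [(lb, ub)] * n + [(None, None)] + [(0, None)] * J

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("credit limits:", np.round(result.x[:n], 2))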
In 3b we decreased ω by a factor of 10. This is a lot stricter constraint and Default Risk: The Good, The Bad, and The Anomaly consequently there are more companies that get zero credit (March 2015), 2015. approved and many companies that get only a part of max [4] R. R. Hocking. A biometrics invited paper. the credit approved. In 3c we increased margin from 0.01 to 0.1. analysis and selection of variables in linear regression. In practice this means, that the credit-giver is making profit Biometrics, 32(1):1–49, 1976. on trading volume. In 3d, we moved credit lower bound [5] P. Krokhmal, J. Palmquist, and S. Uryasev. Portfolio from zero to 0.1 · ub. This makes the PD threshold stricter. optimization with conditional value-at-risk objective and constraints. Journal of risk, 4:43–68, 2002. 6. CONCLUSION [6] A. Marouani. Predicting default probability using We presented a logit model based on weight of evidence fea- delinquency: The case of french small businesses. tures to predict a company’s probability of default. Standard Available at SSRN 2395803, 2014. and stepwise methods were used to train the data. Both [7] R. C. Merton. On the pricing of corporate debt: The methods provide similar results. risk structure of interest rates. The Journal of finance, 29(2):449–470, 1974. In second part of the paper we introduce an efficient portfolio [8] D. Michala, T. Grammatikos, S. F. Filipe, et al. optimization technique that was used to determine credit Forecasting distress in european sme portfolios. EIF limits for creditor’s clients. We presented the results and Research & Market Analysis Working Paper, 17, 2013. showed how variation of inputs impacts the results. [9] R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 7. REFERENCES 2000. [1] E. I. Altman. Financial ratios, discriminant analysis [10] D. Sharma. Evidence in favor of weight of evidence and the prediction of corporate bankruptcy. The and binning transformations for predictive modeling. journal of finance, 23(4):589–609, 1968. available at: http://ssrn.com/abstract=1925510, [2] S. T. Bharath and T. Shumway. Forecasting default September 2011. with the kmv-merton model. In AFA 2006 Boston [11] I.-C. Yeh and C.-h. Lien. The comparisons of data Meetings Paper, 2004. mining techniques for the predictive accuracy of [3] S. Ferreira Filipe, T. Grammatikos, and D. Michala. probability of default of credit card clients. Expert Pricing default risk: The good, the bad, and the Systems with Applications, 36(2):2473–2480, 2009. anomaly. Theoharry and Michala, Dimitra, Pricing 24 Big Data Analysis Combining Website Visit Logs with User Segments and Website Content Matic Kladnik Blaž Fortuna Pat Moore Jožef Stefan Institute Jožef Stefan Institute Bloomberg LP Jamova cesta 39 Jamova cesta 39 New York, USA Ljubljana, Slovenia Ljubljana, Slovenia pmoore26@bloomberg.net matic.kladnik@ijs.si blaz.fortuna@ijs.si ABSTRACT 2.2 User Segment Data This paper provides three use-cases of combining website visit logs Web ecosystem allows for deeper annotation of users based on their with user segment data, website content and stock volume data. The behavior around the web. For example, visit to a car dealership use-cases provide concrete examples showcasing how to derive website can be noted by a third-party data provider, which has an insights into behavior of users on the website. 
We also present how agreement and some tracking mechanism installed on the to efficiently derive required data using MapReduce technique, and dealership’s website. Such user segment data is further aggregated implement the use-cases using QMiner data analytics platform. and distributed by data providers such as Krux. Connecting visitors to our website with external database of user segments can provide Categories and Subject Descriptors a much richer understanding of our audience and, as we will see in the first use-case, allows for some interesting analysis. Example of Information Systems → World Wide Web → Web Mining user segments covered include estimated household income, General Terms gender, estimated home net worth and others. Algorithms, Management 2.3 Website Content Keywords Content of the pages on our website provide additional source of Web log analysis, Big data, MapReduce, Hadoop, Hive, QMiner data we can use in the analysis when joined with visit logs and user segment data. Content can be represented on different levels of 1. INTRODUCTION granularity: words, entities, topics, etc. Visitors on modern websites leave behind large amounts of traces In our use-cases we collect the content by crawling the URLs in form of a web server log, query logs, pixel tracking logs, etc. mentioned in the visit logs and extracting their content by removing Website owners analyze these traces to better understand how boilerplate. Each page is processed using a standard NLP pipeline visitors navigate their website, identify their interests, and optimize [1], providing us with list of topics, named entities and tickers advertising opportunities. Most typical use-cases are covered by (stocks from companies). end-user tools such as Google Analytics, ComScore, Omniture, etc. More sophisticated use-cases, which require custom or ad-hoc 2.4 Stock Volume Data processing of the data, are covered by big data frameworks such as Our last use-case combines visit logs with market data. We use MapReduce and its implementations such as Apache Hadoop. price and trading volume data provided by Kibot. In this paper we However, standard tools provide little or no support for analysis will be using data in hourly intervals, meaning that trade volume that requires combining visit logs with content presented on the values are accumulated by hours. website. In this paper we present three use-cases which combine big data frameworks and text mining approaches with the goal of 3. PROCESSING deeper understanding of visitors and their habits. We process input data using MapReduce parallelization paradigm. MapReduce processes the parts of data separately in the map phase Paper is organized as follows. First we provide a quick overview of and later joins the partial results in the reduce phase. This allows us the input data, and outline how we prepare it for the use-cases using to easily split processing into multiple computing units that can be MapReduce framework. We conclude by outlining three use-cases executed in parallel. In this paper we use Apache Hadoop [2] as applied to data from a large news website: comparing user MapReduce implementation and Apache Hive [3], which allows us segments, identifying user segments by example article and to execute SQL-like queries on Hadoop. correlating company news visit logs with stock volume. 3.1 Aggregating Visit Logs and Segment Data 2. INPUT DATA First we address the task of joining visit logs with user segment data 2.1 Website Visit Logs using a cookie shared by both datasets. 
Join can be executed using the query presented in Figure 1. The query joins three tables: a table Visit logs provide log of page views on a specific website. Each with the visit logs, a table linking user IDs from visit logs and page view is described by the time, user ID (stored as a first-party segment IDs from segment data, and a table providing segment cookie), page URL, referral URL, user agent, etc. Such information description. is typically collected using a pixel tracking mechanism and is commonly used by all standard web analytics tools. In use-cases We tested the query on one week of visit logs on several cluster we use ComScore as the data provider, compositions and the results are presented in Table 1. We can see how the performance of the cluster improves when adding additional instances. However, when going from 4 to 6 instances, we can see a smaller deduction in time taken. That is due to the last 25 reduce process taking similar amount of time with any number of As we can see in Table 1, Table 2 and Table 3, adding additional instances. task instances to the cluster can greatly affect the performance. However, at some point performance is improved less effectively Table 1. Performance analysis on segment query when adding additional task instances. This depends on the amount Instances Duration Improvement of data, number of files the data is stored in and types of task 2 3104 seconds -- instances. The last reduce task usually takes some time to complete 4 1763 seconds x1.76 and is processed on a single instance, which is why it’s time taken 6 1427 seconds x2.18 to process cannot be improved. INSERT INTO TABLE visitors_segids Table 2. Performance on visit logs data in external table SELECT seg_userid, segid Instances Duration Improvement FROM visit_logs t1, segment_data t2 2 713 seconds -- WHERE t1.seg_userid = t2.seg_userid; 4 446 seconds x1.60 6 344 seconds X2.07 INSERT INTO TABLE visitors_segvals Table 3. Performance on visit logs data in internal table SELECT seg_userid, segid, segment_title Instances Time taken Improvement FROM visitors_segids t1, sikdd_segments t2 2 275 seconds -- 4 265 seconds x1.04 WHERE t1.segid = t2.segid; 6 234 seconds x1.18 Figure 1. The query creates mapping between user IDs and 4. USE CASES segment names by joining visit logs and data segments. In this section we present three use-cases that combine visit logs, segment data, content and stock volume data. Use-cases were implemented in QMiner [4], a data analytics platform for INSERT INTO TABLE user processing large-scale real-time streams containing structured and SELECT userid, collect_set(url) unstructured data. FROM sikdd_visit_logs 4.1 Comparing User Segments GROUP BY userid; The goal of this task is to compare behavior of different user segments on the website. Segments are provided by User Segment Figure 2. Aggregating visited URLs by users IDs. data, which we joined with visit logs and website content. The use- case is prototyped as a web app using QMiner and Node.JS. 3.2 Aggregating Visit Logs and Content Figure 3 shows the interface where the user can select which segments of users should be compared. Query is specified as a We now address the task of aggregating visited pages by user. We collection of segments defining a subset of website visitors. The achieve this by simply grouping visit logs around user ID as show second group can be either a complement of the first group, or can in Figure 2 with the following example query. also be defined through another list of segments. 
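For readers without a Hadoop cluster at hand, the same two aggregation steps – joining the visit logs with the segment data on the shared cookie and collecting the set of visited URLs per user – can be sketched in pandas. The toy rows below reuse the table and column names from the queries above but are otherwise invented.

import pandas as pd

visit_logs = pd.DataFrame({
    "seg_userid": ["u1", "u1", "u2"],
    "url": ["/news/article-a", "/quote/AAPL", "/news/article-b"],
})
segment_data = pd.DataFrame({"seg_userid": ["u1", "u2"], "segid": [101, 202]})
sikdd_segments = pd.DataFrame({"segid": [101, 202],
                               "segment_title": ["engineer", "administrator"]})

# join visit logs with segment data on the shared user ID (first-party cookie),
# then attach human-readable segment titles
visitors_segids = visit_logs.merge(segment_data, on="seg_userid")
visitors_segvals = visitors_segids.merge(sikdd_segments, on="segid")

# collect the set of visited URLs per user (the collect_set step)
urls_per_user = visit_logs.groupby("seg_userid")["url"].apply(set).reset_index()

print(visitors_segvals)
print(urls_per_user)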
Results of the query are inserted into a previously created table. We In the example we compare users identified as engineers versus are calling the collect_set function, which returns an array of users identified as administrators. We can see results for this query targeted column values when grouping the results. This way we get in Figure 4. Report shows some overall statistics and the odds a column with distinct user ID values and all connected URLs. In rations that are significant for the first group. The output can be Table 2 we compare performance on this query when for several directly transformed into instructions for an ad server. cluster compositions and can see linear improvement as we increase the number of instances. In Table 3 we compare this with the performance of the cluster on internal Hive table, as opposed to ad-hoc table pulled from some external source (e.g. Amazon S3). As we can see, the performance can be boosted by reading data from an internal table. When using an internal table, the data is stored locally on the instance. External tables are useful for reading data from an external source (e.g. AWS S3 service), which has to support the HDFS (Hadoop File System). External tables can still be stored on the local HDFS of the instance. Copying data from an external source to an internal table comes in handy when a certain table is used frequently, otherwise the difference in time taken is overturned by the amount of time needed to copy data from an external data source to an internal table. There is also a question of resources availability as moving data to the local instance can fill a lot of the cluster’s available storage space. In some cases, the data is simply too large to be moved to the cluster’s storage. Figure 3. Interface for selecting which segments we want to compare. E.g. engineer vs. administrator 26 trading happens during US trading hours, when also website traffic spikes. If we compare article visits with the trading volume, we can see a clear discrepancy with articles generating more traffic over first few days and the stock having higher trade volume over last few days. Figure 4. Segment query results 4.2 Analyzing Interests for Topics The goal of this task is to understand who is interested in specific topics. Topic can be either selected from a predefined list of categories (e.g. Advertisement), or defined via a document. For example, we can search for articles on Product Innovation, which is not included in the predefined list, by using Wikipedia page on product innovation as a query. Result of such as query is a subset of pages from the website that fit the given topic. We can then join this by using mapping from Section 3.2 to obtain list of visitors that read this content, and, by using mapping from Section 3.1, also their segments. Note that these joins can be executed on one month of data in QMiner in real- near time. Result is several reports. First, we can generate same kind of report as shown in Figure 4, by contrasting readers of identified pages with the rest of the website’s visitors. Second, we can aggregate visited pages and their content in order to identify top topics, people and keywords, as presented in Figure 5 and Figure 6. Figure 5. Results for querying on Advertisement input text 4.3 Comparing traffic and stock volume data As the last use-case we will check for correlation between the stock volume data and a combination of visits logs and content pages. All examples will be focused on AAPL (Apple Inc.) 
4.2 Analyzing Interests for Topics
The goal of this task is to understand who is interested in specific topics. A topic can either be selected from a predefined list of categories (e.g. Advertisement) or defined via a document. For example, we can search for articles on Product Innovation, which is not included in the predefined list, by using the Wikipedia page on product innovation as a query. The result of such a query is a subset of pages from the website that fit the given topic. We can then join this, by using the mapping from Section 3.2, to obtain the list of visitors that read this content and, by using the mapping from Section 3.1, also their segments. Note that these joins can be executed on one month of data in QMiner in near real-time.

The result is several reports. First, we can generate the same kind of report as shown in Figure 4, by contrasting readers of the identified pages with the rest of the website's visitors. Second, we can aggregate the visited pages and their content in order to identify top topics, people and keywords, as presented in Figure 5 and Figure 6.

Figure 5. Results for querying on the Advertisement input text.

Figure 6. Word cloud with top keywords from pages discussing the topic of Advertisement.

4.3 Comparing Traffic and Stock Volume Data
As the last use-case we check for correlation between the stock volume data and a combination of visit logs and content pages. All examples are focused on AAPL (Apple Inc.) and we look at the data aggregated by hour. We start by creating two time-series from the visit logs: (a) the number of visits to the AAPL quote page, and (b) the number of visits to articles related to Apple. The first time-series we can obtain directly from the visit logs. For the second, we first have to identify significant mentions of the entity Apple in the news articles and check their visit statistics. We compare these with the trade volume of the AAPL stock. The intuition is that the more news there is about a specific company, the more trading there will be in its stock.

First we compare quote page visits with the trading volume. As we can see in Figure 7, the stock volume values fluctuate in similar patterns and time intervals as the number of requests. Most of the trends can be explained by daily patterns, where most of the trading happens during US trading hours, when website traffic also spikes. If we compare article visits with the trading volume (Figure 8), we can see a clear discrepancy, with the articles generating more traffic over the first few days and the stock having a higher trade volume over the last few days.

Figure 7. AAPL quote requests and volume.

Figure 8. AAPL site visits and volume.

We can also check the correlation between the trade volumes and the visit data. Figure 9 shows the correlation between quote requests and stock volume. The x-axis represents the offset of the request time with respect to the volume values time, measured in hours. For example, when the offset is -1, we compare quote request data at hour (h - 1) with stock volume data at hour h. The largest correlation between the number of quote requests and the stock volume values is within the same hour (no offset); in this case the correlation coefficient is significant, with a value of 0.62. If we take the number of quote requests one hour after the volume values, we still get a significant correlation coefficient of 0.46, and an even higher 0.54 if we take quote requests from one hour prior to the volume values. The correlation coefficients fall further the larger the offset we take between the stock volume values and the quote request values. In any case we can observe that the correlation between the two series is significantly high.

Figure 9. AAPL requests and stock volume correlation.

The correlation coefficients in Figure 10 tell us that there is some slight correlation between the number of visits to pages that mention Apple Inc. (AAPL) and the volume of the AAPL stock on stock exchanges when taking visits prior to the volume movement into account. Similarly to Figure 9, the x-axis in Figure 10 represents the offset of the site visit time with respect to the time in the stock volume data. There is a small jump in correlation when taking site visit values that are offset by 2 hours into the past compared to the stock volume time, although even at that point the correlation coefficient is only 0.13. There is a negative correlation when taking site visit values from the hours following the time of the stock market volume values. We can conclude that the correlation between quote requests and stock volume data is much stronger than the correlation between site visits and stock volume data. However, we can also observe that we only get a positive correlation between site visits and stock volume when taking the site visits that occurred prior to the stock volume into account.

Figure 10. AAPL site visits and stock volume correlation.
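The offset correlations in Figures 9 and 10 can be reproduced with a few lines of pandas; the sketch below uses synthetic hourly series, since it is only meant to illustrate the computation, not the original QMiner implementation.

import numpy as np
import pandas as pd

# Hypothetical hourly counts; in the use-case they come from the visit
# logs (quote-page requests or article visits) and from AAPL trade volume.
np.random.seed(0)
visits = pd.Series(np.random.poisson(50, 24 * 7))
volume = pd.Series(np.random.poisson(1000, 24 * 7))

# Offset -1 pairs visits at hour (h - 1) with volume at hour h,
# mirroring the x-axis of Figures 9 and 10.
for offset in range(-3, 4):
    print(offset, round(visits.shift(-offset).corr(volume), 2))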
5. REFERENCES
[1] Štajner, Tadej, et al. A service oriented framework for natural language text enrichment. Informatica (Ljubljana), vol. 34, no. 3, 2010, pp. 307-313.
[2] Apache Hadoop: http://hadoop.apache.org
[3] Apache Hive: https://hive.apache.org
[4] Fortuna, Blaz, et al. QMiner – Data Analytics Platform for Processing Streams of Structured and Unstructured Data. Software Engineering for Machine Learning Workshop, Neural Information Processing Systems, 2014.

Visual and Statistical Analysis of VideoLectures.NET

Erik Novak, Artificial Intelligence Laboratory, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, +386 31 272 332, erik.novak@ijs.si
Inna Novalija, Artificial Intelligence Laboratory, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, +386 1 4773144, inna.koval@ijs.si

ABSTRACT
This paper presents learning analytics tools for visual and statistical analysis of data from a portal of video lectures. While learning analytics methods traditionally deal with the measurement, collection and analysis of data about learners with the aim of improving the learning process, our solutions are targeted at viewers of VideoLectures.NET. The novel VideoLectures Learning Analytics Dashboard and Lecture Landscape tools allow observing, searching and analyzing viewer behavior and, at the same time, efficiently present the information from the portal to the viewers.

General Terms
Learning Analytics, Measurement, Performance, Design.

Keywords
visualization, analytics, data mining, VideoLectures.NET

1. INTRODUCTION
VideoLectures.NET is a free and open access educational video lectures repository. There are over 20,000 lectures by distinguished scholars and scientists at the most important and prominent events, such as conferences, summer schools, workshops and science promotional events. Lectures on the VideoLectures.NET portal are categorized into various categories, such as Arts, Biology, Business, Computer Science, Social Sciences, Technology etc.

In this paper we present tools for visual and statistical analysis of VideoLectures.NET portal data – the VideoLectures Learning Analytics Dashboard and the Lecture Landscape. While the goal of the VideoLectures Learning Analytics Dashboard is to aggregate, harmonize and analyze event data using various data analysis techniques, the Lecture Landscape visualizes the information about videos, categories and authors in an efficient and user-friendly way.

2. RELATED WORK
Thomas & Cook [14] state that visual analysis should be employed as a means to reveal patterns and trends within data, to develop more intuitive perception and to help with in-depth analysis. Conde et al. [8] present a learning analytics dashboard and its application in real-world case studies. A conceptual framework is developed by Bakharia et al. [7]. In our work we adopt both approaches to learning analytics – visual and statistical analysis, which allows a deeper and more intuitive understanding of the obtained results.
3. LEARNING ANALYTICS FOR VIDEOLECTURES
The VideoLectures Learning Analytics Dashboard [1] is a tool developed for the analysis of viewer behavior and for detecting which lectures are interesting for the users. We present the results of the preliminary work and the functionalities of the dashboard.

Data Processing. The data considered in the analysis is a set of log files from the VideoLectures portal that contain raw events on the portal from September 2012 until December 2015. We have processed 11.3 GB of log files that included the ID, timestamp, session, log entry, lecture and other information such as event type, IP address, location (if present) etc.

With the processed raw log files, we established the main event types that appeared in the raw logs: view (the user accessed the lecture webpage), download (the user downloaded a presentation, video etc. from the lecture webpage) and search (the user performed a search at the portal).

In addition to the raw log files, we have also collected the log files dedicated to the behavior of a particular user while watching a particular lecture, which we call ranges log files. These record the actions of the user while watching a video, such as moving forward or backward in the player, skipping a video section etc.

Analysis of Log Files. In order to analyze user behavior on the VideoLectures.NET portal, we have utilized and developed a set of data analysis techniques. These were used in the development of the VideoLectures Learning Analytics Dashboard, which contains both statistical analysis and visual exploration features. The analysis has been performed from four perspectives: the aggregated perspective of all lectures, the perspective of a single lecture, the aggregated perspective of all viewers and the perspective of a single viewer.

In addition, we have developed a set of metrics for viewers and lectures that provide us with an insight into the user behaviour on the VideoLectures.NET portal. The lecture metrics measure the number of views and viewers the lecture has, the average and standard deviation of the time the lecture was watched (in minutes and as a percentage) and the average and standard deviation of the moves going forwards and backwards through the lecture video (in minutes and as a percentage). The viewer metrics measure the number of lecture views the viewer made, the average and standard deviation of the time spent watching (in minutes and as a percentage) and the average and standard deviation of the moves going forwards and backwards through the lecture video (in minutes and as a percentage).
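A minimal sketch of how such lecture metrics can be computed is given below; the column names and the flat one-row-per-session layout are assumptions made for illustration, not the actual ranges log format.

import pandas as pd

# Hypothetical view sessions: one row per (lecture, viewer) session,
# with the watched time in minutes.
views = pd.DataFrame({
    "lecture": ["L1", "L1", "L2", "L1"],
    "viewer": ["v1", "v2", "v1", "v1"],
    "watched_min": [12.0, 30.5, 7.2, 4.0],
})

# Per-lecture metrics: number of views, distinct viewers,
# mean and standard deviation of watch time.
lecture_metrics = views.groupby("lecture").agg(
    views=("viewer", "size"),
    viewers=("viewer", "nunique"),
    watch_mean=("watched_min", "mean"),
    watch_std=("watched_min", "std"),
)
print(lecture_metrics)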
First we analyzed a set of the 1000 most popular (by views) videos relevant to the Data Science category. Those videos produced 7,055,427 views and 3,020,090 downloads in the period from September 2012 until December 2015. The number of visitors for these videos was 1,045,860. Moreover, there were 866,912 searches at the portal. We have then analyzed all of the log files, for which the statistics can be found on the Learning Analytics Dashboard.

3.1 VideoLectures Learning Analytics Dashboard
Our interest is not only to make a statistical analysis of viewer behavior but also to have visual exploration features that enable us to see the traffic and activities on the VideoLectures portal. For this we created graphs that show the number of views, downloads and searches the viewers make. An example can be seen in Figure 1, which shows the view trends.

Figure 1. VideoLectures views trend. It shows the total number of lecture views from September 2012 until December 2015.

Statistics for a particular lecture are also available. On the per lecture tab one can search for the desired lecture and get its title, description, the measures returned by the lecture metrics and a graph showing the lecture's activity. Figure 2 shows the lecture information and measures of the "Deep Learning in Natural Language Processing" lecture.

Figure 2. Lecture information for the "Deep Learning in Natural Language Processing" lecture. It displays the basic lecture information, measures and a graph of its activity.

The overall statistics of the viewers can also be found under the all viewers tab. It shows the distribution of the viewers across countries (see Figure 3) and other statistics measured with the viewer metrics.

Figure 3. Viewer distribution across countries. Most viewers come from the United States.

To implement the VideoLectures Learning Analytics Dashboard, we used QMiner [12] for processing the data, Node.js [3] for creating the server and the dashboard, and Highcharts [4] for the graphical implementation and dynamics support. Using the interactive VideoLectures Learning Analytics Dashboard, it is possible to check which lectures are of broad interest at the VideoLectures.NET portal, how the users of the portal behave through time, where the users come from and how they react to specific videos.

4. VIDEOLECTURES EXPLORER
Our interest is not only to analyze the viewer behavior but also to see the similarities between lectures. For this we developed the VideoLectures Explorer [2], a tool for exploring the lectures published on VideoLectures.NET. The tool enables the user to search through the lectures and find similarities between them, e.g., to find lectures of a specific category or presenter of interest. Here we explain the method used for visualizing lecture similarities and present the tool's functionalities.

4.1 Data Acquisition
The database used for the visualization and the basic statistical analysis contains data on all lectures found on VideoLectures.NET. We have acquired data for 23,224 lectures, keynotes, interviews, events etc. For each lecture the database contains the lecture's title and description, the name of the presenter and his affiliation with city and country, the lecture's publication date, video duration, its parent event, the number of views and the scientific categories the lecture belongs to. The scientific categories have a hierarchical structure decided by the VideoLectures development team, which we use in the visualization process. The database was constructed using the VideoLectures API [5].

4.2 Methodology
Our objective is to draw the lectures into a two-dimensional vector space to allow plotting on the computer screen, while the similarities between the lectures are maintained. The method we used has been described in [11]. Here is a quick summary: using the bag-of-words [13] representation, we represent the lectures as vectors in a high-dimensional vector space, where each dimension corresponds to one word from the vocabulary. These are then used to construct the term-document matrix. Using Latent Semantic Indexing [10] we merge the dimensions associated with terms that have similar meaning and get the most suitable set of words to describe the corresponding lectures. After that we use Multidimensional Scaling [9] to reduce the dimensionality of the original multidimensional vectors and map them onto two dimensions where the distances between the vectors are preserved as well as possible. Once we have the two-dimensional coordinates we can draw the landscape using the preferred visualization library.

The features used for representing the lecture similarities are the lecture's title, description, categories and parent event. We used the algorithm described above to calculate the lecture coordinates and draw the landscape using the d3.js visualization library [6].

Figure 4. The landscape of machine learning lectures, created using the "Machine Learning" keyword. Lectures that are more similar are brought closer together.
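The pipeline of Section 4.2 can be approximated with scikit-learn; the snippet below is an illustrative sketch on toy documents (the authors used QMiner), with TF-IDF standing in for the plain bag-of-words weighting.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS

# Toy "lectures" (title + description text); the real features also
# include the categories and the parent event.
docs = [
    "deep learning for natural language processing",
    "convolutional neural networks for image recognition",
    "bayesian methods in machine learning",
    "introduction to art history",
]

# Term-document matrix (bag-of-words with TF-IDF weighting).
X = TfidfVectorizer().fit_transform(docs)

# Latent Semantic Indexing: merge dimensions of semantically related terms.
X_lsi = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

# Multidimensional Scaling down to two dimensions for plotting,
# preserving pairwise distances as well as possible.
coords = MDS(n_components=2, random_state=0).fit_transform(X_lsi)
print(coords)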
4.3 Explorer Functionality
Each lecture is presented by a point whose size maps to the number of views. Similarity between lectures is mapped to the distance between points; more similar lectures are brought closer together. Hovering over a point brings up a tooltip containing information about the lecture: its title and description, the name of the presenter and his affiliation, the language in which the lecture was presented, which scientific categories it belongs to, its duration, when it was published and the number of views since the last database update. Any data attribute for which values are not available is omitted from the tooltip. Landmarks show areas populated with lectures that are mostly of the same category. The user can zoom in and out of the landscape to get a more detailed look at the lectures. An example of the landscape can be seen in Figure 4, which shows the machine learning lecture landscape.

Clicking on a point shows the lecture's information: its title and presenter, whether the lecture is public and enabled, and a link to the lecture video located on VideoLectures.NET. Clicking on the lecture link opens the corresponding lecture video on the portal.

When the landscape is generated, the dashboard also shows an additional information window (see Figure 5). It has two parts: the first part is the query information, containing the names of the presenters, organizations and categories, the minimum and maximum number of views, the organization location and the language the user used to query the data. The second part contains the basic statistics about the queried data: the number of lectures in the queried data, the total number of lecture views and the scientific categories with the number of their occurrences in the queried data. Clicking on a category in the dashboard automatically queries the data using the category name, and the information window is updated along with the landscape view.

Figure 5. The additional information window created using the "Machine Learning" keyword. It contains the query keywords and the overall statistics about the queried lectures.

5. CONCLUSION AND FUTURE WORK
In this paper we presented learning analytics tools for visual and statistical analysis of data from the VideoLectures.NET portal. While learning analytics methods traditionally deal with the measurement, collection and analysis of data about learners with the aim of improving the learning process, our solutions are targeted at viewers of VideoLectures. The novel VideoLectures Learning Analytics Dashboard and Explorer tools allow observing, searching and analyzing viewer behavior and, at the same time, efficiently present the information from the portal to the viewers. Future work will include the development of more efficient learning analytics suggestions for the VideoLectures.NET portal and providing recommendations on how to increase viewer engagement with the portal.

6. ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the EDSA EU H2020 project (Contract no: H2020-ICT-643937).
7. REFERENCES
[1] Videolectures learning analytics | dashboard, http://learninganalytics.videolectures.net, accessed: 2016-09-09.
[2] Videolectures learning analytics | landscape, http://explore.videolectures.net, accessed: 2016-09-09.
[3] Node.js, https://nodejs.org/en/, accessed: 2016-09-09.
[4] Interactive javascript chart for your website | highcharts, http://www.highcharts.com/, accessed: 2016-09-09.
[5] Swagger ui, http://videolectures.net/site/api/docs/, accessed: 2016-09-09.
[6] D3.js – data-driven documents, http://d3js.org/, accessed: 2016-09-09.
[7] A. Bakharia, L. Corrin, P. de Barba, G. Kennedy, D. Gašević, R. Mulder, D. Williams, S. Dawson and L. Lockyer. A conceptual framework linking learning design with learning analytics. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge, ACM, pages 329-338, 2016.
[8] M. Á. Conde, F. J. García-Peñalvo, D. A. Gómez-Aguilar and R. Therón. Exploring Software Engineering Subjects by Using Visual Learning Analytics Techniques. IEEE Revista Iberoamericana de Tecnologias del Aprendizaje, 10(4), pages 242-252, 2015.
[9] T. F. Cox and M. A. Cox. Multidimensional Scaling. CRC Press, New York, 2000.
[10] S. T. Dumais. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1):188-230, January 2004.
[11] B. Fortuna, D. Mladenić and M. Grobelnik. Visualization of temporal semantic spaces. In Semantic Knowledge Management, pages 155-169, Springer, 2009.
[12] B. Fortuna, J. Rupnik, C. Fortuna, M. Grobelnik, V. Jovanoski, M. Karlovcec, B. Kazic, K. Kenda, G. Leban, J. Novljan, M. Papler, L. Rei, B. Sovdat, L. Stopar and A. Muhic. QMiner – Data analytics platform for processing streams of structured and unstructured data. Software Engineering for Machine Learning Workshop, Neural Information Processing Systems, 2014.
[13] G. Salton. Developments in automatic text retrieval. Science, 253(5023):974-979, August 1991.
[14] J. J. Thomas and K. A. Cook. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE CS Press, Los Alamitos, 2005.

Spatio-temporal clustering methods

Matej Senožetnik, Luka Bradeško, Blaž Kažič, Dunja Mladenić, Tine Šubic
Jožef Stefan Institute and Jožef Stefan International Postgraduate School
Jamova cesta 39, 1000 Ljubljana, Slovenia

ABSTRACT
Tracking a person, an animal or a vehicle generates a vast amount of spatio-temporal data that has to be processed and analyzed differently from the ordinary data generally used in knowledge discovery. This paper presents existing spatio-temporal clustering algorithms suitable for such data and compares their running time, noise sensitivity, quality of results and the ability to discover clusters according to non-spatial, spatial and temporal values of the objects.

Keywords
clustering, spatio-temporal data, density-based algorithms

1. INTRODUCTION
Due to the emerging field of ICT and the rapid development of sensor technologies, a lot of spatio-temporal data has been collected in the past few years. By processing and enriching raw spatio-temporal data we aim at extracting semantic information, which is a basic requirement for the comprehension and later usage of this data.
Normally, the very first step of this process is clustering raw GPS coordinates into more distinct groups of points which already carry some semantics, such as whether the points belong to a trajectory or to a stationary point (i.e., a stay point).

This paper aims to investigate different methods used for clustering spatio-temporal data generated by mobile phones, collected as timestamped GPS coordinates of the phone's location. By clustering the collected coordinates, we obtain so-called stay points (also referred to as points of interest or stop points [15]), which are the points in space where a moving object has stayed within a certain distance threshold for a longer period of time [16]. When the stay points are calculated, we can process the data further to calculate the most frequently visited locations (i.e., frequent locations), which is the fundamental building block for further advanced analytics use-cases, such as next location prediction [6].

In general, clustering methods are separated into the following categories:

• Partitioning methods - classify data into k groups or partitions,
• Hierarchical methods - hierarchically decompose given datasets,
• Density-based methods - are based on cluster density, where clusters stop growing when the neighbourhood density stops exceeding a given threshold,
• Model-based methods - definition copied from [1]: "In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method locates the clusters by clustering the density function. It reflects spatial distribution of the data points."

The most appropriate methods for clustering spatio-temporal data are density-based methods, as they regard clusters as dense regions of objects in a data space that are separated by regions of low density [7]. Thus we will focus on density-based methods and their application to clustering GPS coordinates.

2. DENSITY-BASED ALGORITHMS
Table 1 shows some of the more common density-based algorithms used for the detection of stay points. For a successful detection of stay points, it is important that the clustering algorithm uses temporal information alongside the bare spatial data. As indicated in Table 1, most of the algorithms do make use of it, except for DBSCAN. The quality of an algorithm also depends on its noise sensitivity (the less sensitive the algorithm is, the better), as cellphones' GPS coordinates are frequently noisy due to connection glitches (for instance, when moving through forests, staying inside buildings or in bad weather conditions). Some algorithms, such as DBSCAN, ST-DBSCAN and OPTICS, return only clusters, while CB-SMoT, SMoT and SPD return stay points (also referred to as stops) or paths (also referred to as moves).

DBSCAN has an overall average running time of O(n log n), and its worst-case run time complexity is O(n²). The running time depends on the parameter choice and the version of the implementation. OPTICS has a similar time complexity but is 1.6 seconds slower than DBSCAN. ST-DBSCAN has the same running time as DBSCAN.

Table 1: The most common density-based algorithms used for the detection of stay points [14].
Algorithm     Spatio-temporal   Noise sensitivity   Returns stay points/path
DBSCAN        no                no                  no
ST-DBSCAN     yes               no                  no
SMoT          yes               yes                 yes
CB-SMoT       yes               no                  yes
SPD           yes               yes                 yes
OPTICS        yes               no                  no

2.1 Density Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN [5] is a density-based clustering algorithm which identifies arbitrarily-shaped clusters and detects noise in a given dataset. The algorithm starts with the first point in the dataset and detects all neighboring points within a given distance. If the total number of these neighboring points exceeds a certain threshold, all of them are treated as part of a new cluster. The algorithm then iteratively collects the neighboring points within a given distance of the core points. The process is repeated until all of the points have been processed. DBSCAN's advantages are that it robustly detects outliers, only needs two parameters (Eps and MinPts), is appropriate for large datasets, and the data input order does not interfere with the results [12].
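As an illustration of how DBSCAN behaves on raw GPS fixes, the sketch below uses scikit-learn's implementation with the haversine metric; the coordinates and the 100 m radius are made-up values, not data from the paper.

import numpy as np
from sklearn.cluster import DBSCAN

# Toy GPS fixes (latitude, longitude) in degrees.
points_deg = np.array([
    [46.0500, 14.4680], [46.0501, 14.4682], [46.0499, 14.4679],  # stay A
    [46.0569, 14.5058], [46.0570, 14.5057], [46.0568, 14.5059],  # stay B
    [46.2000, 14.6000],                                          # lone fix
])

# With metric="haversine" the inputs must be in radians and eps is an
# angular distance: radius in metres divided by the Earth radius.
eps_m = 100.0
db = DBSCAN(eps=eps_m / 6_371_000.0, min_samples=3, metric="haversine")
labels = db.fit_predict(np.radians(points_deg))

# Label -1 marks noise; other labels are cluster ids (candidate stay areas).
print(labels)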
Numerous research studies have extended DBSCAN. One example is GDBSCAN [13], a generalization of the original DBSCAN that can cluster point objects as well as spatially extended objects. Another extended algorithm is DJ-Cluster [17], used for discovering personal gazetteers; it attaches semantic information to clusters and requires a list of points of interest. A further extension is ST-DBSCAN, which is described below.

ST-DBSCAN [4] is another algorithm based on DBSCAN. It makes use of DBSCAN's ability to discover clusters of various shapes, while improving some of the weak points of the original algorithm. It adds temporal data to the clustering results, identifies noisy objects when the input data has varying densities, and more accurately differentiates adjacent clusters.

2.2 Stops and Moves of Trajectory (SMoT)
The SMoT [2] algorithm divides the data into two sets, called moves and stops. The easiest way to understand how SMoT works is by reviewing the example shown in Figure 1. There are three potential stop candidates with geometries Rc1, Rc2 and Rc3, each with information about its minimum time duration. From the figure we can observe that the point p0 is not inside any of these geometries; therefore it is a candidate for the move dataset. The next few points are inside the first stop candidate (Rc1) and also exceed the minimum time duration which is specified for every geometry. In candidate Rc2, the point duration does not exceed the minimum threshold; therefore Rc2 is not a stop.

Figure 1: Example of trajectory points with three possible candidate stops [2].

2.3 Clustering-Based Stops and Moves of Trajectory (CB-SMoT)
The CB-SMoT [11] algorithm is an alternative to the SMoT algorithm and addresses one of its main drawbacks - the incapability to detect stay points that are not predefined by the user. It uses clustering methods to automatically detect stay points. The idea behind this method is that when we move around points of interest (such as museums, monuments, night-clubs, etc.), we move slower than when we are traveling from one place to another. In the first step, the potential stops (the slower parts of a trajectory) are identified using a variation of the DBSCAN algorithm which considers one-dimensional trajectories and their speed. In the second step, the algorithm identifies the locations of the potential stops (clusters) found in the first step. The authors report [11] that their algorithm discovers fewer incorrect stops compared to the SMoT algorithm.

2.4 Stay Point Detection (SPD)
The SPD [9] algorithm works by detecting whether the observed entity has spent more than 30 minutes within a radius of 100 meters. If this happens, the region is detected as a stay point. Both the time and the distance threshold parameters are adjustable and are chosen depending on the use-case and the accuracy of the raw data.

The main advantage of SPD is that it does not need predefined structures. It is not computationally demanding, but it is sensitive to the accuracy of the data usually generated by a GPS receiver; for different GPS accuracies it returns slightly different positions for the same location. It is also sensitive to noise, but this can be partially reduced by adjusting the parameters of the algorithm.
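A simplified sketch of the SPD thresholding idea is given below; it illustrates the 100 m / 30 min rule described above, not the exact published procedure [9], and all helper names are made up.

from math import radians, sin, cos, asin, sqrt

def haversine_m(p, q):
    # Great-circle distance in metres between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def stay_points(track, dist_m=100.0, min_stay_s=30 * 60):
    # track: list of (timestamp_s, lat, lon) tuples sorted by time.
    stays, i = [], 0
    while i < len(track):
        j = i + 1
        while j < len(track) and haversine_m(track[i][1:], track[j][1:]) <= dist_m:
            j += 1
        if track[j - 1][0] - track[i][0] >= min_stay_s:
            pts = track[i:j]
            stays.append((sum(p[1] for p in pts) / len(pts),   # mean latitude
                          sum(p[2] for p in pts) / len(pts)))  # mean longitude
        i = j
    return stays

# Two fixes 40 minutes apart within 100 m yield one stay point.
print(stay_points([(0, 46.0500, 14.4680), (2400, 46.0501, 14.4681)]))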
Figure 2: The OPTICS result is called a reachability plot [10].

2.5 Ordering Points to Identify the Clustering Structure (OPTICS)
OPTICS [3] is an algorithm used for finding density-based clusters in spatio-temporal data. Though it works in a similar way to DBSCAN, OPTICS improves on DBSCAN's biggest weakness, the failure to detect clusters when the density of the data varies. In Figure 2 we can observe that the OPTICS algorithm generates an easily visualized ordering of points, which can be used to extract partitions and hierarchical clusters [10]. Zheng et al. [16] used OPTICS for clustering stay points. In their article, they present the idea that, through the use of a statistically significant number of user-generated GPS trajectories, the correlation between personal geographical locations implicit in a person's location history enables the provision of valuable services, such as personalized recommendations and targeted sales promotions.

2.6 Other algorithms
There exist many other approaches to dealing with spatial and spatio-temporal data. Some of these are developed as extensions (for example, DB-SMoT [12]) of well-known algorithms, while others take an entirely new approach (TRACLUS [8]). It is important to note that other data can also be used alongside coordinates and timestamps.

3. DISCUSSION
The goal of this discussion is to find an algorithm capable of clustering a user's raw historical location data, as tracked by a mobile phone. Referring back to Table 1, we want an algorithm which has the following properties:

• is able to cluster spatio-temporal data;
• is noise insensitive;
• returns stay points and paths.

As we already stated at the beginning of this paper, besides spatial data, temporal information is one of the most important additional pieces of information for stay point detection algorithms; it enables better stay point detection performance and empowers additional time-related capabilities, such as detecting the time spent at each stay point. By using an algorithm that clusters data only by spatial information, we lose part of the useful information (in our case, the time spent at a location), as well as the order of stay points on the timeline, which makes us unable to do further analysis, such as future location prediction and plotting frequency graphs. Due to this shortcoming, and also due to the varying cluster densities, algorithms such as DBSCAN, or its improved version DJ-Cluster, are not appropriate for our use case.

The authors of [14] point out that the problem of the DB-SMoT algorithm is that the quantile function requires a priori knowledge of the proportion between the points inside potential stops and the total points in the dataset. This proportion varies among datasets, since users sometimes spend a whole day inside a stop (i.e., the proportion is one), while on other occasions they might be visiting more than ten stops (i.e., the proportion is much smaller than one). Due to this, we need an online algorithm that can function without prior knowledge of the stay point to path ratio. The SMoT algorithm uses predefined regions, which can be a problem if we want to detect stay points outside of those zones (such as outdoors). The SPD, SMoT and CB-SMoT algorithms are all sensitive to noise, but have the advantage of returning stay points ordered by timestamps. With other algorithms, such as DBSCAN and OPTICS, we need to find the right sequence ourselves, independently of the algorithm. Given our requirements, we propose CB-SMoT, SMoT and SPD as the most suitable for clustering spatio-temporal data collected from a mobile phone.

An efficient algorithm must be insensitive to noise. Currently, GPS has reception problems in narrow valleys, forests or inside buildings, so it is important to use different data sources (for example, Wi-Fi networks or activity recognition) for accurate stay point detection. Modern mobile phones nowadays also provide rudimentary activity recognition, which we can use to compare our own results against.
If the activity recognition system is not detecting movement (the person is standing still) and our collected GPS coordinates are consistently far apart, we may be facing a problem with GPS accuracy. In such cases, additional data sources should be considered for a more accurate measurement.

To the best of our knowledge, the SPD algorithm is currently considered the best solution for the detection of stay points, but it suffers from detecting false stay points and paths. This problem can be alleviated by running multiple iterations of the algorithm on the resulting dataset. This is already outside of the scope of this paper and will be described in a separate paper which is currently under preparation.

4. ACKNOWLEDGMENT
This work was supported by the Slovenian Research Agency and the ICT program of the EC under project OPTIMUM (H2020-MG-636160).

5. REFERENCES
[1] Clustering analysis. http://www.tutorialspoint.com/data_mining/dm_cluster_analysis.htm.
[2] Luis Otavio Alvares, Vania Bogorny, Bart Kuijpers, Jose Antonio Fernandes de Macedo, Bart Moelans, and Alejandro Vaisman. A model for enriching trajectories with semantic geographical information. In Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems, GIS '07, pages 22:1-22:8, New York, NY, USA, 2007. ACM.
[3] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, pages 49-60, 1999.
[4] Derya Birant and Alp Kut. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 60(1):208-221, 2007.
[5] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Pages 226-231. AAAI Press, 1996.
[6] Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. Next place prediction using mobility Markov chains. In Proceedings of the First Workshop on Measurement, Privacy, and Mobility, page 3. ACM, 2012.
[7] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17:107-145, 2001.
[8] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. Trajectory clustering: A partition-and-group framework. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 593-604, New York, NY, USA, 2007. ACM.
[9] Quannan Li, Yu Zheng, Xing Xie, Yukun Chen, Wenyu Liu, and Wei-Ying Ma. Mining user similarity based on location history. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '08, pages 34:1-34:10, New York, NY, USA, 2008. ACM.
[10] Chris Mueller. Data Preparation. 2005.
[11] Andrey Tietbohl Palma, Vania Bogorny, Bart Kuijpers, and Luis Otavio Alvares. A clustering-based approach for discovering interesting places in trajectories. In Proceedings of the 2008 ACM Symposium on Applied Computing, SAC '08, pages 863-868, New York, NY, USA, 2008. ACM.
[12] Parya Pasha Zadeh Monajjemi. A clustering-based approach for enriching trajectories with semantic information using VGI sources. 2013.
[13] Jörg Sander, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2):169-194, June 1998.
[14] Khoa A. Tran, Sean J. Barbeau, and Miguel A. Labrador. Automatic identification of points of interest in global navigation satellite system data: A spatial temporal approach. Pages 33-42, 2013.
[15] Y. Zheng and X. Xie. Learning location correlation from GPS trajectories. In 2010 Eleventh International Conference on Mobile Data Management, pages 27-32, May 2010.
[16] Yu Zheng. Trajectory data mining: An overview. ACM Transactions on Intelligent Systems and Technology, 6(3):1-41, 2015.
[17] Changqing Zhou, Dan Frankowski, Pamela Ludford, Shashi Shekhar, and Loren Terveen. Discovering personal gazetteers: An interactive clustering approach. In Proceedings of the 12th Annual ACM International Workshop on Geographic Information Systems, GIS '04, pages 266-273, New York, NY, USA, 2004. ACM.
Indeks avtorjev / Author index

Berčič Katja 9
Bradeško Luka 5, 33
Cattaneo Gabriella 9
Cerle Gaber 9
Choloniewski Jan 13
Fortuna Blaž 21, 25
Fuart Flavio 9
Herga Zala 21
Karlovčec Mario 9
Kažič Blaž 33
Kladnik Matic 25
Leban Gregor 13
Maček Sebastijan 13
Mladenić Dunja 33
Moore Pat 25
Novak Erik 29
Novalija Inna 29
Pita Costa Joao 17
Rehar Aljoša 13
Rihtar Matjaž 17
Rupnik Jan 21
Senožetnik Matej 5, 33
Škraba Primož 21
Šubic Tine 33
Urbančič Jasna 5

Konferenca / Conference: Izkopavanje znanja in podatkovna skladišča (SiKDD) / Data Mining and Data Warehouses (SiKDD)
Uredila / Edited by: Dunja Mladenić, Marko Grobelnik