Volume 42 Number 1 March 2018 Special Issue: AI in Slovenia Anniversary Edition Guest Editors: Mitja Luštrek Jure Žabkar Marko Grobelnik 1977 Editorial Boards Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive com- munity; scientific and educational as well as technical, commer- cial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accep- ted by at least two referees outside the author’s country. In addi- tion, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and infor- mation industry are presented through commercial publications as well as through independent evaluations. Editing and refereeing are distributed. Each editor from the Editorial Board can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Edi- torial Board. Referees should not be from the author’s country. If new referees are appointed, their names will appear in the list of referees. Each paper bears the name of the editor who appointed the referees. Each editor can propose new members for the Edi- torial Board or referees. Editors and referees inactive for a longer period can be automatically replaced. Changes in the Editorial Board are confirmed by the Executive Editors. The coordination necessary is made through the Executive Edi- tors who examine the reviews, sort the accepted articles and main- tain appropriate international distribution. The Executive Board is appointed by the Society Informatika. Informatica is parti- ally supported by the Slovenian Ministry of Higher Education, Science and Technology. Each author is guaranteed to receive the reviews of his article. When accepted, publication in Informatica is guaranteed in less than one year after the Executive Editors receive the corrected version of the article. Executive Editor – Editor in Chief Matjaž Gams Jamova 39, 1000 Ljubljana, Slovenia Phone: +386 1 4773 900, Fax: +386 1 251 93 85 matjaz.gams@ijs.si http://dis.ijs.si/mezi/matjaz.html Editor Emeritus Anton P. Železnikar Volaričeva 8, Ljubljana, Slovenia s51em@lea.hamradio.si http://lea.hamradio.si/˜s51em/ Executive Associate Editor - Deputy Managing Editor Mitja Luštrek, Jožef Stefan Institute mitja.lustrek@ijs.si Executive Associate Editor - Technical Editor Drago Torkar, Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia Phone: +386 1 4773 900, Fax: +386 1 251 93 85 drago.torkar@ijs.si Contact Associate Editors Europe, Africa: Matjaz Gams N. and S. America: Shahram Rahimi Asia, Australia: Ling Feng Overview papers: Maria Ganzha, Wiesław Pawłowski, Aleksander Denisiuk Editorial Board Juan Carlos Augusto (Argentina) Vladimir Batagelj (Slovenia) Francesco Bergadano (Italy) Marco Botta (Italy) Pavel Brazdil (Portugal) Andrej Brodnik (Slovenia) Ivan Bruha (Canada) Wray Buntine (Finland) Zhihua Cui (China) Aleksander Denisiuk (Poland) Hubert L. Dreyfus (USA) Jozo Dujmović (USA) Johann Eder (Austria) George Eleftherakis (Greece) Ling Feng (China) Vladimir A. Fomichov (Russia) Maria Ganzha (Poland) Sumit Goyal (India) Marjan Gušev (Macedonia) N. 
Jaisankar (India) Dariusz Jacek Jakóbczak (Poland) Dimitris Kanellopoulos (Greece) Samee Ullah Khan (USA) Hiroaki Kitano (Japan) Igor Kononenko (Slovenia) Miroslav Kubat (USA) Ante Lauc (Croatia) Jadran Lenarčič (Slovenia) Shiguo Lian (China) Suzana Loskovska (Macedonia) Ramon L. de Mantaras (Spain) Natividad Martı́nez Madrid (Germany) Sando Martinčić-Ipišić (Croatia) Angelo Montanari (Italy) Pavol Návrat (Slovakia) Jerzy R. Nawrocki (Poland) Nadia Nedjah (Brasil) Franc Novak (Slovenia) Marcin Paprzycki (USA/Poland) Wiesław Pawłowski (Poland) Ivana Podnar Žarko (Croatia) Karl H. Pribram (USA) Luc De Raedt (Belgium) Shahram Rahimi (USA) Dejan Raković (Serbia) Jean Ramaekers (Belgium) Wilhelm Rossak (Germany) Ivan Rozman (Slovenia) Sugata Sanyal (India) Walter Schempp (Germany) Johannes Schwinn (Germany) Zhongzhi Shi (China) Oliviero Stock (Italy) Robert Trappl (Austria) Terry Winograd (USA) Stefan Wrobel (Germany) Konrad Wrona (France) Xindong Wu (USA) Yudong Zhang (China) Rushan Ziatdinov (Russia & Turkey) Informatica 42 (2018) 1–1 1 Introduction to the Special Anniversary Issue on “AI in Slovenia” Slovenian Artificial Intelligence Society (SLAIS) celebrated its 25th anniversary last year. Along with 40th anniversary of the Informatica journal and 20th Information Society conference, the idea of dedicating a special issue to commemorate these events was born. Informatica has long been a publication venue for Slovenian artificial intelligence (AI) research, one of the Information Society subconferences is dedicated to AI, and the best papers from it are traditionally published in the Informatica journal, so commemorating them jointly is very fitting. We decided to invite papers presenting current work of Slovenian AI researchers, as well as position papers providing a (historical) perspective on some AI topic. The global research on AI has had several ups and downs through the history. It started in the 1950s with Alan Turing’s seminal paper "Computing Machinery and Intelligence" and the Dartmouth conference. These events sparked a golden age of discovery, which ended in the so-called first AI winter in the 1970s. However, it was soon succeeded by another boom in the 1980s, spurred by the Japanese fifth-generation computer project. The second AI winter followed, caused by the perception that AI is not fulfilling its promise. After that, we have seen steady progress, and right now we are at another peak of success: AI is applied in all areas of life and business, with examples ranging from self-driving- cars to the games of go and chess. As recorded by Marko Bohanec during his tenure as SLAIS chair, Slovenian AI research started in 1972, at the end of the initial golden age. It began at the Computer Science Department at Jožef Stefan Institute (JSI), and later expanded to the Faculty of Computer and Information Science (FRI) of the University of Ljubljana. Initially, AI research in Slovenia was concerned with heuristic search, including knowledge-based approaches to computer chess. The emphasis then gradually shifted and expanded to the areas of machine learning, knowledge representation, computer-aided multi- attribute decision making, qualitative reasoning and modelling, and combinatorial optimisation. This provided a solid basis for later application projects. In 1982, the development and implementation of AI tools started and soon resulted in numerous practical applications. 
Most of these applications were based on Assistant Professional, an inductive learning system, and DEX, a computer-aided decision making system. Later, the research encompassed practically all major areas of AI – basic and applied – some of which are sampled in the papers composing this special issue. SLAIS was established in 1992, when Slovenian AI research was already quite extensive, as evidenced by having over 60 members one year after establishment. The membership later rose to a peak of 157. SLAIS is a member of the European Association for Artificial Intelligence (EurAI), and three of its members were elected EurAI fellows. SLAIS and Slovenian AI is in most respects firmly embedded in the European and global AI community, mainly through participation in international research projects beginning in 1990 with the first European Framework Programme. We are looking to continue along this path, as well as strengthen the collaboration and sense of community within Slovenia with efforts such as this special issue. The first paper of the special issue is by Igor Kononenko, presenting early research on machine learning in Slovenia starting in 1982. It is followed by a paper of the AI pioneer Ivan Bratko on computer chess – a topic dating back to the very beginning of AI research and recently again brought to prominence by the success of AlphaZero. After these we have a range of papers on various topic representing most major Slovenian groups engaged in AI research. In addition to Igor Kononenko, the Laboratory for Cognitive Modelling at FRI is represented by Marko Robnik-Šikonja with a paper on explaining the predictions of machine-learning models. The AI Laboratory at FRI contributed a paper on argumentation in interactive machine learning by Martin Možina. From the AI Laboratory at JSI we have a paper on semantic annotation of documents by Janez Brank et al. The Department of Intelligent Systems at JSI is represented by two papers on AI applications in the health domain: on continuous blood pressure estimation from PPG signal by Gašper Slapničar et al., and on psychological arousal recognition from physiological signals by Martin Gjoreski et al. The Department of Knowledge Technologies at JSI is represented by three papers: on assessing the quality of feature rankings by Ivica Slavkov et al., on computational creativity by Senja Pollak et al., and on a related topic of creatively blending software components by Matej Martinc et al. Finally, we have a paper on text understanding by Jure Zupan, who does not belong to any established Slovenian AI group. We regret that we have not been able to include any paper from the Bioinformatics Laboratory at FRI or from the University of Maribor, but since we are confident in continued success of AI research in Slovenia, we trust there will be further opportunities for this. Mitja Luštrek Jure Žabkar Marko Grobelnik 2 Informatica 42 (2018) 1–1 M. Luštrek et al. 
Early Machine Learning Research in Ljubljana
Igor Kononenko
University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, Ljubljana, Slovenia
E-mail: igor.kononenko@fri.uni-lj.si
Keywords: machine learning, decision trees, naive Bayesian classifier, ReliefF
Received: October 17, 2017
We describe early machine learning research in Ljubljana, motivated by medical diagnostic problems, in the areas of building decision trees with Assistant, the development of the Naïve and Semi-Naïve Bayesian classifiers and the explanation of their individual predictions, and the development of the ReliefF and RReliefF algorithms for non-myopic evaluation of attributes in classification and regression, respectively.
Povzetek: V članku opišemo zgodnje raziskave na področju strojnega učenja v Ljubljani, ki so bile motivirane z medicinskimi diagnostičnimi problemi. Razvili smo sistem Asistent za gradnjo odločitvenih dreves, naivni in delno naivni Bayesov klasifikator in metodo razlage njunih napovedi ter algoritma ReliefF in RReliefF za nekratkovidno ocenjevanje atributov v klasifikaciji in regresiji.
1 Introduction
As a young researcher, I started my research in machine learning (ML) in 1982 at the University of Ljubljana, in close connection with the Artificial Intelligence (AI) group at the Jožef Stefan Institute in Ljubljana, Slovenia. My supervisor, Prof. Ivan Bratko, suggested that I use Quinlan's (1979) algorithm ID3 for learning medical diagnostic rules. My first data set, obtained from the Ljubljana Institute of Oncology, was a description of 339 patients with known correct locations of the primary tumor in the body, out of 22 possible locations. The diagnostic task was to determine the location of the primary tumor for new patients, given the description of the patients' age, sex, tumor grade, and locations of detected metastases. We tested the classification accuracy of expert physicians, who were able to correctly classify 42% of patients. The performance of ID3 on this hard diagnostic problem was not satisfactory (lower than 40%), so we started to investigate the possible deficiencies of ID3 and to search for methodologies that would circumvent them.
At that time only a few researchers applied ML to medical diagnosis, see (Kononenko, 2001) for an overview. ID3 had been developed in 1979 and had not yet been applied to medical diagnosis, nobody was using Naïve Bayes (Good, 1950; 1964), which was yet to be rediscovered by us and subsequently by the ML community, and more advanced ML approaches, such as multilayered neural networks, support vector machines and random forests, were developed much later. Therefore, building decision trees with ID3 seemed to be a good starting point. Note also that there was no internet at that time and news about scientific developments spread significantly more slowly than nowadays. For example, we became aware of the CART system (Breiman et al. 1984) for building classification and regression trees several years after it was published.
2 Induction of decision trees with Assistant
Our first discovery was that Information gain, used by ID3 to evaluate the quality of attributes, was biased towards overestimating multivalued attributes, so normalization was required. Another observation was that the lower levels of the tree become unreliable due to the small numbers of training examples, so a kind of pruning was needed.
Also, at a certain level of a tree built by ID3, null (empty) leaves could appear, indicating that there were no corresponding training instances for such a leaf; this required a technique for classifying new instances that fall into such a leaf. Yet another problem was that ID3 was not able to deal with missing values of attributes. Introducing an additional value "unknown" for each attribute did not work well, as it led to larger trees and an additional reduction of the number of instances in the leaves. The research resulted in the development of a new decision tree learning algorithm, called Assistant (Kononenko et al., 1984), which reached a classification accuracy of 44% in the primary tumor diagnostic task. The reason for the encouraging results is that (good) ML algorithms can model the probability distributions more accurately than human experts. On the other hand, physicians use additional information about patients which cannot be straightforwardly coded in a form suitable for ML. Therefore, the comparison of prediction performance is biased, as physicians were, for the sake of comparison, constrained to use the same information as the ML algorithms. Our encouraging results motivated other researchers to apply ML in various areas of medical diagnosis, see an overview in (Kononenko, 2001).
The five main contributions of Assistant with respect to ID3 were:
2.1. An ad-hoc normalization of the Information gain – dividing the information gain of an attribute with k possible values by log2 k in order to prevent the overestimation of multivalued attributes. Although it improved the performance, it was ad-hoc. Ross Quinlan, inspired by our research, introduced another normalization – the so-called Gain ratio – in his famous system C4.5 (Quinlan, 1986), while the appropriate normalization of Information gain was introduced into the ML community later with the so-called Distance measure (Mantaras, 1989).
2.2. Using (an ad-hoc) decision tree pruning. We introduced a parameter which indicated how many training instances should be in a leaf in order to allow further subtree building. Later, inspired by our idea, many researchers proposed various pre- and post-pruning techniques; however, all of them introduced one or more parameters for controlling the strength of pruning. For example, our colleague from the Jožef Stefan Institute in Ljubljana, Bojan Cestnik, developed a post-pruning technique based on the m-estimate of probabilities (Cestnik and Bratko, 1991) which uses the parameter m for pruning control. We were looking for parameter-less pruning techniques, yet without success. We needed another ten years to develop a satisfactory decision tree pre-pruning method which required no parameter setting. The method is based on the MDL principle (Li and Vitanyi, 1993), which we first used to develop the MDL attribute evaluation method (Kononenko, 1995). The basic idea is to evaluate how compressive a (discrete) attribute is. The effectiveness of that method depends on the appropriate selection of (optimal) data coding. The same idea was later extended to parameter-less decision tree pre-pruning (Kononenko, 1998). The method evaluates how compressive the subtree is in comparison to a leaf alone (without the subtree). Again, the effectiveness of the method depends on the appropriate coding of the data and the tree structure.
2.3. Classification in combination with the Naïve Bayesian classifier (NB) in the tree leaves.
One version of this idea is to use NB in the empty (null) leaves. This allows us to classify new instances for which no support from the training set exists in the corresponding leaf. The obvious generalization is to use NB in all leaves, allowing the classification process to efficiently use the information of attributes not tested on the path from the root to the leaf. Later, the same idea was used by researchers who developed regression trees, where linear regression can be used in the leaves.
2.4. Building binary decision trees. In order to avoid over-splitting the training data set (and also to overcome the bias of Information gain towards overestimating multivalued attributes), we introduced the binarization of continuous and discrete attributes in order to build binary decision trees. Binary trees proved to be smaller and more accurate, also avoiding the so-called replication problem – the appearance of several identical or similar subtrees in a non-binary decision tree.
2.5. Dealing with incomplete data. We introduced a methodology for dealing with missing attribute values by introducing instance weights which correspond to the (conditional) probabilities that an instance with a missing value has a certain attribute value. The weighted instance then follows all the branches from the current node, each with an appropriate weight. This weighting was generalized to the so-called "don't care" values, where any attribute value is allowed. For such an instance the weight is multiplied by the number of possible values of the attribute with the "don't care" value. The methodology was later adopted as a standard way of dealing with incomplete data in decision tree learning.
Later, a reimplementation of Assistant was developed, called Assistant 86 (Cestnik et al., 1987), which was followed by the commercial system Assistant Professional.
3 Naïve Bayesian classifier
During the development of the Assistant learning algorithm, I intuitively developed a »simple statistical method«, as I called it at that time, and compared its results with decision trees. The surprisingly simple method performed equally well on the primary tumor problem as Assistant did. At that time, however, we claimed that decision trees are preferable due to their "transparency", which does not hold for »statistical methods«. I knew that my »statistical method« was ad-hoc, but I was not able to formally interpret it. With the help of Prof. Bratko we realized that my ad-hoc statistical method was almost the same as the Naïve Bayesian classifier (NB), but lacking the prior probability of the class in the NB formula. (At that time we called it Simple Bayes, and only at the ISSEK Workshop in Bled, Slovenia in 1984, where I presented Assistant for building decision trees for the first time, did Prof. Donald Michie coin the name "Naïve Bayesian classifier" – and later this name was accepted by the ML community.) It turned out that the corrected NB (the "statistical method" upgraded with the prior class probability) was able to significantly outperform Assistant in the primary tumor domain (reaching 50% classification accuracy) as well as on two other medical diagnostic problems (lymphography diagnosis and breast cancer recurrence prediction).
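For reference, the corrected classifier described above is the standard Naïve Bayesian rule. In our notation (an illustration added here, not a formula from the original paper), for a class c and an instance with attribute values v1, ..., vn,

    P(c | v1, ..., vn) ∝ P(c) · P(v1 | c) · … · P(vn | c),

where the probabilities are estimated from the training data; the original »simple statistical method« lacked the leading prior P(c).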
We became motivated to further research NB in relation to decision trees (Kononenko, 1989a), and we developed an explanation method for NB where, for each attribute, the amount of information for or against the class is provided as a sum of information contributions during the classification process (Kononenko, 1989b). The explanation is obtained by changing the probabilities P in the NB formula into information contributions (using −log2 P). Surprisingly, this explanation turned out to be more intuitive and more transparent to physicians, who claimed that they also sum up the evidence for or against the diagnosis.
In 1988 I was listening to an inspiring talk by Prof. Igor Grabec in Ljubljana about artificial neural networks and I decided to do more research in this area. We generalized Hopfield's (1982) discrete model into Bayesian neural networks, where each neuron in the model uses NB (Kononenko, 1989c), and later in my PhD I generalized it into a continuous model. Our generalization of NB to Semi-naïve Bayes (Kononenko, 1991) motivated several researchers to try different approaches to avoid the naivety of NB. At the same time, in his PhD, Bojan Cestnik developed the m-estimate of probabilities, which proved to improve the performance of NB (Cestnik, 1990).
4 ReliefF and RReliefF
In 1992 I attended the ICML conference in Aberdeen, Scotland. The audience was highly impressed by the talk of Prof. Larry Rendell, who described the algorithm RELIEF, developed by his PhD student Kira (Kira and Rendell, 1992). RELIEF is a non-myopic attribute evaluator, i.e. it is able to efficiently evaluate the quality of attributes even if there are strong interactions between attributes. This breakthrough in the field of attribute evaluation led to the development of the ReliefF algorithm (Kononenko, 1994), which was later adopted by the ML community as a standard for evaluating attributes in classification, and many improvements and adaptations of Relief were developed. ReliefF improved RELIEF in three major directions:
1. Dealing with noisy data. RELIEF was sensitive to noise in the data. Instead of searching, for each instance, for one nearest hit (the nearest instance from the same class) and one nearest miss (the nearest instance from the opposite class), ReliefF searches for k nearest hits and k nearest misses, where k is a parameter set by the user (in the same sense as k-NN algorithms deal with noise).
2. Dealing with multiclass problems. RELIEF was designed for two-class problems only. ReliefF generalizes to more than two classes by searching for k nearest misses from each "opposite" class and appropriately weights the contributions of the nearest misses with the prior probabilities of the corresponding classes.
3. Dealing with incomplete data. RELIEF was designed for complete data, without any missing values. While calculating the distances between instances, ReliefF calculates the contributions of attributes with missing values using the conditional probabilities of values given the class.
ReliefF is able to evaluate continuous and discrete attributes for classification. Together with my PhD student Marko Robnik-Šikonja, we developed a regressional version of ReliefF, called RReliefF, which enables the evaluation of the quality of discrete and continuous attributes in regression (Robnik-Šikonja and Kononenko, 1997).
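To make the weight update described in points 1–3 above concrete, the following is a minimal illustrative sketch of the ReliefF update for classification (our own R code added for illustration, not the original implementation): it assumes discrete attributes, complete data, at least k instances per class, and a simple Hamming distance for the nearest-neighbour search.

# Minimal ReliefF sketch for classification: discrete attributes, complete data.
# data: data.frame of attributes; class: factor of class labels;
# k: number of nearest hits/misses; m: number of sampled instances.
relieff_sketch <- function(data, class, k = 10, m = nrow(data)) {
  n <- nrow(data); p <- ncol(data)
  diff_a  <- function(a, i, j) as.numeric(data[i, a] != data[j, a])   # 0/1 difference of attribute a
  dist_ij <- function(i, j) sum(sapply(seq_len(p), function(a) diff_a(a, i, j)))
  priors  <- table(class) / n                                         # prior class probabilities
  W <- setNames(numeric(p), names(data))                              # attribute quality estimates
  for (i in sample.int(n, m)) {
    d <- sapply(seq_len(n), function(j) dist_ij(i, j)); d[i] <- Inf
    hits <- order(ifelse(class == class[i], d, Inf))[1:k]             # k nearest hits (same class)
    for (a in seq_len(p)) {
      W[a] <- W[a] - sum(sapply(hits, function(h) diff_a(a, i, h))) / (m * k)
      for (cl in setdiff(levels(class), as.character(class[i]))) {    # k nearest misses from each other class
        misses <- order(ifelse(class == cl, d, Inf))[1:k]
        w_cl <- priors[cl] / (1 - priors[as.character(class[i])])     # weight misses by class prior
        W[a] <- W[a] + w_cl * sum(sapply(misses, function(j) diff_a(a, i, j))) / (m * k)
      }
    }
  }
  W
}
# Example (hypothetical usage on discretized data):
# relieff_sketch(data.frame(lapply(iris[, 1:4], cut, 3)), iris$Species, k = 5)

Positive weights indicate attributes whose values separate an instance from the nearest instances of other classes more than from those of its own class.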
Note that in regression there are no hits and no misses, as instances do not belong to classes, but rather have real values of regression variable. The basic idea of RReliefF is to use the difference of two instances in regression values to model the “probability that two instances do not belong to the same class”. Together with my PhD student Uroš Pompe, we developed also a variant of Relief which enables the (non-myopic) evaluation of literals in Inductive Logic Programming (ILP) (Pompe and Kononenko, 1998). The basic idea is to make a non-symmetrical evaluation measure, biased towards “positive class”, as in ILP only positive examples should be covered by good literals (only a theory for the positive class is built) and negative examples should not be covered by good literals. 5 Conclusion Our development of ML algorithms was highly motivated by medical diagnostic problems. Our applications started in oncology and later spread to other medical areas, such as prognostics of the femoral neck fracture recovery, rheumatology, diagnosis of lower urinary tract disorders, coronary artery disease, sport injuries etc. The overview of our research of ML for medical diagnosis was described in (Kononenko, 2001), which had a great impact on scientific community. Other, earlier references, with the greatest impact on the ML community, include (Kononenko et al., 1984; Cestnik et al., 1987; Kononenko, 1991; 1994). The unattained goals of our early ML research, a general method for explaining individual predictions in a similar way as the NB’s explanations, and a general method for estimating the reliability of individual predictions of arbitrary prediction models in classification and regression, were achieved by my PhD students: the former goal by Erik Štrumbelj, and the latter goal by the work of Matjaž Kukar, Zoran Bosnić and Darko Pevec (see the overview by Kononenko et al., 2013). 6 References [1] Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984) Classification and Regression Trees, Wadsforth International Group. [2] Cestnik, B., Kononenko, I., Bratko, I. ASSISTANT 86: a knowledge-elicitation tool for sophisticated users. In:Bratko, I., Lavrač, N. Progress in machine learning : proc. of European Working Session on Learning EWSL 87. Sigma Press, 1987, p. 31-45. [3] Cestnik, B. Estimating probabilities. In: Carlucci A. L. (ed.) Proc. ECAI 90. Pitman. 1990, p.147-149. [4] Cestnik, B., Bratko, I. On estimating probabilities in tree pruning. In: Proc. EWSL-91: European working session on learning, Porto, Portugal, March 6-8, 1991, Springer. p.138-150. [5] Good I.J., Probability and the Weighing of Evidence. London: Charles Griffin, 1950. [6] Good I.J., The Estimation of Probabilities - An Essay on Modern Bayesian Methods, Cambridge: The MIT Press, 1964. [7] Hopfield. J. J. Neural networks and physical systems with emergent collective computational abilities. Nat. Academy of Sc., 79:2554–2558, 1982. 6 Informatica 42 (2018) 3–6 I. Kononenko [8] Kira, K. and Rendell, L. A practical approach to feature selection. In D. Sleeman and P. Edwards, eds, Proc. ICML, Aberdeen, UK, 1992, p. 249–256. [9] Kononenko, I. ID3, Sequential Bayes, Naive Bayes and Bayesian Neural Networks. Proc. of European Working Session on Learning EWSL 1989, Montpellier: France, Dec. 4-6, 1989a, p.91-98. [10] Kononenko, I. Interpretation of neural networks decisions, IASTED Int. Conf. Expert systems & apps, Zurich, June 26-29 1989b, pp.224-227. [11] Kononenko, I. 
Bayesian Neural Networks, Biological Cybernetics Journal 61: 361-370, 1989c. [12] Kononenko, I. Semi-naive Bayesian classifier, Proc. of European Working Session on Learning EWSL- 91, Porto, March 4-6 1991, p.206-219. [13] Kononenko, I. Estimating attributes: analysis and extensions of RELIEF. In: Proc. ECML-94, Springer, 1994, p. 171-182. [14] Kononenko, I. On biases in estimating multi-valued attributes. In: Proc. IJCAI-95: Montréal, Canada, August 20-25, 1995. Volume 2, 1995, p. 1034- 1040. [15] Kononenko, I. The minimum description length based decision tree pruning. In Proc. PRICAI '98: Springer, 1998, p. 228-237. [16] Kononenko, I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif. intell. med., 2001, 23(1) 89-109. [17] Kononenko, I., Bratko, I., Roškar, E.: Experiments in automatic learning of medical diagnostic rules, Proc. ISSEK workshop, Bled, august 1984, p. 1-16. [18] Kononenko, I. Štrumbelj, E., Bosnić, Z., Pevec, D., Kukar, M., Robnik Šikonja, M. Explanation and reliability of individual predictions. Informatica (Lj.), 2013, 37(1) 41-48. [19] Li, M. and Vitanyi, P. An Introduction to Kolmogorov Complexity and its Applications. Springer Verlag, 1993. [20] Mantaras. R. L. ID3 revisited: A distance based criterion for attribute selection. Methodologies for Intelligent Systems, Charlotte, U.S.A, 1989. [21] Pompe, U., Kononenko, I. Efficient induction and effective use of first-order knowledge. Appl. artif. intell., 1998, vol. 12, no. 5, p. 421-453. [22] Quinlan J.R. Discovering rules by induction from large collections of examples. Expert systems in the Micro Electronic Age, Edinburgh University, 1979. [23] Quinlan J.R. Induction of Decision Trees. Machine Learning, 1986, 1(1) 81-106. [24] Robnik Šikonja, M., Kononenko, I. An adaptation of RELIEF for attribute estimation in regression. Proc. ICML'97, Nashville, July 8-12, 1997, p.296- 304. Informatica 42 (2018) 7–11 7 AlphaZero – What’s Missing? Ivan Bratko University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, Ljubljana E-mail: bratko@fri.uni-lj.si Keywords: computer game playing, computer chess, machine learning, explainable AI Received: March 8, 2018 In December 2017, the game playing program AlphaZero was reported to have learned in less than 24 hours to play each of the games of chess, Go and shogi better than any human, and better than any other existing specialised computer program for these games. This was achieved just by self-play, without access to any knowledge of these games other than the rules of the game. In this paper we consider some limitations to this spectacular success. The program was trained in well-defined and relatively small domains (admittedly with enormous combinatorial complexity) compared to many real world problems, and it was possible to generate large amounts of learning data through simulated games which is typically not possible in real life domains. When it comes to understanding the games played by AlphaZero, the program’s inability to explain its games and the knowledge acquired in human- understandable terms is a serious limitation. Povzetek: Decembra 2017 so poročali, da se je program AlphaZero v manj kot 24 urah naučil igrati šah, go in shogi bolje, kot katerikoli človek in katerikoli drug računalniški program specializiran za to igro. To je dosegel kar z igranjem s samim seboj, brez dostopa do kakršnegakoli znanja o teh igrah, razen samih pravil igre. 
Vsiljuje se vprašanje, ali obstajajo kakšne omejitve tega neverjetnega podviga. Program se je učil v dobro definiranih in razmeroma enostavnih domenah (čeprav je res, da imajo te igre ogromno kombinatorično zahtevnost) v primerjavi z mnogimi problemi realnega sveta. Za te igre je bilo mogoče s simulacijo generirati ogromne količine učnih podatkov, kar navadno ni možno v domenah iz realnega življenja. Osnovna pomanjkljivost programa AlphaZero je tudi njegova nezmožnost, da bi svoje odigrane partije razložil na človeku razumljiv način.
1 Introduction
In December 2017, an amazing achievement was reported (Silver, Hubert et al. 2017). DeepMind's program AlphaZero was able to learn in less than 24 hours to play each of the games of chess, Go and shogi better than any human, and better than any other existing specialised computer program for these games. This was the third event in DeepMind's success story with game-playing programs with the word Alpha in their names. It started with the famous program AlphaGo (Silver et al. 2016), which convincingly defeated one of the best human Go players in a match of five games. That was the first time ever that a computer program was able to defeat a leading human player at Go. AlphaGo was specialised in Go, and learned from exemplary high-quality games of Go previously played by strong human players. AlphaGo Zero (Silver, Schrittwieser et al. 2017) was able to learn to play Go even better. The impressive difference between AlphaGo and AlphaGo Zero was that the latter learns from games played just against itself, thus without having access to examples of well-played games or any other source of game-specific knowledge, except the bare rules of the game. Finally, AlphaZero is a general game-playing program not specialised to Go, so it can learn to play any game of this kind just by self-play. For example, to get to the strength level of the best human chess players, AlphaZero needed no more than one and a half hours of learning by self-play.
The basic architecture of AlphaZero is as follows. AlphaZero learns by reinforcement learning from simulated games against itself. It uses a deep neural network that learns to estimate the values of positions and the probabilities of playing possible moves in a position. To select a move to play in the current board position, AlphaZero performs Monte Carlo Tree Search (MCTS). This search consists of simulating random games from the current position, in which the probabilities of random moves increase with the move probabilities returned by the neural network, and decrease with the moves' visit counts. The use of MCTS in chess is in contrast to the search used in other strong chess programs. They perform Alpha-Beta search, which before AlphaZero had been considered much more appropriate for chess.
2 An interesting observation about AlphaZero training in chess
To appreciate this achievement, let us consider some illustrative quantitative facts about AlphaZero at chess. As reported by Silver, Hubert et al. (2017), in chess training AlphaZero played about 44 million games against itself in nine hours of self-play. This took 700 thousand "steps" of training. According to the plots of chess rating improvement over the course of training (Silver, Hubert et al., 2017), AlphaZero attained the chess strength of the best human players after about 110 thousand training steps. By that time, AlphaZero had played about 6.9 million games with itself.
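The 6.9 million figure follows directly from the reported totals, assuming the games are spread roughly evenly over the training steps (our back-of-the-envelope check, not a calculation from the original paper): 44 million games over 700 thousand steps is about 63 games per step, and 63 games per step × 110 thousand steps ≈ 6.9 million games.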
Now let us consider some quantitative facts from the human history of chess. ChessBase's Mega Database is a comprehensive collection of chess games played throughout the history of human chess. Mega Database covers nearly all important chess games ever played by humans, so it is well representative of all chess concepts and ideas ever found by human players. The 2018 version of Mega Database contains 7.1 million games, which quite amazingly matches AlphaZero's estimated 6.9 million games needed to reach the best humans' chess strength. Of course it may be argued that this is a mere coincidence. And it can be rightfully observed that this comparison is rather crude: it is not true that the best human players derive their skill from all 7.1 million games. It is certainly not true that all the games in Mega Database are needed to subsume the present chess knowledge of mankind. Therefore Mega Database, viewed as a kind of codification of total human chess knowledge, contains a lot of redundancy. Nevertheless, the numbers do seem to offer a first "feasibility check" of AlphaZero's achievement.
3 Are there any limitations to the AlphaZero approach?
Given that the games of chess, Go and shogi are so difficult for humans, and that AlphaZero made the same progress at chess, say, in 1.5 hours of self-play as mankind did in over a hundred years, this looks impressive indeed. If a problem of such difficulty for humans can be mastered in one and a half hours by a machine using AI techniques, then the impression is that AI can now do everything. But let us consider whether this impression is really so true in general. What are the limitations? Let us look at the problems dealt with by AlphaZero from a little broader perspective.
(1) These games are limited to the board worlds, which amount to 64 squares for chess and 361 squares for Go. True, these small worlds give rise to combinatorial complexities of astronomic proportions. For chess, an old estimate by Claude Shannon (1950) is: there are over 10⁴⁰ possible chess positions, and over 10¹²⁰ possible games. The magnitude of these numbers is popularly illustrated by their comparison to the number of all atoms in the observable universe, which is of the order of 10⁸⁰. The number of possible games of chess is thus incomparably larger than the size of the universe. And, also true, both Go and shogi are in these terms much more complex than chess. On the other hand, the combinatorial complexity of these games is rather deceiving. Compared to many real-world domains studied by biology, chemistry and physics, these games are small.
(2) The rules of these games are simple and known. Therefore almost unlimited experimentation with these games through simulated games is possible. This gives rise to the automatic generation of very large numbers of training instances from which AlphaZero could learn. This is very different from many complex real-world domains in which learning data is collected through time-consuming and expensive experiments, and therefore the amount of training data is much more limited. In contrast to machine learning from big data, in such domains the scarcity of data is often the problem. For example, Wiley et al. (2016) describe reinforcement learning by a tracked robot for which no sufficiently accurate simulation model was available. Therefore, experimentation had to be carried out with the actual physical robot, so the number of trials was severely limited due to time constraints and wear and tear of the robot.
More elaborate methods of machine learning were needed to enable more effective use of the available data. The situation with available data may be even more constrained, as in medicine, where examples of patients with a disease under study can only be "generated by nature". For machine learning to be successful with "small data", different machine learning methods and algorithms are needed. In particular, it is desired that the learning method can make use of domain background knowledge. In this way the lack of data can be compensated by prior knowledge. For example, the learning program may use the laws of physics that are already known prior to learning.
4 Does AlphaZero play chess "more like humans"?
There have been some speculations that AlphaZero is not only by far the strongest chess-playing program, but that it also plays chess in a way that is more similar to the way strong human players play chess. This conjecture is based on a particular comparison between AlphaZero and Stockfish, one of the strongest chess programs before AlphaZero. AlphaZero convincingly defeated Stockfish in a match of 100 games in which AlphaZero won 28 games and drew the rest. The particular point of comparison is the number of positions searched per second by the two programs. Stockfish searched 70 million positions per second, compared with 80,000 by AlphaZero. This was interpreted as indicative of a more human-like style by AlphaZero simply because, in general, computers base their strength on brute-force computational power, which allows them to search deeper. By contrast, humans can typically only search of the order of a few tens of positions per move, or something like a few positions per second. Therefore humans, to compensate for this inferior search ability, have to rely on deeper chess knowledge and intuition. The argument then is that AlphaZero, with about a thousand times lower search speed than Stockfish, must have better chess knowledge to still be able to win. This argument is not completely convincing. In terms of search speed, AlphaZero is still incomparably faster than humans. Another big difference between AlphaZero and the human style of play is in the search method used. AlphaZero uses the Monte Carlo tree search technique, which is based on random simulations of possible games from the current position, and counting favourable outcomes resulting from the moves tried. This is certainly not the way that humans perform their search. On the other hand, moves played as the result of MCTS indeed seem to resemble human players' decisions more than moves played by typical chess engines. In particular, it appears that moves played by AlphaZero better reflect the long-term positional judgement in chess that is attributed to strong human players' deep understanding of the game. We will return to this question in the next section when analysing a surprising positional sacrifice by AlphaZero in one of the games against Stockfish.
5 Examples of super play by AlphaZero
The world of chess was stunned by examples of play by AlphaZero from some of the published games between AlphaZero and Stockfish. Probably the most spectacular example comes from the following game in which AlphaZero had the White pieces. This example was discussed many times in numerous chess media, for example in (Guid 2018). After 18 moves, the position in Fig. 1 occurred in the game, with White to move. Here Black is threatening to capture the White knight on h6 with the queen or the king.
So it seems that the White knight has to retreat to g4, which a reasonable human player would actually do. After that, if both sides played their best moves, the White knight would eventually escape to safety, but Black would come out with a somewhat better position. However, in the position in Fig. 1, AlphaZero incredibly played 19 Rf1-e1, leaving his knight on h6 to be captured by Black. In the game, Stockfish indeed took the knight and appeared to be winning. AlphaZero did have some positional compensation for the knight, but that did not appear to be anything nearly enough for the material disadvantage. But AlphaZero's judgement turned out to be better in the long run. White managed to create threats virtually out of nothing, and 20 moves later managed to achieve a clear advantage. To appreciate the details of all this requires some chess knowledge, so further chess comments are given in the Appendix.
It is very hard to clearly explain that 19 Rf1-e1 was really a good move, and how AlphaZero was able to find this decision. It seems that the combination of AlphaZero's Monte Carlo Tree Search and AlphaZero's move evaluation stored in its neural network somehow resulted in such a deep positional judgement. One possible explanation for this might be as follows. Positional evaluation in chess takes into account static features in the current position. Such features tend not to change quickly during play, so they have long-lasting effects. An example of such a positional feature is weak pawns that cannot move and are hard to defend, and can thus become targets for enemy pieces in the course of the game. Another example is chain formations of blocked pawns that create more space for one of the sides. More space gives one side better chances to manoeuvre its pieces and thus create chances for attack in the long run on the part of the board with the space advantage. However, it usually takes many moves before such positional advantages can be exploited and turned into something more tangible like a material advantage. It may also happen that a positional advantage cannot be exploited at all. In such cases, the positional advantage simply evaporates in the long run. It is very difficult for humans to estimate whether a positional advantage can eventually be converted or not because it is hard to see so far into the future of the game. It is often far enough that this may also be a problem for a typical chess engine that uses Alpha-Beta search. This is where Monte Carlo Tree Search might be much more appropriate because it is more selective and can therefore go much deeper than Alpha-Beta. Of course, for Monte Carlo search to be successful, it has to be well guided by the move probability estimates, which seems to be a major strength of AlphaZero. In the position of Figure 1, the positional advantage of AlphaZero's knight sacrifice was only converted into material gains after twenty moves. This is too deep for Alpha-Beta search, but possible to see by MCTS. Now, although it may seem quite absurd for a human player to carry out the random trials of MCTS, it can be imagined that something roughly similar is actually done by strong human players. When a good player tries to estimate how concrete the consequences of a positional advantage may become, he or she tries to calculate very deeply and to selectively sample variations. Favourable results from these variations will increase the player's confidence in the correctness of a positional sacrifice.
Figure 1: Position after Black's move 18 ... g6-g5. AlphaZero here played the surprising 19 Rf1-e1, leaving White knight on h6 undefended.
6 Can humans understand and learn from AlphaZero?
The chess moves played by AlphaZero in the example above call for an explanation. Ideally, AlphaZero would be able to comment on its games and explain its decisions in human-understandable terms. Humans would then be able to learn new chess concepts and ideas from AlphaZero, enhance their own chess knowledge and be able to use it in their own play. In this respect, the lack of an explanation facility is a serious limitation of the AlphaZero paradigm, and of many forms of machine learning in general. Many of the present successful ML methods that can outperform humans have this same limitation. It is hard for humans, and human experts, to understand what has actually been learned by the program. Ideally, learning programs should be able to explain what they discovered through learning, so that this new knowledge could also be used by humans. This idea has been around in the area of machine learning almost from its beginning, at that time also known under the term "machine synthesis of expert knowledge". This phrase was coined by Donald Michie in the early 1980s, some time before the idea became generally accepted. Donald also set up an international association called ISSEK (International School for the Synthesis of Expert Knowledge). The main activity of ISSEK was a series of workshops in the 1980s and 1990s to enable collaboration among research laboratories interested in developing machine learning techniques for the synthesis of new knowledge from data.
As an attempt at precisely defining these aspects of machine learning, Michie (1988) defined three criteria for machine learning, and it will be useful to repeat them here. Essentially, these criteria were:
(1) Weak criterion: the learning system improves its performance through learning from experience.
(2) Strong criterion: as (1), plus the system can output what it has learned in explicit symbolic form.
(3) Ultra-strong criterion: as (2), plus the explicit symbolic description produced can be used by a human operationally, that is, to improve the human's own performance at solving the task.
By far the strongest attention in machine learning has been devoted to criterion (1), and the imbalance of attention between the three criteria has probably been increasing over time. The importance of criteria (2) and (3), with relevant examples, was discussed for example in (Bratko 1997). There has been recent renewed interest in the latter two criteria within Explainable AI (XAI 2017; Miller 2017). A related issue is the question of the comprehensibility of a description by humans. For a human to be able to use operationally what was learned, the human at least has to understand the result of learning. Therefore, for the ultra-strong criterion to be applicable in practice as a measure of success, a measure of the comprehensibility by a human of a (machine-generated) description is required. Although such a need has often been observed in machine learning, little progress seems to have been made in this respect. (Muggleton et al. 2018) is a rare recent attempt at defining an operational measure of comprehensibility. In terms of Donald Michie's criteria, AlphaZero has been a tremendous success in terms of the weak criterion for machine learning, but no attention seems to have been paid to the other two criteria in the development of AlphaZero.
As a result, AlphaZero has miraculously acquired a lot of new game-specific knowledge, but at the moment it is hidden from humans in a black box. As described by Voosen (2017), a human interested in that knowledge can play a time-consuming game of AI detective to uncover small bits of that knowledge in the box. Fundamental progress in terms of Michie's ultra-strong criterion with AlphaZero, and other similarly influential systems that will appear in the future, will be needed to increase their impact in the important direction of improving human knowledge.
7 Conclusion
In this paper, we considered some limitations of the AI techniques on which AlphaZero is based. These limitations are indicative of some directions for future research in AI. Many games played by AlphaZero are very interesting, and it seems that, at least in chess, AlphaZero has discovered new concepts that human players are not aware of. At the moment, humans can only make guesses about what these new concepts might be. Therefore, the development of explanation techniques, aiming at a human-friendly conceptualisation of the automatically acquired game-playing knowledge, would be very well motivated. Also, improving machine learning methods towards more data-efficient learning would be important for applicability in many real-world domains.
8 Acknowledgements
I would like to thank Matej Guid, Martin Možina and Marjan Šemrl, a former correspondence chess world champion, for discussion.
9 References
[1] I. Bratko, Machine learning: between accuracy and interpretability. In: Machine Learning, Networks and Statistics. Eds. G. Della Riccia, H.-J. Lenz, R. Kruse. Springer, 1997, pp. 163-178.
[2] M. Guid, AlphaZero. Šahovska misel (Chess Thought Magazine), February 2018 (in Slovene).
[3] D. Michie, Machine learning in the next five years. Proc. Third European Working Session on Learning, pages 107–122. Pitman, 1988.
[4] T. Miller, Explanation in AI: Insights from the social sciences. 2017. arXiv:1706.07269.
[5] S.H. Muggleton, U. Schmid, C. Zeller, A. Tamaddoni-Nezhad, and T. Besold. Ultra-strong machine learning - comprehensibility of programs learned with ILP. Machine Learning, 2018. In press.
[6] C. Shannon (1950). Programming a computer for playing chess. Philosophical Magazine, 41 (314).
[7] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[8] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.
[9] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis, Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 2017. arXiv:1712.01815.
[10] T. Wiley, C. Sammut, B. Hengst, I. Bratko, A multi-strategy architecture for on-line learning of robotic behaviours using qualitative reasoning. Advances in Cognitive Systems Journal, 4 (2016), pp. 93-111.
[11] P.
Voosen, How AI detectives are cracking open the black box of deep learning. Science, July 2017. [12] XAI 2017 (Proc. IJCAI-17 Workshop on Explainable AI), 2017. http://www.intelligentrobots.org/files/IJCAI2017/IJ CAI-17_XAI_WS_Proceedings.pdf 10 Appendix: Detailed analysis of the game AlphaZero vs. Stockfish from position of Fig. 1 In position of Fig. 1, White knight is in trouble and it seems that he has to retreat from h6 to g4. This is the only safe square for the knight. The knight is now under attack of Black bishop on c8, but the knight is defended by White queen. However, White’s problem is not completely over because Black can try to chase White queen away from defending the knight on g4. Thus the following continuation is logical: 19 Nh6-g4 b6-b5 (attacking White queen), 20 Qa4-e4 (the only square from which the queen can still defend the knight, but now Black has double attack on White queen and knight with the next pawn move) f7-f5. Fortunately for White, White can check Black king and the following variation is more or less forced: 21 Qe4-e5+ Kg7-f7 22 Qe5xd6 Be7xd6 23 Rf1-d1 Bd6-c7 24 Ng4-e3. Now White knight has survived the trouble, but Black is a pawn up and the position is somewhat better for Black. This variation is also given as the best possibility for White by typical chess programs, and it is what every reasonable human player would do, accepting a worse position as the least possible damage. AlphaZero however very surprisingly played 19 Rf1-e1, leaving the unfortunate knight on h6 under threat. The knight can now be immediately captured by Black king: 19 ... Kg7xh6 which Stockfish actually did in the game. A typical chess program now evaluates the position as considerably better for Black. Black is a whole piece up. True, White can play 20 h2-h4 and Black king will be feeling a little uncomfortable, so White does have some compensation for the sacrificed piece. But is this compensation sufficient? The answer appears to be a clear “no” to practically any human player, as well as any chess program other than AlphaZero. Black has big material advantage, and White seems to have no tangible compensation in return. It is too complex to calculate all the possible continuations to sufficient depth in this position because there are no forced variations clearly favourable to White or Black. So in practice this position can only be evaluated through a kind of “intuitive positional judgement” (in quotes when it refers to a computer). In this case, AlphaZero was in fact capable of such deep positional judgement, something that is extremely difficult for humans, and so far has been considered even harder for machines. In the game, after 19 Kg7xh6, the following moves were played: 20 h2-h4 f7-f6 21 Bc1-e3 Bc8-f5 22 Ra1-d1 Qd6-a3 23 Qa4-c4 b6-b5 24 h4xg5+ f7xg5 25 Qc4-h4+ Kh6-g6 26 Qh4- h1. The position at this point is shown in Fig. 2. White queen now looks very passive in the corner, and thus White, still with a piece down, seems considerably worse. But the prospects of White queen on h1 are actually excellent. The idea is that the queen at h1 supports the move by White bishop g2–e4, and after the exchange of the light coloured bishops, White queen will threaten to enter the center via light squares with great force. This actually happened in the game and 15 moves later White achieved a clear advantage. So the controversial move 19 Rf1-e1 by AlphaZero in position of Fig. 1 turned out to be a brilliant positional sacrifice much admired by the chess world. 
Figure 2: Position after 26 Qh4-h1.
Explanation of Prediction Models with ExplainPrediction
Marko Robnik-Šikonja
University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, 1000 Ljubljana, Slovenia
Email: marko.robnik@fri.uni-lj.si, https://fri.uni-lj.si/en/employees/marko-robnik-sikonja
Keywords: machine learning, comprehensibility of models, explanation of models, perturbation methods
Received: October 31, 2017
State-of-the-art prediction models are getting increasingly complex and incomprehensible for humans. This is problematic for many application areas, especially those where knowledge discovery is just as important as predictive performance, for example medicine or business consulting. As machine learning and artificial intelligence are playing an increasingly large role in society through data-based decision making, this is problematic also from a broader perspective and worries the general public as well as legislators. As a possible solution, several explanation methods have been recently proposed, which can explain predictions of otherwise opaque models. These methods can be divided into two main approaches: gradient-based approaches, limited to neural networks, and more general perturbation-based approaches, which can be used with arbitrary prediction models. We present an overview of perturbation-based approaches, and focus on a recently introduced implementation of two successful methods developed in Slovenia, EXPLAIN and IME. We first describe their working principles and visualizations of explanations, followed by their implementation in the ExplainPrediction package for the R environment.
Povzetek: Najboljši napovedni modeli postajajo vse bolj zapleteni in nerazumljivi za ljudi. To je problematično za številna aplikativna področja, zlasti tista, kjer je odkrivanje znanja enako pomembno kot napovedna točnost, npr. medicina ali poslovno svetovanje. Ker strojno učenje in umetna inteligenca preko na podatkih temelječega odločanja igrata vse večjo vlogo v družbi, je to problematično tudi s širšega vidika in vse bolj skrbi javnost in zakonodajalce. Kot možna rešitev se je v zadnjem času pojavilo več metod razlage za napovedne modele. Te metode lahko razdelimo na dve skupini: na gradientne metode, omejene predvsem na umetne nevronske mreže, in splošnejše pristope na osnovi perturbacij vhodov, ki jih je mogoče uporabiti pri poljubnih napovednih modelih. Predstavljamo pregled perturbacijskih pristopov in dve uspešni metodi razviti v Sloveniji, EXPLAIN in IME. Najprej opišemo njuno delovanje in vizualizacije razlag, nato pa še implementacijo v paketu ExplainPrediction za okolje R.
1 Introduction
Machine learning methods and especially prediction models are becoming an essential ingredient of many modern products and services. Through a paradigm of data-based decisions, they impact mundane everyday tasks (e.g., shopping and entertainment recommendations), as well as life-changing decisions (e.g., medical diagnostics, credit scoring, or security systems). As societies are getting more and more complex, we can expect that their reliance on automatic decisions will increase. It is natural that those affected by various decisions of prediction models want to get feedback and understand the reasoning process and biases of the underlying models.
The impact and influence of automatic decisions are getting so ubiquitous that the whole area of artificial intelligence is receiving an increa- sing attention from lawmakers who demand that decisions of important models are made transparent. Besides public services, the areas where models’ transparency is of cru- cial importance are for example medicine, science, policy making, strategic planning, business intelligence, finance, marketing, law, and insurance. In these areas, users of mo- dels are just as interested in comprehending the decision process, as in the classification accuracy of prediction mo- dels. Unfortunately, most of the top performing machine learning models are black boxes in a sense that they do not offer an intrinsic introspection into their decision proces- ses or provide explanations of their predictions and biases. This is true for Artificial Neural Networks (ANN), Support Vector Machines (SVM), and all ensemble methods (for example, boosting, random forests, bagging, stacking, and multivariate adaptive regression splines). Approaches that do offer an intrinsic introspection such as decision trees or decision rules do not perform so well or are not applicable in many cases (17). To alleviate this problem two types of model explana- tion techniques have been proposed. The first type, which is not discussed in this work, is based on the internal wor- king of the particular learning algorithm. The explanation methods exploit model’s representation or learning process to gain insight into the presumptions, biases, and reasoning leading to final decisions. Two well-known models where 14 Informatica 42 (2018) 13–22 M. Robnik-Šikonja such approach works well are neural networks and random forests. Recent explanations for neural networks classifiers of images mostly rely on layer-wise relevance propagation (3) or gradients of output neurons with respect to the in- put (28) to visualize parts of images significant for parti- cular prediction. The random forest visualizations mostly exploit the fact that during bootstrap sampling, which is part of this learning algorithm, some of the instances are not selected for learning and can serve as an internal vali- dation set. With the help of this set important features can be identified and similarity between objects can be measu- red. The second type of explanation approaches are general and can be applied to any predictive model. The explana- tions provided by these approaches try to efficiently cap- ture the causal relationship between inputs and outputs of the given model. To this end, they perturb the inputs in the neighborhood of given instance to observe effects of perturbations on model’s output. Changes in the outputs are attributed to perturbed inputs and used to estimate their importance for a particular instance. Examples of this ap- proach are methods EXPLAIN (24), IME (31), and LIME (21). These methods can explain models’ decision process for each individual predicted instance as well as for the mo- del as a whole. We implemented the methods, proposed by our group, EXPLAIN and IME, in R package ExplainPre- diction (23). The objectives of the paper are twofold. First, we explain how general perturbation-based explanation methods work and second, we describe the implementation details, para- meters, and visualization of the ExplainPrediction package which implements them. The first aim is achieved through an explanation of their working principle and graphical ex- planation of models’ decisions on a well-known data set. 
The second aim is no less important. In machine lear- ning, open source implementations enable progress, empi- rical comparisons, and replicability of research. Two types of explanations are implemented in ExplainPrediction and demonstrated in the paper, individual predictions of new unlabeled cases and functioning of the model as a whole. This allows inspection, comparison, and visualization of otherwise opaque models. The structure of the remainder of the paper is as follows. In Section 2, we present the background and related work on perturbation based explanation approaches. In Section 3, we present methods EXPLAIN and IME. Their imple- mentation, parameters, and use with the ExplainPrediction package are covered in Section 4. In Section 5, we present conclusions and promising research directions. 2 Background and overview of perturbation approaches We first present different modes of explanation and pro- perties of model explanation approaches, followed by an overview of explanation approaches. True causal relationships between dependent and inde- pendent variables are typically hidden except in artificial domains where all the relations, as well as the probability distributions, are known in advance. Therefore only ex- planations of prediction process for a particular model is of practical importance. The prediction accuracy and the correctness of explanation for a given model may be ortho- gonal: the correctness of the explanation is independent of the correctness of the prediction. However, empirical ob- servations show that better models (with higher prediction accuracy) enable better explanations (31). We discuss two types of explanations: – Instance explanation explains predictions with the given model of a single instance and provides the im- pact of input feature values on the prediction. – Model explanation is usually an aggregation of in- stance explanations over many (training) instances, to provide top-level explanations of features and their va- lues. This aggregation over many instances enables identification of different roles attributes may play in the classifications of instances. In a typical data science problem setting, users are con- cerned with both prediction accuracy and the interpretabi- lity of the prediction model. Complex models have po- tentially higher accuracy but are more difficult to interpret. This can be alleviated either by sacrificing some prediction accuracy for a more transparent model or by using an ex- planation method that improves the interpretability of the model. Explaining predictions is straightforward for sym- bolic models such as decision trees, decision rules, and in- ductive logic programming, where the models give an over- all transparent knowledge in a symbolic form. Therefore, to obtain the explanations of predictions, one simply has to read the rules in the corresponding model. Whether such an explanation is comprehensive in the case of large trees and rule sets is questionable. Piltaver et al. (18) developed criteria for comprehensibility of decision trees and perfor- med a user study, which showed that the depth of the dee- pest leaf that is required when answering a question about a classification tree is the most important factor influencing the comprehensibility. For non-symbolic models, there are no intrinsic expla- nations. A lot of efforts have been invested in increasing the interpretability of complex models. 
For SVM, Hamel (12) proposed an approach based on self-organizing maps that groups instances then projects the groups onto a two- dimensional plane. In this plane, the topology of the groups is hopefully preserved and support vectors can be visua- lized. Many approaches exploit the essential property of additive classifiers to provide more comprehensible expla- nations and visualizations, e.g., (14) and (19). Visualization of decision boundaries is an important as- pect of model transparency. Barbosa et al. (6) present a technique to visualize how the kernel embeds data into a high-dimensional feature space. With their Kelp met- hod, they visualize how kernel choice affects neighborhood Explanation of Prediction Models with. . . Informatica 42 (2018) 13–22 15 structure and SVM decision boundaries. Schulz et al. (27) propose a general framework for visualization of classifiers via dimensionality reduction. Goldstein et al. (11) propose another useful visualization tool for classifiers that can pro- duce individual conditional expectation plots, graphing the functional relationship between the predicted response and the feature for individual instance. Some explanations methods (including the ones presen- ted in Section 3) are general in a sense that they can be used with any type of classification model (15; 21; 24; 30). This enables their application with almost any prediction model and allows users to analyze and compare outputs of different analytical techniques. Lemaire et al. (15) ap- plied their method to a customer relationship management system in the telecommunications industry. The method which successfully deals with high-dimensional text data is presented in (16). Its idea is based on general explana- tion methods EXPLAIN and IME and offers an explanation in the form of a set of words which would change the pre- dicted class of a given document. Bosnić et al. (9) adapt the general explanation methodology to data stream scenario and show the evolution of attribute contributions through time. This is used to explain the concept drift in their in- cremental model. In a real-life breast cancer recurrence prediction, Štrumbelj et al. (29) illustrate the usefulness of the visualizations and the advantage of using the general explanation method. Several machine learning algorithms were evaluated. Predictions were enhanced with instance explanations using the IME method. Visual inspection and evaluation showed that oncologists found the explanations useful and agreed with the computed contributions of fea- tures. Pregeljc et al. (20) used traditional modeling approa- ches together with data mining to gain insight into the con- nections between the quality of organization in enterprises and the enterprises’ performance. The best performing mo- dels were complex and difficult to interpret, especially for non-technical users. Methods EXPLAIN and IME explai- ned the influence of input features on the predicted econo- mic results and provided insights with a meaningful eco- nomic interpretation. The interesting economic relations- hips and successful predictions come mostly from complex models such as random forests and ANN. Without proper explanation and visualization, these models are often neg- lected in favor of weaker, but more transparent models. 
Ex- perts from the economic-organizational field, which revie- wed and interpreted the results of the study, agreed that such an explanation and visualization is useful and faci- litates comparative analysis across different types of pre- diction models. Bohanec et al. (7) present an innovative application of explanation methods EXPLAIN and IME in the context of B2B sales forecasting. They demonstrate how users can validate their assumptions with the presented explanations and test their hypotheses using the explanati- ons for a sort of what-if analysis. Bohanec et al. (8) address the problem of weak acceptance of machine learning mo- dels in business environments. The propose a framework of top-performing machine learning models coupled with general explanation methods to provide an additional in- formation to the decision-making process. This is shown to reduce error, efficiently support business decision ma- kers and builds a foundation for sustainable organizational learning. Demšar and Bosnić (10) use the general explana- tion methods EXPLAIN and IME to detect concept drift in data streams. Due to the generality of explanations, their drift detector can be combined with an arbitrary classifica- tion algorithm and features good drift detection, accuracy, robustness, and sensitivity. Many explanation methods are related to statistical sen- sitivity analysis and uncertainty analysis (26). In that met- hodology sensitivity of models is analyzed with respect to models’ input. A related approach, called inverse classifi- cation (1) tries to determine the minimum required change to a data point in order to reclassify it as a member of a dif- ferent class. An SVM model-based approach is proposed by Barbella et al. (5). Another sensitivity analysis-based approach explains contributions of individual features to a particular classification by observing (partial) derivatives of the classifiers prediction function at the point of interest (4). A limitation of this approach is that the classification function has to be first-order differentiable. For classifiers not satisfying this criterion (for example, decision trees) the original classifier is first fitted with a Parzen window-based classifier that mimics the original one and then the explana- tion method is applied to this fitted classifier. The method is practically useful with kernel-based classification met- hod to predict molecular features (13). Due to recent successes of deep neural networks in image recognition and natural language processing, several explanation methods specific to these two application areas emerged, recently. Methods working on images try to vi- sualize parts of images (i.e., groups of pixels) significant for a particular prediction. These methods mostly rely on the propagation of relevance within the network. For ex- ample, layer-wise relevance propagation (3), and computa- tion of gradients of output neurons with respect to the input (28). In language processing Arras et al. (2) applied layer- wise relevance propagation to a convolutional neural net- work and a bag-of-words SVM classifier trained on a topic categorization task. The explanations indicate how much individual words contribute to the overall classification de- cision. 3 Methods EXPLAIN and IME General explanation methods can be applied to any clas- sification model which makes them a useful tool both for interpreting models (and their predictions) and comparing different types of models. 
By modification of feature values of interest, what-if analysis is also supported. Such methods cannot exploit any model-specific properties (e.g., gradients in ANN) and are limited to perturbing the inputs of the model and observing changes in the model's output (15; 24; 30).

The presented general explanation methods provide two types of explanations for prediction models: instance explanations and model explanations (see Section 2). Model explanations work by summarizing a representative sample of instance explanations. The presented methods estimate the impact of a particular explanation feature for a given instance by perturbing similar instances.

The key idea of EXPLAIN and IME is that the contribution of a particular input value (or set of values) can be captured by "hiding" the input value (or set of values) and observing how the output of the model changes. As such, the key component of general explanation methods is the expected conditional prediction – the prediction where only a subset of the input variables is known. Let Q be a subset of the set of input variables, Q ⊆ S = {X_1, ..., X_a}. Let p_Q(y_k|x) be the expected prediction for x, conditional on knowing only the input variables represented in Q:

p_Q(y_k|x) = E( p(y_k) | X_i = x(i), ∀ X_i ∈ Q ).   (1)

Therefore, p_S(y_k|x) = p(y_k|x). In practical settings, the classification function of the model is not known – one can only access its prediction for any vector of input values. Therefore, exact computation of this equation is not possible and sampling-based approximations are used.

To produce model explanations we sum instance-level explanations. The evidence for and against each class is collected and visualized separately. In this way, one can, for example, see that a particular value of an attribute supports a specific class, but not in every context.

3.1 EXPLAIN, one-variable-at-a-time approach

The EXPLAIN method computes the influence of a feature value by observing its impact on the model's output. EXPLAIN assumes that the larger the changes in the output, the more important a role the feature value plays in the model. The shortcoming of this approach is that it takes into account only a single feature at a time, so it cannot detect certain higher-order dependencies (in particular disjunctions) and redundancies in the model. EXPLAIN takes as the characterization of the i-th input variable's importance for the prediction of the instance x the difference between the model's prediction for that instance and the model's prediction if the value of the i-th variable were not known. The source of explanations is therefore:

p(y_k|x) − p_{S\{i}}(y_k|x).   (2)

If this difference is large, the i-th variable is important; if it is small, the variable is less important. The sign of the difference reveals whether the value contributes towards or against the class value y_k. This approach was extended in (24) to use log-odds ratios (or weight of evidence) or information gain instead of the difference in predicted class probabilities.

The lack of information about A_i in p_{S\{i}}(y_k|x) is approximated with several predictions. For nominal attributes, we replace the actual value of A_i in each prediction with each of the possible values of attribute A_i, and weight each prediction with the prior probability of the value:

p(y_k | x \ A_i) = Σ_{s=1}^{m_i} p(A_i = a_s) p(y_k | x ← A_i = a_s).   (3)
Here m_i is the number of nominal values of attribute A_i, and the term p(y_k | x ← A_i = a_s) represents the predicted probability of y_k when in instance x we replace the actual value of A_i with a_s. For numerical attributes, we use discretization to split the values of the numerical attribute A_i into intervals. The middle points of these intervals are taken as the representative replacement values in Eq. (3), for which we compute the predictions p(y_k | x ← A_i = a_s). Instead of the prior probabilities of individual values p(A_i = a_s), we use the probabilities of the intervals for weighting the predictions.

To demonstrate the behavior of the method, an example of an explanation is given. Let a binary domain contain three important attributes (A_1, A_2, and A_3) and one irrelevant attribute (A_4), so the set of attributes is S = {1, 2, 3, 4}. The class variable C is expressed as the parity (xor) relation of three attributes, C = A_1 ⊕ A_2 ⊕ A_3. Let us assume that we trained a perfect model for this problem. Our correct model classifies an instance x = (A_1 = 1, A_2 = 0, A_3 = 1, A_4 = 1) to class C = 0 and assigns it the probability p(C = 0|x) = 1. When explaining the classification of this particular instance, method EXPLAIN simulates the lack of knowledge of a single attribute at a time, so it has to estimate p_{S\{1}}(C = 0|x), p_{S\{2}}(C = 0|x), p_{S\{3}}(C = 0|x), and p_{S\{4}}(C = 0|x). Without knowledge of the value of any one of the attributes A_1, A_2, and A_3, the model cannot correctly determine the class value, so the correct estimates of the class probabilities are p_{S\{1}}(C = 0|x) = p_{S\{2}}(C = 0|x) = p_{S\{3}}(C = 0|x) = 0.5. The differences of probabilities p_S(y_k|x) − p_{S\{i}}(y_k|x) therefore equal 0.5 for each of the three important attributes, which indicates that these attributes have a positive impact on the classification to class 0 for the particular instance x. The irrelevant attribute A_4 does not influence the classification, so the classification probability remains unchanged, p_{S\{4}}(C = 0|x) = 1. The difference of probabilities p_S(C = 0|x) − p_{S\{4}}(C = 0|x) = 0, so the explanation of the irrelevant attribute's impact is zero.

In reality, trained models are rarely perfect, so the probabilities used in Eq. (2) contain a certain amount of error, which translates into an error of the explanations. The empirical evaluation (24) has shown that better models produce better explanations.

3.2 IME, all-subsets approach

The one-variable-at-a-time approach is simple and computationally less intensive, but it has some disadvantages. The main disadvantage is that disjunctive concepts or redundancies between input variables may result in unintuitive contributions for variables (31). A solution was proposed in (30), where all subsets of values are observed. Such a procedure demands 2^a steps, where a is the number of attributes, and results in exponential time complexity. However, the contribution of a variable corresponds to the Shapley value for the coalitional game of a players, which allows an efficient approximation based on sampling. As sampling takes the values for attribute A_i from the existing set of values, no approximation similar to Eq. (3) is needed for numerical attributes in IME.

3.3 Presenting explanations

The working and practical utility of the one-variable-at-a-time contributions and their visualization are illustrated on the well-known Titanic data set.
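Before turning to the Titanic example, the computation in Eq. (2)–(3) can be illustrated with a short R fragment. The function below is only a sketch for exposition; its name and arguments are hypothetical and it is not part of the ExplainPrediction API. It assumes a prediction wrapper predictProb that returns the probability of the selected class for a one-row data frame, and it handles a single nominal attribute.

# Sketch of the EXPLAIN contribution of one nominal attribute (Eq. 2-3).
# predictProb: function(oneRowDataFrame) returning p(y_k | x) for the chosen class
# instance:    a one-row data frame with the same columns/factor levels as trainData
# attr:        name of the nominal attribute A_i to be explained
# trainData:   data used to estimate the prior probabilities p(A_i = a_s)
explainOneAttr <- function(predictProb, instance, attr, trainData) {
  pFull <- predictProb(instance)                 # p(y_k | x): all values known
  vals  <- levels(factor(trainData[[attr]]))     # possible values a_s of A_i
  prior <- prop.table(table(trainData[[attr]]))  # prior probabilities p(A_i = a_s)
  pMarg <- 0                                     # Eq. (3): weighted average over
  for (v in vals) {                              # replacements of the actual value
    x <- instance
    x[[attr]] <- factor(v, levels = vals)
    pMarg <- pMarg + prior[[v]] * predictProb(x)
  }
  pFull - pMarg                                  # Eq. (2): the explanation
}

On the perfect xor model described above, this difference is 0.5 for each of the three relevant attributes and 0 for the irrelevant one, matching the explanation derived in the text. Numerical attributes would additionally require the discretization step described after Eq. (3).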
The task is to classify the survival of a passenger in the disaster of the Titanic ship. The three input variables report the passengers’ status du- ring travel (first, second, third class, or crew), age (adult or child), and gender. Note the similarity of the problem with many business decision problems, such as churn pre- diction, mail response, insurance fraud, etc. As an exam- ple, random forest (rf) classifier is used. This model is ro- bust and usually provides good prediction accuracy but it is incomprehensible. The Fig. 1a shows an example of an instance explanation for the prediction of the instance with id 2 (a first class adult male passenger). The text at the top includes the class value in question, the instance id, and the type of model. At the bottom, the description contains the type of explanation technique used, the model’s prediction for the selected class value, and the actual correct class va- lue for the instance. The input variables’ names are shown on the left-hand side and their values for the particular in- stance are on the right-hand side. The thick dark shaded bars indicate the contributions of the instance’s values for each corresponding input variable towards (or against) the class value “survived=yes”. The thinner and lighter bars above indicate average contributions of these values across all instances. For the given instance one can observe that “sex=male” speaks against survival and “status=first class” speaks in favor of survival while being an adult has little in- fluence. Thinner average bars above them reveal that being male can be both favorable and dangerous while being in the first class is on average even more beneficial than in the selected case. Note that the same visualization can be used even if some other classification method is applied. A more general view of the model is provided by averaging the explanations for the training data and their visualiza- tion in a summary form, which shows the average impor- tance of each input variable and its values. For numerical attributes, explanations for intervals of values are shown; to get sensible intervals, numerical attributes are discreti- zed. An example of such a model explanation for Titanic data set is presented in Fig. 1b. On the left-hand side, the input variables and their values are shown. For each value, the average negative and the average positive contributions across all instances is displayed. Note that negative and positive contributions would cancel each other out if sum- med together, so it is important to keep them separate. The lighter bars shown are equivalent to the lighter bars in the instance explanation on Fig. 1a. For each input variable, the average positive and negative contributions for all va- lues and instances are shown (darker bars). The visualiza- tion reveals that the sex has the strongest effect in random forest model. Traveling in the first or second class has a predominantly positive contribution towards the survival, being a child or female has greater positive than negative contribution, while traveling in the third class has a nega- tive contribution. 4 Implementation in R package ExplainPrediction The methods EXPLAIN and IME are implemented in the R package ExplainPrediction. The top level entry is the explainVis function which explains predictions of a given model and visualizes the explanations. In this section, we explain the parameters of explainVis and show how to call it. We also share some useful tips for using the explanati- ons. 
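The IME method of Section 3.2 replaces the one-variable-at-a-time perturbation of EXPLAIN with a sampling approximation of Shapley values over subsets of attributes. A rough sketch of one such sampling estimate is given below; it is meant only to clarify what the IME-specific parameters described in Section 4.1 (pError, err, batchSize, maxIter) control. The function name, its arguments, and the fixed number of samples are illustrative, not the package's actual implementation, which chooses the number of samples from pError and err and processes them in batches.

# Rough sketch of the sampling approximation behind IME-style Shapley contributions.
# predictProb:  function(oneRowDataFrame) returning p(y_k | x) for the chosen class
# featureNames: names of the input attributes (excluding the class column)
imeOneAttr <- function(predictProb, instance, attr, featureNames, trainData,
                       nSamples = 100) {
  total <- 0
  for (s in seq_len(nSamples)) {
    perm   <- sample(featureNames)                     # random attribute order
    pos    <- which(perm == attr)
    before <- if (pos > 1) perm[1:(pos - 1)] else character(0)
    # a randomly drawn training instance supplies the "unknown" attribute values
    ref <- trainData[sample(nrow(trainData), 1), featureNames, drop = FALSE]
    withA <- ref
    withA[c(before, attr)] <- instance[c(before, attr)]
    withoutA <- ref
    if (length(before) > 0) withoutA[before] <- instance[before]
    total <- total + (predictProb(withA) - predictProb(withoutA))
  }
  total / nSamples                                     # estimated contribution of attr
}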
4.1 Controlling explanations

The function explainVis enables fine control over the computation and visualization of explanations through its arguments, listed in Listing 1 and explained below.

Parameters controlling input/output
model specifies the input prediction model.
trainData is the input data set which is used to compute average explanations, discretization, and other information needed for the explanation of instances and of the model.
testData is the input data set containing the instances that are going to be explained.
fileType determines the graphical format of the output visualization file: pdf, eps, emf, jpg, png, bmp, or tif. If "none" is specified, the visualization is directed to a graphical window.
dirName specifies the output folder where the resulting visualization files will be saved.
fileName contains the file name of the resulting output visualization files.

The parameters of both explanation methods
method specifies the explanation method, either EXPLAIN or IME.
classValue specifies the class value for which explanations are generated.
visLevel determines the level of explanations desired, i.e. model level or instance level explanations.
estimator specifies the feature evaluation method used to greedily discretize attributes, which is needed when averaging explanations over intervals of numeric attributes. The default value NULL invokes discretization with the attribute evaluation algorithms ReliefF (classification) or RReliefF (regression) from the R package CORElearn (25).
recall can provide the list with all explanation data returned by one of the previous calls to the function explainVis, which speeds up the computations.

Figure 1: An instance explanation a) and a model explanation b) for the random forest model classifying the Titanic data set. The tiny bars in the instance explanation represent the average positive and average negative contributions of the values and are equal to the corresponding value-bars in the model explanation (note the difference in scale).

Parameters specific to EXPLAIN (see (24))
explainType specifies for the EXPLAIN method how the prediction with knowledge about a given feature and the prediction without knowledge of this feature are combined into the final explanation. The values "WE", "infGain", and "predDiff" mean that the difference is interpreted as the weight of evidence, the information gain, or the plain difference of predictions, respectively. For regression problems only the difference of predictions is available.
naMode specifies for the EXPLAIN method how the impact of missing information about a certain feature value is estimated. It can be estimated by the weighted average of predictions over all possible values of the feature, or by inserting the NA value as the feature value.
nLaplace specifies, for the EXPLAIN method and classification problems, the value to be used in the Laplace correction of the estimated probabilities.

Parameters specific to IME (see (30))
pError specifies for the IME method the estimated probability of an error in explanations. Together with the err parameter, this determines the number of needed samples.
batchSize specifies for the IME method the number of samples processed for each explanation in one batch. To reduce the processing overhead of calls to the predict function, we process several samples at once. This strategy reduces the overhead but may process more samples than required.
maxIter sets the maximal number of iterations allowed in the IME method for a single explanation.

Listing 1: Top level call to the explanation methods and their visualization.

explainVis(model, trainData, testData, method=c("EXPLAIN", "IME"), classValue=1,
    fileType=c("none","pdf","eps","emf","jpg","png","bmp","tif","tiff"),
    dirName=getwd(), fileName="explainVis",
    visLevel=c("both","model","instance"),
    explainType=c("WE","infGain","predDiff"), naMode=c("avg", "na"),
    nLaplace=nrow(trainData), estimator=NULL,
    pError=0.05, err=0.05, batchSize=40, maxIter=1000,
    genType=c("rf", "rbf", "indAttr"), noAvgBins=20,
    displayAttributes=NULL, modelVisCompact=FALSE, displayThreshold=0.0,
    colors=c("navyblue", "darkred", "blue", "red", "lightblue", "orange"),
    normalizeTo=0, noDecimalsInValueName=2, recall=NULL,
    modelTitle=ifelse(model$noClasses==0, "Explaining %R\nmodel: %M",
                      "Explaining %R=%V\nmodel: %M"),
    modelSubtitle="Method: %E, type: %X",
    instanceTitle=ifelse(model$noClasses==0,
                      "Explaining %R\ninstance: %I, model: %M",
                      "Explaining %R=%V\ninstance: %I, model: %M"),
    instanceSubtitle=ifelse(model$noClasses==0,
                      "Method: %E\nf(%I)=%P, true %R=%T",
                      "Method: %E, type: %X\nP(%R=%V)=%P, true %R=%T") )

genType specifies the type of data generator used to generate the random part of instances in the IME method. The generators from the R package semiArtificial (22) are used: "rf" stands for the random forest based generator, "rbf" invokes the RBF network based generator, and "indAttr" assumes independent attributes and generates values for each attribute independently.
noAvgBins specifies for the IME method the number of discretization bins used to present model level explanations and average explanations.

Visualization parameters:
displayAttributes is the vector of attribute names which are visualized in the model level visualization.
modelVisCompact determines whether attribute values are displayed in the model level visualization.
displayThreshold specifies the threshold for the absolute values of explanations below which feature contributions are not displayed in instance and model explanation graphs.
normalizeTo determines the value for instance level visualization to which the sum of the absolute values of the feature contributions is normalized (e.g., 1 or 100).
colors determines the colors used in the visualization.
noDecimalsInValueName specifies how many decimal places numeric feature values use in visualizations.
modelTitle, modelSubtitle, instanceTitle, and instanceSubtitle are string templates for the titles of the different graphs. The templates use several variables, which are inserted at the appropriate place: the response variable %R, the selected class value for explanation %V, the type of model %M, the explanation method %E, the explanation type %X, the instance name %I, the predicted value/probability of the response %P, and the true value of the response %T.

4.2 Producing explanations

The function explainVis generates explanations and their visualizations given the trained model, its training data, and the data for which we want explanations. This is the front-end explanation function which takes care of everything, internally calling other functions.
The produced visualizations are output to a graphical device or saved to a file. The function returns a list with the explanations, the average explanations, and additional data such as the discretization used and the data generator. An example of a call is presented in Listing 2. The explanations support several models implemented in the packages CORElearn, randomForest, nnet, and e1071. Adding support for new predictors is easy and involves preparing the class names and class values in the format expected by the package ExplainPrediction when calling the predictor. This is demonstrated in the function wrap4Explanation, which is part of the ExplainPrediction package.

Listing 2: A code that generates explanations of a model and of instances.

require(ExplainPrediction)
require(CORElearn)
# use iris data set, split it randomly into a training and testing set
trainIdxs <- sample(x=nrow(iris), size=0.7*nrow(iris), replace=FALSE)
testIdxs <- c(1:nrow(iris))[-trainIdxs]
# build random forests model with certain parameters
modelRF <- CoreModel(Species ~ ., iris[trainIdxs,], model="rf", rfNoTrees=100,
    selectionEstimator="MDL", minNodeWeightRF=5)
# generate model and instance explanations and visualize them in a graphical window
explainVis(modelRF, iris[trainIdxs,], iris[testIdxs,], method="EXPLAIN",
    fileType="none", naMode="avg", explainType="WE", classValue=1)

4.3 Tips for using the explanations

The presented explanation techniques have many successful applications (briefly reviewed in Section 2). Here we present a few tips for successful practical use of the explanations.

For many real-world problems, gaining the trust of users is essential to assure successful application of machine learning models. Instance and model explanations can serve as convenient ice-breakers. If a user can check for some instances that the generated explanations match his/her understanding of the problem, this greatly increases the chances of success and is more convincing than reporting high classification accuracy. This is true even for mispredicted instances, as long as the model's reasoning is sensible to users.

For larger data sets with many attributes, the time needed to produce explanations with the IME method can be substantial. However, in spite of the theoretical advantages of IME over EXPLAIN, in practice these two methods mostly produce similar explanations. This indicates that in real-world problems there are few redundant attributes and even fewer redundant attributes of exactly the same strength (if redundant attributes are not of the same strength, learning selects the stronger ones and there is no redundancy in the model). In practice, we can compare the behavior of EXPLAIN and IME on a subsample of instances and attributes. If the explanations are similar, the EXPLAIN method can be used instead of IME.

Reaching a desired graphical design (e.g., colors and headings) and showing only the most impactful attributes requires some tweaking of the visualization parameters. To avoid regenerating the explanations for each user interaction with them, we provide the recall parameter. In the first call to the explainVis function, we have to store the invisibly returned list in a variable and supply this variable as the value of the parameter recall in subsequent calls to explainVis.
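A minimal sketch of this pattern, reusing the objects from Listing 2 (the exact arguments shown are illustrative):

# First call: compute the explanations and keep the invisibly returned list
expl <- explainVis(modelRF, iris[trainIdxs,], iris[testIdxs,],
                   method="EXPLAIN", visLevel="both", fileType="none")
# Later calls: pass the stored list as 'recall' and change only how the
# already computed explanations are displayed (here: save model level plots)
explainVis(modelRF, iris[trainIdxs,], iris[testIdxs,], recall=expl,
           visLevel="model", fileType="pdf", fileName="irisExplanation")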
In this case the function reuses already com- puted explanations, average explanations, discretization, etc., and only display data differently according to sup- plied input/output and visualization parameters (visLevel, dirName, fileType, displayAttributes, modelVisCompact, displayThreshold, normalizeTo, colors, noDecimalsInVa- lueName, modelTitle, modelSubtitle, instanceTitle, and in- stanceSubtitle). Using this hint can make user interactions with explanations instantaneous even for large data sets. 5 Conclusions We presented two general methods for explanation of pre- diction models and their implementation in the ExplainPre- diction package. The methods allow explanation of indivi- dual decisions as well as the prediction model as a whole. The explanations provide information on how the indivi- dual input variables influence the outcome of prediction models, thus improving their transparency and comprehen- sibility. The general methods allow users to compare diffe- rent types of models or replace their existing model without having to replace the explanation method. The explanation methods can be efficiently computed and visualized, and their implementation offers several parameters that cont- rol the speed and precision of the computed explanations, convergence rate and visualization of explanations. Several models are supported and adding support in almost any pre- diction model is easy. The simplicity and elegance of the perturbation based explanations coupled with efficient implementations and visualization of instance- and model-based explanations al- low application of general explanation approaches to new areas. We expect that broader practical use will spur ad- ditional research into explanation mechanisms and impro- vements in the visual design of explanations. There are also many possibilities for methodological improvements. An idea worth pursuing seems integration of game theory based sampling and formulation of explanations as an opti- mization problem. The implementation of IME in the Ex- plainPrediction package could be improved by rewriting it in C language and using better, context-dependent, sam- pling method. Acknowledgment We acknowledge the support of the Slovenian Research Agency, ARRS, through research programme P2-0209 (Ar- tificial Intelligence and Intelligent Systems) and projects J6-8256 (New grammar of contemporary standard Slo- vene: sources and methods) and L1-7542 (Advancement of computationally intensive methods for efficient modern general-purpose statistical analysis and inference). Fruitful discussions with Erik Štrumbelj improved the implementa- tion of the IME method. Literature [1] Charu C Aggarwal, Chen Chen, and Jiawei Han. The inverse classification problem. Journal of Computer Science and Technology, 25(3):458–468, 2010. [2] Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. What is relevant in a text document?: An interpretable ma- chine learning approach. PloS ONE, 12(8):e0181142, 2017. [3] Sebastian Bach, Alexander Binder, Grégoire Monta- von, Frederick Klauschen, Klaus-Robert Müller, and Explanation of Prediction Models with. . . Informatica 42 (2018) 13–22 21 Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS ONE, 10(7):e0130140, 2015. [4] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification de- cisions. 
Journal of Machine Learning Research, 11 (Jun):1803–1831, 2010. [5] David Barbella, Sami Benzaid, Janara M Christensen, Bret Jackson, X Victor Qin, and David R Musicant. Understanding support vector machine classifications via a recommender system-like approach. In R. Stahl- bock, S. F. Crone, and S. Lessmann, editors, Procee- dings of International Conference on Data Mining, pages 305–311, 2009. [6] Adriano Barbosa, FV Paulovich, Afonso Paiva, Si- ome Goldenstein, Fabiano Petronetto, and LG Non- ato. Visualizing and interacting with kernelized data. IEEE transactions on visualization and computer graphics, 22(3):1314–1325, 2016. [7] Marko Bohanec, Mirjana Borštnar Kljajić, and Marko Robnik-Šikonja. Explaining machine learning mo- dels in sales predictions. Expert Systems with Appli- cations, 71:416–428, 2017. [8] Marko Bohanec, Marko Robnik-Šikonja, and Mirjana Kljajić Borštnar. Decision-making framework with double-loop learning through interpretable black-box machine learning models. Industrial Management & Data Systems, 117(7):1389–1406, 2017. [9] Zoran Bosnić, Jaka Demšar, Grega Kešpret, Pedro Pe- reira Rodrigues, João Gama, and Igor Kononenko. Enhancing data stream predictions with reliability es- timators and explanation. Engineering Applications of Artificial Intelligence, 34:178–192, 2014. [10] Jaka Demšar and Zoran Bosnić. Detecting concept drift in data streams using model explanation. Expert Systems with Applications, 92:546 – 559, 2018. [11] Alex Goldstein, Adam Kapelner, Justin Bleich, and Emil Pitkin. Peeking inside the black box: Visuali- zing statistical learning with plots of individual con- ditional expectation. Journal of Computational and Graphical Statistics, 24(1):44–65, 2015. [12] Lutz Hamel. Visualization of support vector machines with unsupervised learning. In Proceedings of 2006 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 2006. [13] Katja Hansen, David Baehrens, Timon Schroeter, Matthias Rupp, and Klaus-Robert Müller. Visual in- terpretation of kernel-based prediction models. Mole- cular Informatics, 30(9):817–826, 2011. [14] Aleks Jakulin, Martin Možina, Janez Demšar, Ivan Bratko, and Blaž Zupan. Nomograms for visualizing support vector machines. In Robert Grossman, Ro- berto Bayardo, and Kristin P. Bennett, editors, Pro- ceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mi- ning, pages 108–117. ACM, 2005. [15] Vincent Lemaire, Raphael Féraud, and Nicolas Voi- sine. Contact personalization using a score understan- ding method. In Proceedings of International Joint Conference on Neural Networks (IJCNN), 2008. [16] David Martens and Foster Provost. Explaining do- cuments’ classifications. Technical report, Center for Digital Economy Research, New York Univer- sity, Stern School of Business, 2011. Working paper CeDER-11-01. [17] David Meyer, Friedrich Leisch, and Kurt Hornik. The support vector machine under test. Neurocomputing, 55:169–186, 2003. [18] Rok Piltaver, Mitja Luštrek, Matjaž Gams, and Sanda Martinčić-Ipšić. What makes classification trees com- prehensible? Expert Systems with Applications, 62: 333–346, 2016. [19] Brett Poulin, Roman Eisner, Duane Szafron, Paul Lu, Russell Greiner, David S. Wishart, Alona Fyshe, Brandon Pearcy, Cam Macdonell, and John Anvik. Visual explanation of evidence with additive classi- fiers. In Proceedings of AAAI’06. AAAI Press, 2006. [20] Marko Pregeljc, Erik Štrumbelj, Miran Mihelčič, and Igor Kononenko. 
Learning and explaining the impact of enterprises’ organizational quality on their econo- mic results. In R. Magdalena-Benedito, M. Martinez- Sober, J. M. Martinez-Martinez, P. Escandell- Moreno, and J. Vila-Frances, editors, Intelligent Data Analysis for Real-Life Applications: Theory and Practice, pages 228–248. Information Science Refe- rence, IGI Global, 2012. [21] Marco Tulio Ribeiro, Sameer Singh, and Carlos Gue- strin. Why should I trust you?: Explaining the pre- dictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Know- ledge Discovery and Data Mining, pages 1135–1144. ACM, 2016. [22] Marko Robnik-Šikonja. Data generators for learning systems based on rbf networks. IEEE Transactions on Neural Networks and Learning Systems, 27(5):926– 938, May 2016. [23] Marko Robnik-Šikonja. ExplainPrediction: Explana- tion of Predictions for Classification and Regression, 2017. URL http://cran.r-project.org/ package=ExplainPrediction. R package version 1.3.0. 22 Informatica 42 (2018) 13–22 M. Robnik-Šikonja [24] Marko Robnik-Šikonja and Igor Kononenko. Explai- ning classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering, 20(5):589–600, 2008. [25] Marko Robnik-Šikonja and Petr Savicky. COREle- arn - classification, regression, feature evaluation and ordinal evaluation, 2017. URL http://cran. r-project.org/package=CORElearn. R package version 1.52.0. [26] Andrea Saltelli, Karen Chan, and E. Marian Scott. Sensitivity analysis. Wiley, Chichester; New York, 2000. [27] Alexander Schulz, Andrej Gisbrecht, and Barbara Hammer. Using discriminative dimensionality re- duction to visualize classifiers. Neural Processing Letters, 42(1):27–54, 2015. [28] Karen Simonyan, Andrea Vedaldi, and Andrew Zis- serman. Deep inside convolutional networks: Visua- lising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013. [29] Erik Štrumbelj, Zoran Bosnić, Igor Kononenko, Branko Zakotnik, and Cvetka Grašič Kuhar. Expla- nation and reliability of prediction models: the case of breast cancer recurrence. Knowledge and informa- tion systems, 24(2):305–324, 2010. [30] Erik Štrumbelj and Igor Kononenko. An Efficient Explanation of Individual Classifications using Game Theory. Journal of Machine Learning Research, 11: 1–18, 2010. [31] Erik Štrumbelj, Igor Kononenko, and Marko Robnik- Šikonja. Explaining instance classifications with in- teractions of subsets of feature values. Data & Kno- wledge Engineering, 68(10):886–904, 2009. Informatica 42 (2018) 23–32 23 Semantic Annotation of Documents Based on Wikipedia Concepts Janez Brank, Gregor Leban and Marko Grobelnik Jožef Stefan Institute, Jamova 39, Ljubljana, Slovenia E-mail: janez.brank@ijs.si, gregor.leban@ijs.si, marko.grobelnik@ijs.si Keywords: semantic annotation, wikification, disambiguation, text mining Received: January 30, 2017 Semantic annotation is the task of augmenting an unstructured textual document with semantic information, such as concepts from an ontology. In wikification, the Wikipedia is used as an ontology and its pages (articles) are regarded as (representations of) concepts. We describe an efficient approach for annotating a document with relevant concepts from the Wikipedia. A global disambiguation method based on constructing a mention-concept graph and computing pagerank over it is used to identify a coherent set of relevant concepts considering the input document as a whole. 
The presented approach is suitable for parallel processing and can support any language for which a sufficiently large Wikipedia is available. Several heuristics involved in the disambiguation of candidate annotations are discussed and an experimental evaluation of their influence is presented. Povzetek: Semantično anotiranje je postopek, s katerim bi radi nestrukturirano besedilo dopolnili s semantičnimi informacijami, na primer s koncepti iz neke ontologije. Pri wikifikaciji se kot ontologijo uporablja Wikipedijo, pri čemer strani oz. članke v njej obravnavamo kot predstavitve konceptov. Opisujemo učinkovit pristop za anotiranje besedila z relevantnimi koncepti iz Wikipedije. Pri tem uporabljamo globalen pristop k razdvoumljanju, ki temelji na izgradnji grafa omemb in konceptov ter računanju pageranka na tem grafu, kar je nato podlaga za določitev nabora konceptov, ki se lepo skladajo med seboj in so relevantni glede na vhodno besedilo kot celoto. Opisani pristop je primeren za paralelno procesiranje in deluje za poljuben jezik, v katerem je na voljo dovolj velika Wikipedija. V članku predstavljamo in eksperimentalno ovrednotimo tudi več hevristik, ki se jih lahko uporabi pri razdvoumljanju kandidatov za anotacije. 1 Introduction Recent years have seen a growth in the use of semantic technologies. However, in many contexts we still deal with largely unstructured textual documents that lack explicit semantic information which might be required for further processing with semantic technologies. This leads to the problem of semantic annotation or semantic enrichment as an important preparatory step before further processing of a document. Given a document and an ontology covering the domain of interest, the challenge is to identify concepts from that ontology that are relevant to the document or that are referred to by it, as well as to identify specific passages in the document where the concepts in question are mentioned. A specific type of semantic annotation, known as wikification, involves using the Wikipedia as a source of possible semantic annotations [1][2]. In this setting, the Wikipedia is treated as a large and fairly general-purpose ontology: each page is thought of as representing a concept, while the relations between concepts are represented by internal hyperlinks between different Wikipedia pages, as well as by Wikipedia’s category memberships and cross-language links. The advantage of this approach is that the Wikipedia is a freely available source of information, it covers a wide range of topics, has a rich internal structure, and each concept is associated with a semi-structured textual document (i.e. the content of the corresponding Wikipedia article) which can be used to aid in the process of semantic annotation. Furthermore, the Wikipedia is available in a number of languages, with cross-language links being available to identify pages that refer to the same concept in different languages, thus making it easier to support multilingual and cross-lingual annotation. The remainder of this paper is structured as follows. In Section 2, we present the pagerank-based approach to wikification used in our wikifier. In Section 3, we describe our implementation and some experimental evaluaton. Section 4 contains conclusions and a discussion of possible future work. 
2 Pagerank-based Wikification The task of wikifying an input document can be broken down into several closely interrelated subtasks: (1) identify phrases (or words) in the input document that refer to a Wikipedia concept; (2) determine which concept exactly a phrase refers to; (3) determine which concepts are relevant enough to the document as a whole that they should be included in the output of the system (i.e. presented to the user). We follow the approach described by Zhang and Rettinger [1]. This approach makes use of the rich internal structure of hyperlinks between Wikipedia pages. A hyperlink can be thought of as consisting of a 24 Informatica 42 (2018) 23–32 J. Brank et al. source page, a target page, and the link text (also known as the anchor text). If a source page contains a link with the anchor text a and the target page t, this is an indication that the phrase a might be a reference to (or representation of) the concept that corresponds to page t. Thus, if the input document that we’re trying to wikify contains the phrase a, it might be the case that this occurrence of a in the input document also constitutes a mention of the concept t, and the concept t is a candidate annotation for this particular phrase. 2.1 Disambiguation In the Wikipedia, there may be many different links with the same anchor text a, and they might not all be pointing to the same target page. For example, in the English- language Wikipedia, there are links with a = “Tesla” that point to pages about the inventor, the car manufacturer, the unit in physics, a band, a film, and several other concepts. Thus, when such a phrase a occurs in an input document, there are several concepts that can be regarded as candidate annotations for that particular mention, and we have to determine which of them is actually relevant. This is the problem of disambiguation, similar to that of word sense disambiguation in natural language processing. There are broadly two approaches to disambiguation, local and global. In the local approach, each mention is disambiguated independently of the others, while the global approach aims to treat the document as a whole and disambiguate all the mentions in it as a group. The intuition behind the global approach is that the document that we’re annotating is about some topic, and the concepts that we use as annotation should be about that topic as well. If the document contains many mentions that include, as some of their candidate annotations, some car-related concepts, this makes it more likely that we should treat the mention of “Tesla” as a reference to Tesla the car manufacturer as opposed to e.g. a reference to Nikola Tesla or to Tesla the rock band. 2.2 The mention-concept graph To implement the global disambiguation approach, our Wikifier begins by constructing a mention-concept graph for the input document. (Some authors, e.g. [2], refer to this as a mention-entity graph, but we prefer to use the term “mention-concept graph” as some of the Wikipedia pages do not necessarily correspond to concepts that we usually think of as entities, and our wikifier does not by default try to exclude them.) This can be thought of as a bipartite graph in which the left set of vertices corresponds to mentions and the right set of vertices corresponds to concepts. A directed edge a  c exists if and only if the concept c is one of the candidate annotations for the mention a (i.e. if, in the Wikipedia, there exists a hyperlink with the anchor text a and the target c). 
A transition probability is also assigned to each such edge, P(a → c), defined as the ratio [number of hyperlinks, in the Wikipedia, having the anchor text a and the target c] / [number of hyperlinks, in the Wikipedia, having the anchor text a]. This graph is then augmented by edges between concepts, the idea being that an edge c → c' should be used to indicate that the concepts c and c' are "semantically related", in the sense that if one of them is relevant to a given input document, the other one is also more likely to be relevant to that document. (For example, the semantic relationship between the concepts "Electric vehicle" and "Tesla Inc." should be much stronger than between the concepts "Electric vehicle" and "Tesla (rock band)".) This measure of semantic relatedness will be used subsequently to encourage the formation of a group of annotations that are semantically related in the sense that they refer to the same topic, which is hopefully also the topic of the document. This would encourage mentions of "Tesla" in a document about electric cars to be annotated with the concept "Tesla Inc." rather than "Tesla (rock band)". Following [1], the internal link structure of the Wikipedia is used to calculate a measure of semantic relatedness. Informally, the idea is that if c and c' are closely related, then other Wikipedia pages that point to c are likely to also point to c', and vice versa. Let L_c be the set of Wikipedia pages that contain a hyperlink to c, and let N be the total number of concepts in the Wikipedia; then the semantic relatedness of c and c' can be defined as

SR(c, c') = 1 − (log max{|L_c|, |L_c'|} − log |L_c ∩ L_c'|) / (log N − log min{|L_c|, |L_c'|}).

Intuitively, this formula considers two concepts to be semantically related if pages that link to one of them typically also link to the other one (and vice versa). More specifically, SR will be higher if the overlap (i.e. the intersection) of L_c and L_c' is large (relative to the size of L_c and L_c'), and the formula also rewards situations where the sets L_c and L_c' are themselves large (relative to the overall number of documents N), as this means that the dataset includes more evidence of a semantic relationship between c and c'. In the graph, we add an edge of the form c → c' wherever the semantic relatedness SR(c, c') is > 0. The transition probability of this edge is defined as proportional to the semantic relatedness: P(c → c') = SR(c, c') / Σ_c'' SR(c, c''). This graph is then used as the basis for calculating a vector of pagerank scores [3], one for each vertex. This is done using the usual iterative approach where in each iteration, each vertex distributes its pagerank score to its immediate successors in the graph, in proportion to the transition probabilities on its outgoing edges:

PR_new(u) = α PR_0(u) + (1 − α) Σ_v PR_old(v) P(v → u).

The baseline distribution of pagerank, PR_0, is used both to help the process converge and also to counterbalance the fact that in our graph there are no edges pointing into the mention vertices. In our case, PR_0(u) is defined as 0 if u is a concept vertex; if u is a mention vertex, we use PR_0(u) = z · [number of Wikipedia pages containing the phrase u as the anchor text of a hyperlink] / [number of Wikipedia pages containing the phrase u], where z is a normalization constant ensuring that Σ_u PR_0(u) = 1. We used α = 0.1 as the stabilization parameter.
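As a concrete illustration of these two formulas, the following R sketch computes the semantic relatedness of two concepts from the sets of pages linking to them, and runs the pagerank iteration on a dense transition matrix. It is a simplified reconstruction for exposition (the actual wikifier works with large, shared, read-only index structures rather than dense matrices), and the function names are ours.

# Semantic relatedness SR(c, c') from the sets of in-linking pages Lc and Lc2
# and the total number of concepts N (illustrative reconstruction).
semanticRelatedness <- function(Lc, Lc2, N) {
  common <- length(intersect(Lc, Lc2))
  if (common == 0) return(0)   # no shared in-links: treat as unrelated (no edge)
  1 - (log(max(length(Lc), length(Lc2))) - log(common)) /
      (log(N) - log(min(length(Lc), length(Lc2))))
}

# Pagerank iteration PR_new = alpha * PR_0 + (1 - alpha) * PR_old %*% P, where P
# is the matrix of transition probabilities (rows sum to 1 over outgoing edges)
# and pr0 is the baseline distribution over the vertices.
pagerank <- function(P, pr0, alpha = 0.1, nIter = 50) {
  pr <- pr0
  for (i in seq_len(nIter))
    pr <- alpha * pr0 + (1 - alpha) * as.vector(pr %*% P)
  pr
}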
The intuition behind this approach is that in each iteration of the pagerank calculation process, the pagerank flows into a concept vertex c from mentions that are closely associated with the concept c and from other concepts that are semantically related to c. Thus after a few iterations, pagerank should tend to accumulate in a set of concepts that are closely semantically related to each other and that are strongly associated with words and phrases that appear in the input document, which is exactly what we want in the context of global disambiguation.

2.3 Using pagerank for disambiguation

Once the pagerank values of all the vertices in the graph have been calculated, we use the pagerank values of concepts to disambiguate the mentions. If there are edges from a mention a to several concepts c, we choose the concept with the highest pagerank as the one that is relevant to this particular mention a. We say that this concept is supported by the mention a. At the end of this process, concepts that are not supported by any mention are discarded as not being relevant to the input document. The remaining concepts are then sorted in decreasing order of their pagerank. Let the i'th concept in this order be c_i and let its pagerank be PR_i, for i = 1, …, n. Concepts with a very low pagerank value are less likely to be relevant, so it makes sense to apply a further filtering step at this point and discard concepts whose pagerank is below a user-specified threshold. However, where exactly this threshold should be depends on whether the user wants to prioritize precision or recall. Furthermore, the absolute values of pagerank can vary a lot from one document to another, e.g. depending on the length of the documents, the number of mentions and candidate concepts, etc. Thus we apply the user-specified threshold in the following manner: given the user-specified threshold value ρ ∈ [0, 1], we output the concepts c_1, …, c_m, where m is the least integer such that Σ_{i=1..m} PR_i² ≥ ρ Σ_{i=1..n} PR_i². In other words, we report as many top-ranking concepts as are needed to cover a fraction ρ of the total sum of squared pageranks of all the concepts. We use ρ = 0.8 as a broadly reasonable default value, though the user can require a different threshold depending on their requirements. The motivation for using squares of pageranks instead of the pageranks themselves is to put a greater emphasis on the annotations with the highest values of pagerank, while culling the lower-scoring annotations more thoroughly. In our preliminary experiments, this led to a small improvement in performance compared to using the sums of pageranks without squaring them. For each reported concept, we also output a list of the mentions that support it.

2.4 Treatment of highly ambiguous mentions

Our wikifier supports various minor heuristics and refinements in an effort to improve the performance of the baseline approach described in the preceding sections. As described above, the anchor text of hyperlinks in the Wikipedia is used to identify mentions in an input document (i.e. words or phrases that may support an annotation). One downside of this approach is that some words or phrases occur as the anchor text of a very large number of hyperlinks in the Wikipedia, and these links point to a large number of different Wikipedia pages.
2.4 Treatment of highly ambiguous mentions

Our wikifier supports various minor heuristics and refinements in an effort to improve the performance of the baseline approach described in the preceding sections. As described above, the anchor text of hyperlinks in the Wikipedia is used to identify mentions in an input document (i.e. words or phrases that may support an annotation). One downside of this approach is that some words or phrases occur as the anchor text of a very large number of hyperlinks in the Wikipedia, and these links point to a large number of different Wikipedia pages. In other words, such a phrase is highly ambiguous; it is not only unlikely to be disambiguated correctly, but also introduces noise into the mention-concept graph by introducing a large number of concept vertices, the vast majority of which will be completely irrelevant to the input document. This also slows down the annotation process by increasing the time to calculate the semantic relatedness between all pairs of candidate concepts. (As an example of such a highly ambiguous mention, consider the word “Country”. Most of the time, when it appears as the anchor text of a link, it is a link to the concepts “Country” or “Country music”, but it also occurs in links to more than a hundred other concepts, mostly individual countries.)

We use several heuristics to deal with this problem. Suppose that a given mention a occurs, in the Wikipedia, as the anchor text of n hyperlinks pointing to k different target pages, and suppose that ni of these links point to page ci (for i = 1, …, k). We can now define the entropy of the mention a as the amount of uncertainty regarding the link target given the fact that its anchor text is a: H(a) = − Σi=1..k (ni/n) log(ni/n). If this entropy is above a user-specified threshold (e.g. 3 bits), we completely ignore the mention as being too ambiguous to be of any use. For mentions that pass this heuristic, we sort the target pages in decreasing order of ni and use only the top few of them (e.g. the top 20) as candidates in our mention-concept graph. A third heuristic is to ignore candidates for which ni itself is below a certain threshold (e.g. ni < 2), the idea being that if such a phrase occurs only once as the anchor text of a link pointing to that candidate, this may well turn out to be noise and is best disregarded.

Optionally, the Wikifier can also be configured to ignore certain types of concepts based on their Wikidata class membership. This can be useful to exclude from consideration Wikipedia pages that do not really correspond to what is usually thought of as entities (e.g. “List of …” pages). Another heuristic that we have found useful in reducing the noise in the output annotations is to ignore any mention that consists entirely of stopwords and/or very common words (the top 200 most frequent words in the Wikipedia for that particular language). For this as well as for other purposes the text processing is done in a case-sensitive fashion, which e.g. allows us to ignore spurious links with the link text “the” while processing those that refer to the band “The The”.

2.5 Miscellaneous heuristics

Semantic relatedness. As mentioned above, the definition of semantic relatedness of two concepts, SR(c, c'), is based on the overlap between the sets Lc, Lc' of immediate predecessors of these two concepts in the Wikipedia link graph. Optionally, our Wikifier can compute semantic relatedness using immediate successors or immediate neighbours (i.e. both predecessors and successors) instead of immediate predecessors. However, our preliminary experiments indicated that these changes do not lead to improvements in performance, so they are disabled by default.

Extensions to disambiguation. Our Wikifier also supports some optional extensions of the disambiguation process. As described above, the default behavior when disambiguating a mention is to simply choose the candidate annotation with the highest pagerank value.
Alternatively, after any heuristics from section 2.4 have been applied, the remaining candidate concepts can be re-ranked using a different scoring function that takes other criteria besides pagerank into account. This is an opportunity to combine the global disambiguation approach with some local techniques. In general, a scoring function of the following type is supported: score(c|a) = w1 f(P(c|a)) PR(c) + w2 S(c, d) + w3 LS(c, a) (1) Here, a is the mention that we’re trying to disambiguate, and c is the candidate concept that we’re evaluating. P(c|a) is the probability that a hyperlink in the Wikipedia has c as its target conditioned on the fact that it has a as its anchor text. f(x) can be either 1 (the default), x, or log(x). PR(c) is the pagerank of c’s vertex in the mention-concept graph. S(c, d) is the cosine similarity between the text of the input document d and of the Wikipedia page for the concept c. LS(c, a) is the cosine similarity between the context (e.g. previous and next 3 words) in which a appears in the input document d, and the contexts in which hyperlinks with the target c appear in the Wikipedia. Finally, w1, w2, w3 are weight constants. However, our preliminary experiments haven’t shown sufficient improvements from the addition of these heuristics, so they are disabled by default (f(x) = 1, w2 = w3 = 0) to save computational time and memory (storing the link contexts needed for the efficient computation of LS has turned out to be particularly memory intensive). 3 Implementation and evaluation 3.1 Implementation Our implementation of the approach described in the preceding section is running as a web service and can be accessed at http://wikifier.org. The approach is suitable for parallel processing as annotating one document is independent of annotating other documents, and any shared data used by the annotation process (e.g. the Wikipedia link graph, and a trie-based data structure that indexes the anchor text of all the hyperlinks) need to be accessed only for reading and can thus easily be shared by an arbitrary number of worker threads. This allows for a highly efficient processing of a large number of documents. The only need to modify shared data structures arises when a new dump of the Wikipedia becomes available (the Wikipedia publishes new dumps of its content twice per month). We use a separate process to periodically check the Wikipedia web site for new dumps, download them, parse them, and build indexes in a form that can be used by our wikifier. Once the wikifier web service is notified of the availability of a new index, it loads its contents into memory, temporarily stops issuing new requests to worker threads, waits for them all to finish processing their current requests, and then updates the shared data structures to include the new index and discard the old one. In this way, new indices can be brought online without shutting down the service and with a minimal interruption to its availability. Our implementation currently processes on average more than 500,000 requests per day (the total length of input documents averages about 1.2 GB per day), including all the documents from the JSI Newsfeed service [4]. The output is used among other things as a preprocessing step by the Event Registry system [5]. The wikifier currently supports all languages in which a Wikipedia with at least 1000 pages is available, amounting to a total of 134 languages. 
Admittedly, 1000 pages is much too small to achieve an adequate coverage; however, about 60 languages have a Wikipedia with at least 100,000 pages, which is enough for many practical applications. Annotations are returned in JSON format and can optionally include detailed information about support (which mentions support each annotation), alternative candidate annotations (concepts that were considered as candidates during the disambiguation process but were rejected in favour of some other higher scored concept), and WikiData/DbPedia class membership of the proposed annotations. Thus, the caller can easily implement any desired class-based post-processing. Our wikifier also allows the user to define custom vocabularies that can be used to generate annotations in addition to the Wikipedia-based annotations described so far. A custom vocabulary is a set of concepts where each concept consists of an ID and a set of one or more words of phrases which, if they appear in the input document, trigger the inclusion of this concept among the annotations. This allows the user to extend the system with custom sets of annotations, but the downside is that such custom annotations are not part of the Wikipedia and thus cannot be included in the usual wikification process, especially not in the pagerank-based disambiguation algorithm. As a preprocessing step, the user may specify one or more sets of “alternative labels”, which are really rewriting rules of the form “w1 w2 … wn → x1 x2 … xm” indicating that the sequence of the words w1 w2 … wn may, if it occurs in the input document, be replaced by the sequence x1 x2 … xm prior to the main part of the Semantic Annotation of Documents … Informatica 42 (2018) 23–32 27 wikification algorithm. (The word “may” in the preceding sentence means that the original sequence of words from the left-hand side of the rule is also kept in the document. Thus, the document is no longer a simple sequence of words, but may gradually turn into an arbitrary directed acyclic graph, the various paths through which indicate different alternatives into which the text of the document may be brought through the application of the rewriting rules.) Owing to such transformations, certain candidate mentions might appear in the document that did not appear in the original document. Several such rules may be applied one after another and may affect the same part(s) of a document. Theoretically, such rewriting rules form a Turing- complete formalism, and to keep the problem tractable our wikifier makes only three passes through the document to look for occurrences of left-side word sequences and replace them with the corresponding right- side word sequences. Currently the main use of this mechanism in our wikifier is to provide additional alternative spellings of some proper names in cases where these are not adequately covered in the Wikipedia. This has been found to be particularly useful in case of names transliterated from languages that use a different script and where several different transliteration schemes are in use. 3.2 Evaluation One way to evaluate wikification is to compare the set of annotations with a manually annotated gold standard for the same document(s). Performance can then be measured using metrics from information retrieval, such as precision, recall, and the F1-measure, which is defined as the harmonic mean of precision and recall. We used two manually annotated datasets: (Dataset 1.) 
A set of 1393 news articles that was made available by the authors of the AIDA system and was originally used in their experiments [2]. This manually annotated dataset excludes, by design, any annotations that do not correspond to named entities. Since our wikifier does not by default distinguish between named entities and other Wikipedia concepts, we have explicitly excluded concepts that are not named entities (based on their class membership in the WikiData ontology) from the output of our Wikifier for the purposes of this experiment.

(Dataset 2.) A set of 491 news articles taken randomly from the JSI Newsfeed [4] on 29 June 2016 and annotated manually with relevant Wikipedia concepts. Unlike the first dataset, the annotations here included concepts that were not named entities.

In addition to our wikifier, we obtained annotations from the following systems: AIDA [2], Waikato Wikipedia Miner [7], Babelfy [8], Illinois [9], and DbPedia Spotlight [10]. The Waikato system is not included in the experiments involving dataset 2, as its web service was no longer available at the time. Tables 1(a) and 1(b) show the agreement not only between each of the wikifiers and the gold standard, but also between each pair of wikifiers (the lower left triangle of the matrix is left empty as it would be just a copy of the upper right triangle, since the F1-measure is symmetric).

As this experiment indicates, on the first dataset (the AIDA dataset) our wikifier (“JSI” in the table) performs slightly worse than AIDA but significantly better than the other wikifiers. On the second dataset (the JSI dataset), the best performance was achieved by Babelfy; ours is slightly worse, while AIDA is significantly worse on this dataset. Thus, overall we can conclude that our wikifier has solid performance across two considerably different datasets. Furthermore, the experiments on both datasets show that there is relatively little agreement between different wikifiers, which indicates that wikification itself is in some sense a vaguely defined task where different people can have very different ideas about whether a particular Wikipedia concept is relevant to a particular input document (and should therefore be included as an annotation) or not, which types of Wikipedia concepts can be considered as annotations (e.g. only named entities or all concepts), etc. Possibly the level of agreement could be improved by fine-tuning the settings of the various wikifiers; in the experiment described above, default settings were used.

            Gold    JSI     AIDA    Waikato  Babelfy  Illinois  Spotlight
Gold        1.000   0.593   0.723   0.372    0.323    0.476     0.279
JSI                 1.000   0.625   0.527    0.431    0.489     0.363
AIDA                        1.000   0.372    0.352    0.434     0.356
Waikato                             1.000    0.481    0.564     0.474
Babelfy                                      1.000    0.434     0.356
Illinois                                              1.000     0.376
Spotlight                                                       1.000

Table 1(a): F1 measure of agreement between the various wikifiers and the gold standard on dataset 1.

            Gold    JSI     AIDA    Babelfy  Illinois  Spotlight
Gold        1.000   0.378   0.197   0.417    0.372     0.282
JSI                 1.000   0.278   0.360    0.413     0.397
AIDA                        1.000   0.206    0.283     0.383
Babelfy                             1.000    0.380     0.282
Illinois                                     1.000     0.367
Spotlight                                              1.000

Table 1(b): F1 measure of agreement between the various wikifiers and the gold standard on dataset 2.
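For reference, the per-document agreement measure reported in these tables can be computed as in the sketch below. This is only an illustration of the symmetric F1 between two sets of concept annotations; the paper does not specify how per-document scores are aggregated over the whole dataset, so that step is omitted, and the function name is ours.

```python
def f1_agreement(annotations_a, annotations_b):
    # Symmetric F1 between two sets of concept annotations for one document,
    # i.e. the harmonic mean of precision and recall when either set is
    # treated as the reference. Inputs are collections of concept identifiers.
    a, b = set(annotations_a), set(annotations_b)
    if not a or not b:
        return 0.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)
```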
We also conducted a small experiment on dataset 2 to compare two forms of the thresholding criterion: one based on the sums of squares of pageranks (as described in Section 2.3) and one based on the sums of the pageranks themselves. The F1-measure between our annotations and the gold standard drops from 0.378 when using squared pageranks to 0.344 when using the pageranks directly. We used squared pageranks for thresholding in all other experiments in this section.

Evaluation of disambiguation heuristics. In the following experiment, we evaluate some of the additional disambiguation heuristics described in Section 2.5. The purpose of the experiment was to find the best-performing combination of the following heuristics and parameters from that section:

(i) Logarithmic link counts: in Section 2.2, we defined the transition probability a → c in the mention-concept graph as being proportional to the number of links, in the Wikipedia, with the anchor text a and the target c (the “link count” of c given a). Alternatively, it can be defined as being proportional to the logarithm of this link count. The purpose of this heuristic is to provide a kind of smoothing and discourage too much of the pagerank score from flowing into just one candidate c for that particular mention a. The Wikipedia is known for having various biases in terms of how frequently certain topics are covered, so this sort of smoothing may soften the more extreme differences in the frequency of coverage while still preserving some information about which concepts c are associated more often with a phrase a.

(ii) Set of links used in the computation of semantic relatedness (SR) between two concepts in the Wikipedia link graph: this can be the in-links (the default setting), the out-links, or all neighbours.

(iii) Threshold for re-ranking: in this scenario, the candidates c for a given mention a are first sorted by pagerank, the top few candidates are kept and are then re-ranked using the more detailed (and computationally intensive) scoring function denoted by eq. (1). The question then is what counts as the “top few candidates” to be included in the re-ranking. We define this by introducing a parameter ϑ ∈ [0, 1] such that a candidate c proceeds to re-ranking if its pagerank is PR(c) ≥ ϑ · maxc' PR(c'), where c' goes over all the candidates for the current mention a.

(iv) Linearization of pagerank in the scoring function denoted by eq. (1) into a linear rank: instead of using the pagerank directly, all the candidate concepts c for a given mention a are sorted by pagerank and a linear rank is assigned to each. If there are k candidates, the i’th of them in this order gets a linear rank of i/k. This is then used instead of PR(c) in eq. (1), as well as in the ϑ-based thresholding criterion in the previous paragraph (where ϑ then simply becomes the proportion of candidates that proceed to the re-ranking phase). The purpose is to make sure that the range [0, 1] is covered evenly, instead of the pagerank values possibly being clustered in a small part of that range.

(v) Weight w2 of the cosine similarity between the input document and the Wikipedia page of a candidate concept, in the scoring function of eq. (1). (The weight w1 of the candidate concept’s pagerank value in the scoring function was then set to 1 − w2. The weight w3 of the link-context similarity was kept at 0 throughout these experiments, because of the considerable additional memory and time consumption required for the link-context computation and because preliminary experiments indicated that the results were not promising.)
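A small sketch of how the ϑ-based selection of candidates for re-ranking (heuristic (iii)), optionally combined with the linearized rank of heuristic (iv), could be implemented. The reading that the linear rank i/k is assigned after sorting by increasing pagerank, so that the strongest candidate receives rank 1, is our assumption, as is the function name.

```python
def candidates_for_reranking(candidates, theta=0.6, linearize=False):
    # candidates: list of (concept, pagerank) pairs for one mention a.
    # Returns the concepts whose score is >= theta * max score, i.e. the
    # "top few" candidates passed on to the second-stage scoring of eq. (1).
    if not candidates:
        return []
    if linearize:
        # Linear rank i/k, assuming the sort is by increasing pagerank so
        # that the strongest candidate gets rank k/k = 1 (our reading of the
        # description above).
        k = len(candidates)
        ordered = sorted(candidates, key=lambda kv: kv[1])
        scored = [(c, (i + 1) / k) for i, (c, _) in enumerate(ordered)]
    else:
        scored = list(candidates)
    top = max(score for _, score in scored)
    return [c for c, score in scored if score >= theta * top]
```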
The possible values of these five parameters that were investigated in this experiment can be summarized as follows:

(i) linkCounts ∈ {normal, log}
(ii) SR ∈ {in, out, all}
(iii) ϑ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}
(iv) PR ∈ {normal, linearized}
(v) w2 ∈ {0, 0.25, 0.5, 0.75, 1}

The default settings are: normal link counts, SR = in, ϑ = 1 (no second-stage re-ranking), PR = normal, w2 = 0.

Table 2 shows, for each possible value of each parameter, the best and the average performance (in terms of the F1-measure relative to the gold standard) that can be achieved by fixing that parameter to that value and allowing the other parameters to range over all the possible values indicated above. For comparison, the last two rows show the performance with no parameters fixed (allowing us to tune the best possible combination of all parameters) and the performance with all parameters fixed at their default values.

This experiment was done on Dataset 1 and used 10-fold cross-validation. Nine folds (the training set) were used to tune any parameters that were not held fixed, and the best resulting combination of parameters was then evaluated on the tenth fold (the test set). This was repeated for all ten choices of the test fold. Table 2 shows the average and the standard deviation of the F1 performance on the test fold over all 10 choices of the test fold.

As we can see from this experiment, it is indeed possible to achieve a small improvement in performance by employing some of the heuristics described here. The best-performing combination of parameters was {normal link counts, SR = in, ϑ = 0.6, PR = normal, w2 = 1}, which resulted in an F1 score of 0.6152, up from the score of 0.5917 achieved by the default parameter values. A paired t-test showed this difference to be significant at a p-value of 0.0005. However, we can also see that, in practical terms, this improvement is small and might not be noticed by the user. Furthermore, it is clear that several of the heuristics employed were in fact counterproductive: using logarithms of link counts to define the mention-concept transition probabilities; using out-links or all neighbours (instead of just in-links) in the definition of semantic relatedness; and using linearized pageranks. Shifting these parameters away from their default settings in fact led to a deterioration of F1 (in all these cases, the deterioration is statistically significant with a p-value of 0.001 or less). Improvements in performance mostly came from re-ranking the most promising candidates (ϑ = 0.6) based on the cosine similarity between the input document and the Wikipedia pages of the candidate concepts. The “Average F1” column of Table 2 shows that no parameter by itself can ensure good performance unless the other parameters are also chosen suitably, as the average performance over all the combinations of the other parameters is poor regardless of which parameter has been fixed and at what value.

4 Use in a real-life application

Semantically annotating documents can be of high importance in several real-life applications. An example of such an application is Event Registry [5]. Event Registry is a system that collects and analyzes news content generated globally and identifies the world events mentioned in the news. As the application aims to extract knowledge in structured form from the unstructured text, we will now describe how the Wikifier’s semantic annotations provide the critical input required by the system.
For each news article, Event Registry stores the list of identified semantic annotations. Among other things, this allows the users to search for news content using the semantic tags and not keywords, as we are used to in the search engines. The main advantages of using tags versus keywords are that one can e.g. (a) specifically ask for articles about apple, the fruit, versus Apple, the company, (b) find articles about IBM regardless of how it’s mentioned in the news articles (“IBM”, “I.B.M.”, “International Business Machines”, etc.), and (c) find articles about White House in any language. The last use case is available because the Wikipedia also maintains information on which Wikipedia pages in different languages represent the same concept. Consequently, the tag for “White House” in an English article will be the same, as the tag for “Casa Blanca” in a Spanish article. When Event Registry identifies a group of news articles that represent the same event, it uses the semantic information in the news articles to determine the core event information. First, it analyzes all news articles in the event and calculates how frequently individual concepts appear in these articles. A ranked list of commonly mentioned concepts is then used as a semantic summary or a “fingerprint” of an event. Another critical piece of information about the event is its geographical location. In order to determine the location, Event Registry again analyzes concepts mentioned in the news articles about the event, and considers as possible candidates only those that refer to a geographical location. For each candidate location, a set of learning features is extracted. The learning features that we extract are as follows:  Mentions of the location in the articles about the event. The value is simply the ratio of the number of news articles about the event that mention the location somewhere in the text and the total number of articles about the event.  Mentions of the location in the dateline (beginning of the article). This feature is computed as the ratio of the number of articles about the event in which the location is mentioned in the dateline and the total number of articles about the event.  Normalized versions of the previous two features. In this case we compute a variation of the previous two features, where we don’t compute a simple ratio, but weight the contribution of an individual article by the cosine similarity of the article to the centroid of the event. Articles closer to the centroid (more relevant articles) therefore contribute more to the final feature value.  Commonality of the location — how frequently is the location generally present in the news articles. The value is computed as the ratio of the number of articles in Event Registry that mention this location and the total number of articles. Based on these features, a logistic regression model computes a probability for each of the candidate locations to be the location of the event. If the location with the highest probability is above the predetermined threshold, the location is chosen as the location of the event. The logistic model was trained on 1239 manually labeled events and has 96.2% classification accuracy. Semantic annotations are also of high importance when a search is performed and a large number of results need to be summarized. An example of such a summary is displayed in Figure 1, where we searched for events about hurricanes. 
The resulting list, which contained over 23 000 events, was summarized as shown in the figure to illustrate the top concepts mentioned in these events.

Figure 1: Summary of top concepts in events about hurricanes.

Parameter                      Avg. F1           Max. F1
linkCounts = normal            0.5883 ± 0.0253   0.6152 ± 0.0239
linkCounts = log               0.5368 ± 0.0268   0.5833 ± 0.0221
SR = in                        0.5800 ± 0.0232   0.6152 ± 0.0239
SR = out                       0.5626 ± 0.0265   0.5945 ± 0.0253
SR = all                       0.5451 ± 0.0283   0.5927 ± 0.0268
PR = normal                    0.5644 ± 0.0261   0.6152 ± 0.0239
PR = linearized                0.5607 ± 0.0259   0.5978 ± 0.0223
ϑ = 0                          0.5604 ± 0.0256   0.5974 ± 0.0223
ϑ = 0.2                        0.5614 ± 0.0256   0.5982 ± 0.0221
ϑ = 0.4                        0.5634 ± 0.0258   0.6054 ± 0.0225
ϑ = 0.6                        0.5649 ± 0.0259   0.6152 ± 0.0239
ϑ = 0.8                        0.5642 ± 0.0260   0.6004 ± 0.0216
ϑ = 1                          0.5610 ± 0.0273   0.5939 ± 0.0280
w2 = 0                         0.5610 ± 0.0273   0.5939 ± 0.0280
w2 = 0.25                      0.5646 ± 0.0266   0.5979 ± 0.0209
w2 = 0.5                       0.5655 ± 0.0265   0.6060 ± 0.0228
w2 = 0.75                      0.5654 ± 0.0253   0.6128 ± 0.0248
w2 = 1                         0.5563 ± 0.0245   0.6152 ± 0.0239
Nothing fixed                  0.5626 ± 0.0260   0.6152 ± 0.0239
All fixed to default values    0.5917 ± 0.0226   0.5917 ± 0.0226

Table 2: F1 measure of agreement between our wikifier and the gold standard while keeping one parameter fixed and tuning the others. “Avg. F1” shows the average performance over all possible combinations of the non-fixed parameters; “Max. F1” shows the best performance achieved by tuning the non-fixed parameters on the training folds. Both columns show the F1 performance on the test fold. Since cross-validation was used, the performances are shown in the form “average ± standard deviation” over all 10 possible splits of the data into 9 training folds and 1 test fold.

5 Conclusions and future work

We have presented a practical and efficient approach to wikification that requires no external data except the Wikipedia itself, can deal with documents in any language for which the Wikipedia is available, and is suitable for a high-performance, parallelized implementation.

The approach presented here could be improved along several directions. One significant weakness of the current approach concerns the treatment of minority languages. When dealing with a document in a certain language, we need hyperlinks whose anchor text is in the same language if we are to identify mentions in this input document. Thus, if the document is in a language for which the Wikipedia is not available, it cannot be wikified using this approach; and similarly, if the Wikipedia is available in this language but is small, with a small amount of text, a low number of pages, and generally poor coverage, the performance of wikification will be low. One idea to alleviate this problem is to optionally allow a second stage of processing, in which Wikipedias in languages other than the language of the input document would also be used to identify mentions and provide candidate annotations. This might particularly improve the coverage of concepts that are referred to by the same words or phrases across multiple languages, as is the case with some types of named entities. For the purposes of pagerank-based disambiguation in this second stage, a large common link graph would have to be constructed by merging the link graphs of the Wikipedias for the different languages. This can be done by using the cross-language links which are available in the WikiData ontology, providing information about when different pages in different languages refer to the same concept.
Another interesting direction for further work would be to incorporate local disambiguation techniques as a way to augment the current global disambiguation approach. When evaluating whether a mention a in the input document refers to a particular concept c, the local approach would focus on comparing the context of a to either the text of the Wikipedia page for c, or to the contexts in which hyperlinks to c occur within the Wikipedia. Preliminary steps taken in this direction in Sec. 2.5 did not lead to improvements in performance, but this subject is worth exploring further. Instead of the bag-of-words representation of contexts, other vector representations of words could be used, e.g. word2vec [6].

6 Acknowledgement

This work was supported by the Slovenian Research Agency as well as the euBusinessGraph (ICT-732003-IA), EW-Shopp (ICT-732590-IA) and RENOIR (H2020-MSCA-RISE-691152) projects.

7 References

[1] L. Zhang, A. Rettinger. Final ontological word-sense-disambiguation prototype. Deliverable D3.2.3, xLike Project, October 2014.
[2] J. Hoffart, M. A. Yosef, I. Bordino, et al. Robust disambiguation of named entities in text. Proc. of the 2011 Conf. on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, 2011, pp. 782–792.
[3] L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank citation ranking: Bringing order to the web. Digital Libraries Project Report SIDL-WP-1999-0120, Stanford University, 1998.
[4] M. Trampuš, B. Novak. Internals of an aggregated web news feed. Proc. SiKDD 2012.
[5] G. Leban, B. Fortuna, J. Brank, M. Grobelnik. Event Registry: Learning about world events from news. Proc. of the 23rd Int. Conf. on the World Wide Web (WWW 2014), pp. 107–110.
[6] T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs.CL], 2013.
[7] D. Milne, I. H. Witten. An open-source toolkit for mining Wikipedia. Artificial Intelligence, 194:222–239 (January 2013).
[8] A. Moro, A. Raganato, R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Trans. of the Assoc. for Comp. Linguistics, 2:231–234 (2014).
[9] L. Ratinov, D. Roth, D. Downey, M. Anderson. Local and global algorithms for disambiguation to Wikipedia. Proc. of the 49th Annual Meeting of the Assoc. for Comp. Linguistics: Human Language Technologies (2011), pp. 1375–1384.
[10] J. Daiber, M. Jakob, C. Hokamp, P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. Proc. of the 9th Int. Conf. on Semantic Systems, 2013.

Continuous Blood Pressure Estimation from PPG Signal

Gašper Slapničar and Mitja Luštrek
Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana
E-mail: gasper.slapnicar@ijs.si, mitja.lustrek@ijs.si

Matej Marinko
Faculty of Mathematics and Physics, Jadranska cesta 19, 1000 Ljubljana
E-mail: matejmarinko123@gmail.com

Keywords: photoplethysmography, blood pressure estimation, regression analysis, m-health

Received: November 11, 2017

Given the importance of blood pressure (BP) as a direct indicator of hypertension, regular monitoring is encouraged for healthy people and mandatory for patients at risk from cardiovascular diseases. We propose a system in which the photoplethysmogram (PPG) is used to continuously estimate BP.
A PPG sensor can be easily embedded in a modern wearable device, which can be used in such an approach. The PPG signal is first preprocessed in order to remove major noise and movement artefacts present in the signal. A set of features describing the PPG signal on a per-cycle basis is then computed to be used in regression models. The predictive performance of the models is improved by first using the RReliefF algorithm to select a subset of relevant features. Afterwards, personalization of the models is considered to further improve the performance. The approach is validated using two distinct datasets, one from a hospital environment and the other collected during every-day activities. Using the MIMIC hospital dataset, the best achieved mean absolute errors (MAE) in a leave-one-subject-out (LOSO) experiment were 4.47 mmHg for systolic and 2.02 mmHg for diastolic BP, at maximum personalization. For everyday-life dataset, the lowest errors in the same LOSO experiment were 8.57 mmHg for systolic and 4.42 mmHg for diastolic BP, again using maximum personalization. The best performing algorithm was an ensemble of regression trees. Povzetek: Krvni tlak je neposreden pokazatelj hipertenzije. Razvili smo sistem, ki krvni tlak ocenjuje iz fotopletizmograma (PPG), kakršen je že vgrajen v večino modernih senzorskih zapestnic. Signal PPG smo sprva predprocesirali in segmentirali na cikle. Predprocesiranje odpravi večino šuma, ki se pogosto pojavlja zaradi gibanja. Iz očiščenega signala smo nato izračunali množico značilk, ki smo jih uporabili v regresijskih modelih. Sistem smo izboljšali z uporabo algoritma RReliefF za izbor relevantnih značilk in z uporabo dela podatkov vsake osebe za učenje personaliziranih napovednih modelov. Sistem smo vrednotili na dveh podatkovnih množicah, eni iz kliničnega okolja in drugi zbrani med rutinskimi dnevnimi aktivnostmi posameznikov. V poizkusu smo model vsakič naučili na vseh osebah razen eni in ga nato testirali na izpuščeni osebi. Z uporabo klinične podatkovne množice smo v omenjenem poizkusu dosegli najnižji povprečni absolutni napaki (MAE) 4.47 mmHg za sistolični in 2.02 mmHg za diastolični krvni tlak, pri največji stopnji personalizacije. Za množico, zbrano med dnevnimi aktivnostmi, smo dosegli najnižji napaki 8.57 mmHg za sistolični in 4.42 mmHg za diastolični krvni tlak, ponovno pri največji stopnji personalizacije. Najbolje se je obnesel ansambel regresijskih dreves. 1 Introduction World Health Organization (WHO) listed cardiovascular diseases as the most common cause of death in 2015, re- sponsible for almost 15 million deaths combined [1]. Hy- pertension is one of the most common precursors of such diseases and can be easily detected with regular blood pres- sure (BP) monitoring, which is especially critical for pa- tients already suffering from hypertension or related cardi- ovascular diseases, as it can indicate potential vital threats to their health. While regular BP monitoring is important, it is also trou- blesome, as devices using inflatable cuffs are still consi- dered the “golden standard”. The cuff placement is cri- tical, as the sensor must be located directly above the main artery in the upper arm area, at approximately heart height [4]. These requirements impose relatively strict mo- vement restrictions on the subject and require substantial time commitment, thus causing low subject adherence to regular monitoring. 
Furthermore, when done by the subject him/herself in a home environment, this process can cause stress, which in turn influences the BP values, making the measurements less reliable. This problem is usually not al- leviated by having the medical personnel perform the me- asurement, as this can again cause anxiety in the subject, commonly known as the “white coat syndrome”. 34 Informatica 42 (2018) 33–42 G. Slapničar et al. Our work focuses on photoplethysmogram (PPG) analy- sis and the development of a robust non-obtrusive method for continuous BP estimation. It will be implemented and used in an m-health system based on a wristband with an embedded PPG sensor. This will allow the user to wear the device without any interference or limits imposed upon their daily routine, allowing for truly continuous measuring without stressing the user and thus potentially influencing the BP values. The rest of the paper is organized as follows. Section 2 gives a brief overview of the related work. Section 3 explains the methodology we have used, focusing on sig- nal pre-processing and machine learning features. Section 4 elaborates on the experimental setup and results, and Section 5 concludes with a summary and plans for future work. 2 Related work Photoplethysmography is a relatively simple technique ba- sed on inexpensive technology, which is becoming incre- asingly popular in wearable devices for heart rate estima- tion. It is based on the illumination of the skin and mea- surement of changes in its light absorption [5]. In its basic form it only requires a light source to illuminate the skin (typically a light-emitting diode – LED light) and a photo- detector (photodiode) to measure the amount of light either transmitted through, or reflected from the skin. Thus PPG can be measured in either transmission or reflectance mode. Both modes of operation are shown in Figure 1. Figure 1: Transmission and reflectance mode in which the PPG signal can be obtained. LED is the light source while PD is the photodetector [6]. With each cardiac cycle, the heart pumps blood towards the periphery of the body. This produces a periodic change in the amount of light that is either absorbed or reflected from the skin to the photodetector, as the tissue changes its tone based on the amount of blood in it. Exploring the recent applications of PPG, we can see that it is becoming more widely used in BP estimation. One of two common approaches are typically used: 1. BP estimation using two sensors (PPG + Electrocardi- ogram (ECG)) 2. BP estimation using the PPG sensor only The first approach requires the use of two sensors, typi- cally an ECG and a PPG sensor, in order to measure the time it takes for a single heart pulse to travel from the he- art to a peripheral point in the body. This time is com- monly known as pulse transit time (PTT) or pulse arrival time (PAT), and its correlation with BP changes is well es- tablished. The more recent approach is focused on the PPG signal only; however, the relationship between the PPG and BP is only postulated and not as well established as the relations- hip between the PTT and BP. This approach is, however, notably less obtrusive, especially since PPG sensors have recently become very common in most modern wristbands. BP is commonly measured in millimeters of mercury (mmHg), which is a manometric unit routinely used in me- dicine and many other scientific fields. 
A mercury mano- meter is a curved tube containing mercury, which is closed at one end while pressure is applied on the other end. 1 mmHg of pressure means that the pressure is large enough to increase the height of the mercury in the tube for 1 mm. To put the values discussed in this paper into perspective, the normal healthy adult BP is considered to be around 120 mmHg (16 kPa) for systolic and 80 mmHg (11 kPa) for diastolic BP [2]. One of the earliest PPG-only attempts was conducted by Teng et. al. in 2003 [3]. The relationship between the arte- rial BP (ABP) and certain features of the PPG signals was analyzed. Data were obtained from 15 young healthy sub- jects in a highly controlled laboratory environment, ensu- ring constant temperature, no movement and silence. The mean differences between the linear regression estimations and the measured BP were 0.21 mmHg for systolic (SBP) and 0.02 mmHg for diastolic BP (DBP). The corresponding standard deviations were 7.32 mmHg for SBP and 4.39 mmHg. Using mean errors instead of mean absolute errors as the evaluation metric is questionable, since it does not reflect the actual performance of the derived model and the error can be extremely low, even if the actual predictions are high above and under the actual observed BP values. A paper was published in 2013 in which the authors used data from the Multiparameter Intelligent Monitoring in In- tensive Care (MIMIC) waveform database [7, 8] to extract 21 time domain features and use them as an input vector for artificial neural networks (ANNs) [9]. The results are not quite as good as with the linear regression model descri- bed earlier; however, the data was obtained from a higher number and variety of patients in a less controlled environ- ment. Mean absolute errors of less than 5 mmHg for both SBP and DBP were reported. While the environment was less controlled compared to the previous work, the patients were still within a hospital setting and hospital equipment was used for data collection. Furthermore, only an undis- closed subset of all the available data from MIMIC was used. Another research was conducted in 2013 in which the authors used a smartphone camera to capture the PPG sig- nal using the camera flash as the light source and the phone camera as the photodiode [10]. PPG features were again extracted and fed to a neural network, which estimated SBP Continuous Blood Pressure Estimation from. . . Informatica 42 (2018) 33–42 35 and DBP. All the data processing and BP evaluation was done in a cloud in order to reduce the computational burden on the device. It is not clear how many subjects participa- ted in the experiment, however, they reported the maximum error not exceeding 12 mmHg. The error metric is not ex- plained in detail, however, based on the given results table, we can presume that MAE was used. Such a method re- quires some user effort, as the user must place and hold his finger over the camera and LED light. This prevents any other activities during this time. It is clear that the PPG-only approach has potential, ho- wever, a robust unobtrusive method that works well on a general case is yet to be developed. 3 Methodology The proposed system consists of two main modules, na- mely the signal pre-processing and machine learning mo- dule. The former is responsible for cleaning the PPG signal of most noise and then segmenting it into cycles, where one PPG cycle corresponds to a single heart beat. 
The latter ex- tracts features describing the PPG signal on a per-cycle ba- sis, selects a subset of relevant features using the RReliefF algorithm [12], and finally feeds the subset into regression algorithms, which build the prediction models. 3.1 Signal pre-processing PPG sensors must be very sensitive in order to detect tiny variations in light absorption of the tissue. This also ma- kes them highly susceptible to movement artefacts. This problem is especially obvious when dealing with PPG col- lected via a wristband, as the contact between the sensor and the skin can be compromised during arm movements. This is partially alleviated by using green light, which is less prone to artefacts, however, major artefacts often re- main in the signal. Subsequently, substantial effort is di- rected towards PPG pre-processing. 3.1.1 Cleaning based on established medical criteria In the first phase, both the BP and PPG signal are roughly cleaned based on established medical criteria [13]. A 5- second sliding window is used to detect segments with ex- treme BP values or extreme changes of the BP in a short time period. Thresholds for extreme values and changes are selected based on established medical criteria in related work [13] and are given in Table 1. Some thresholds were slightly modified, since the criteria given in the referenced paper seem too strict for some subjects encountered in our datasets. We have thus loosened the criteria in accordance with empirical observations in our datasets (e.g., the ori- ginal criteria excludes all data with SBP > 180, while we observed some segments with SBP over 180 mmHg). After the cleaning of the clinical dataset, 85% of data is kept on average, while 15% is discarded. This is very subject dependent, as for some subjects nearly all the data Criterion Threshold SBP > 250 or < 80 DBP > 150 or < 40 SBP – DBP < 20 ∆SBP or ∆DBP in 5 sec > 50 Table 1: Established medical criteria and thresholds for rough signal cleaning. ∆ signifies a change in BP value. 5-second segments meeting any of these criteria are remo- ved from the signal. is removed (e.g., sensor anomaly which shows 0 ABP al- most all the time), while for majority of subject most of the data is kept. For everyday-life dataset, which contains a lot more noise, only 40% of data is kept, while 60% is discar- ded. This is the result of some subjects having long noisy segments of the PPG signal. It should be noted, that these percentages are also subject of the parameters for trade-off between the required quality and the amount of signal kept, which are discussed in 3.1.3. 3.1.2 Peak and cycle detection In order to do further cleaning and subsequent feature ex- traction, PPG cycle detection is mandatory. This is not tri- vial, as substantial noise in the PPG signal poses a signifi- cant problem, as mentioned earlier. This problem was tackled in several steps. First, a filte- ring transformation, which enhances the systolic upslopes of the pulses in the PPG signal, is used. It is designed to use the derivative of the PPG signal at lower frequencies, in order to detect the abrupt upslopes of the systolic pulse compared to the diastolic or dicrotic pulse in the PPG sig- nal. This is based on a low-pass differentiator (LPD) filter, which removes high frequency components and performs differentiation. Once the steepest points in the PPG signal are located, the following peak is chosen as the PPG sys- tolic peak. 
Afterwards, a time-varying threshold for peak detection is applied, which ensures that potential double peaks or diastolic peaks close to the systolic ones are not chosen. The procedure is explained in detail in a paper by Lzaro et al. [14]. After the peaks are detected, finding the cycle start-end points is simpler, as the dominant valleys between the de- tected peaks must be found. An example of detected peaks and cycle locations using the described method is shown in Figure 2. 3.1.3 Cleaning based on ideal templates After cycles are successfully detected, the second cleaning phase begins. A 30-second sliding window is used. First, the most likely length of a cycle L in the current window is determined using autocorrelation analysis. A copy of the PPG signal in the current window is taken and 36 Informatica 42 (2018) 33–42 G. Slapničar et al. Figure 2: The upper subplot shows a PPG segment. Lower subplot shows the LPD filtering transformation of the same PPG segment. Peaks of the transformation in the lower subplot correspond to the steepest systolic upslopes of the PPG in the upper subplot, and are denoted as nA. Actual detected PPG peaks are denoted as n∗A. shifted sample by sample up to a certain length that con- tains at least two heart beats. When the copy is shifted by the number of samples corresponding to exactly one cycle, the autocorrelation reaches its first peak, and this number of samples is chosen as L. Presuming that the majority of cycles within a 30-second window are not morphologically altered, we can create an “ideal cycle template” for this window. Such a template is created by always taking the next L samples from each cy- cle starting point and then computing the mean cycle. Each individual cycle is then compared to the computed template and its quality is evaluated using three signal quality indi- ces (SQIs), which are defined as follows [15]: 1. SQI1: First L samples of each cycle are taken and then each cycle is directly compared to the template using a correlation coefficient. 2. SQI2: Each cycle is interpolated to length L and then the correlation coefficient with the template is compu- ted. 3. SQI3: The distance between each cycle and the template is computed using dynamic time warping (DTW). Finally the thresholds for each SQI are empirically deter- mined. Each cycles’ SQIs are evaluated and if they reach the required quality threshold, that cycle is kept, otherwise it is removed. If more than half the cycles in the current 30-second window are under the thresholds, the whole win- dow is discarded as too noisy. An example of this cleaning is shown in Figure 3. Once the PPG signal is cleaned and only high-quality cycles with minimal morphological anomalies remain, fea- tures can be extracted from each cycle. 3.2 Machine learning In order to derive the relationship between the PPG and BP, features describing the PPG signal were computed and then the relevant subset of these features was selected to be used in the regression algorithms. 3.2.1 Features In accordance with the related work [3, 9, 10], several time- domain features were computed from the PPG signal, and the set of features was further expanded with some from the frequency [13] and complexity-analysis domains. Most fe- atures focus on describing the morphology of a given PPG cycle, as shown in Figure 4. 
Feature      Description
Tc           Cycle duration
Ts           Time from start of cycle to systolic peak
Td           Time from systolic peak to end of cycle
Tnt          Time from systolic peak to diastolic rise
Ttn          Time from diastolic rise to end of cycle
S1           Area under the curve (AUC) from start of cycle to max upslope point
S2           AUC from max upslope point to systolic peak
S3           AUC from systolic peak to diastolic rise
S4           AUC from diastolic rise to end of cycle
AUC syst     S1 + S2
AAC syst     Area above the curve (AAC) from start of cycle to systolic peak
AUC diast    S3 + S4
AAC diast    AAC from systolic peak to end of cycle

Table 2: Elaborations of some of the used features shown in Figure 4.

Figure 3: Example of the cleaning algorithm in the second phase of the signal pre-processing. Comparing the top (uncleaned) and bottom (cleaned) PPG signal, we see that the obvious artefact segments are removed.

In addition to the features focusing on the PPG cycle morphology, which were highlighted thus far, the following features were computed and considered:

1. AI – Augmentation Index: a measure of wave reflection on arteries. AI = (diastolic rise amplitude) / (systolic peak amplitude).

2. LASI – Large Artery Stiffness Index: an indicator of arterial stiffness, which is denoted as Tnt in Figure 4 and Table 2.

3. Complexity analysis: signal complexity and mobility are computed for the 30-second PPG segment containing the current cycle. Mobility represents an estimate of the mean frequency and is proportional to the standard deviation of the power spectrum. Complexity gives an estimate of change in frequency by comparing the signal similarity to a pure sine wave. They are given by Najarian and Splinter [11] as follows (presuming a zero-mean signal):

S0 = √(Σi=1..N xi² / N),  S1 = √(Σj=2..N−1 dj² / (N − 1)),  S2 = √(Σk=3..N−2 gk² / (N − 2)),

where x is the PPG signal, d is the first-order derivative of x and g is the second-order derivative of x. Then

Mobility = √(var(d) / var(x)) = S1 / S0,
Complexity = mobility(d) / mobility(x) = √(S2²/S1² − S1²/S0²).

(A small illustrative sketch of these two features is given after this list.)

4. FFT features: amplitudes and phases of the frequency-domain representation of the 30-second PPG segment containing the current cycle. The length of the window was chosen such that it contains enough cycles (expected 1 cycle per second) for the frequencies in the segment to be reliably determined.
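As a small illustration of the complexity-analysis features in item 3, the following NumPy sketch computes mobility and complexity from the S0/S1/S2 expressions above, under the stated zero-mean assumption. The function name and the use of finite differences for the derivatives are our choices, not taken from the paper.

```python
import numpy as np

def hjorth_mobility_complexity(x):
    # Signal mobility and complexity for a (zero-mean) PPG segment x,
    # following the S0/S1/S2 definitions quoted above: S0, S1, S2 are the
    # RMS values of the signal, its first difference and its second difference.
    x = np.asarray(x, dtype=float)
    d = np.diff(x)        # first-order derivative (finite differences)
    g = np.diff(d)        # second-order derivative
    s0 = np.sqrt(np.mean(x ** 2))
    s1 = np.sqrt(np.mean(d ** 2))
    s2 = np.sqrt(np.mean(g ** 2))
    mobility = s1 / s0
    complexity = np.sqrt(s2 ** 2 / s1 ** 2 - s1 ** 2 / s0 ** 2)
    return mobility, complexity
```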
Considering all the time and frequency-domain features along with the complexity-analysis features, and the amount of instances (cycles) available, we are often dealing with a very large matrix of training data. The number of rows (instances) is on the order of magnitude 10^5 and the number of columns (features) is on the order of magnitude 10^2, thus dimensionality reduction through selection of a subset of relevant features is feasible, but not mandatory. More importantly, feature selection allows us to determine which features are useful for the learning process, and which are irrelevant, allowing us to obtain a smaller subset containing only the relevant features.

3.2.2 Feature selection

The RReliefF algorithm was chosen for feature selection. It is a modification of the ReliefF algorithm, suitable for regression problems with continuous target variables. The algorithm was applied to a subset of 10% of all data chosen randomly. This was repeated 10 times. All the features with non-zero relevance, as chosen by the algorithm, were considered in each iteration and their importance was saved. Looking at the final scores of the algorithm across all the iterations, we notice that quite a few features are considered irrelevant, while the same features are commonly chosen as important for both SBP and DBP, as shown in Figure 5. Noting the fact that the same features were selected in each of the 10 iterations, we can assume that the relevant features are not dependent on the selected subset of the available data.

As mentioned, all the features with non-zero importance were taken, as more than half were discarded as irrelevant by the RReliefF algorithm. Among the non-zero importance features, some features from each of the groups mentioned earlier (temporal, frequency and complexity analysis) are present. Most area-based features were marked as irrelevant, while certain times (Tc, Ts and Td), both complexity-analysis features (signal complexity and signal mobility) as well as some frequency-domain features (amplitudes and phases at low frequencies) were marked as important. These non-zero importance features were then used in the regression algorithms.

The relevant features were determined using the larger and more varied dataset from the MIMIC database. The same subset of features was also used with the smaller everyday-life dataset. Both datasets are described in more detail in the following section. Since the feature selection procedure only slightly improved the results, we have not considered experiments with other or additional feature selection methods.

Figure 4: Time and area based features that describe the morphology of the PPG signal on a per-cycle basis. The features are listed and elaborated in Table 2.

4 Experiments and results

In an effort to make the proposed method as general as possible, two datasets were considered for the experimental evaluation. The data from all subjects that met the requirement of having both the PPG and BP signals recorded were always used in the experiments.
The recommended protocol was followed as best as possible, however, in an ideal si- tuation the ground truth BP should be measured as ABP within an artery. Due to the invasive nature of ABP measu- rement, this is not feasible in an everyday-life situation, so the digital cuff-based monitor was used as a good replace- ment. An upper-arm cuff-based monitor was chosen over a wrist-based one, as the latter is less accurate and extremely sensitive to body position. In the first completed phase of the data collection, 8 he- althy subjects were considered, 5 male and 3 female. Each subject wore the wristband PPG measuring device for se- veral hours during their everyday activities. They measu- red their BP every 30 minutes or more often. Finally, only parts of the PPG signal 3 minutes before and after each BP measurement point were taken into consideration, as the measured BP value is only relevant for a short time. Ideally, the BP would be measured more often, however, this would place further stress on the subjects and was not possible during their everyday routine. Furthermore, addi- tional physiological variations (e.g., breathing rate) could be obtained from the PPG and used for the BP estimation, however, this was not yet considered but might be a subject of future work. Continuous Blood Pressure Estimation from. . . Informatica 42 (2018) 33–42 39 Figure 5: The output of RReliefF algorithm, which shows the feature importance for each of the considered features. 4.2 Experimental setup Two experimental setups were considered, 5-fold cross va- lidation and LOSO. The purpose of the first was to establish initial observations about the selected features and perfor- mance of different regression algorithms. The second ex- periment was conducted to evaluate the generalization per- formance of the algorithms and subsequently determine po- tential requirement for personalization. 4.2.1 K-fold cross validation The MIMIC dataset consisted of roughly 200 000 instances post filtering, which correspond to 41 patients. The instan- ces were obtained by uniformly taking 20 3-minute seg- ments from the whole recording for a given patient. Each instance (cycle) in a given 3-minute segment was assigned the mean SBP and mean DBP of this segment. This simu- lates the patients measuring their BP periodically, but not more than once in 3 minutes. K-fold cross validation (k = 5) was conducted, where in- stances were first shuffled randomly and then all the data was split into nearly equal folds. Then k – 1 folds were taken for learning and the remaining fold was used for tes- ting. This was repeated k times. The random shuffling of instances makes it so that instances belonging to a given subject might appear in both training and testing sets. This was taken into account (a sort of implicit personalization), as this experiment was merely a starting point to determine the initial performance of the algorithms and was later com- plemented by a Leave-one-subject-out experiment. Several regression algorithms were compared using the full set of features. The algorithm that performed best using all the features was additionally evaluated using only the subset of best features as selected by the RReliefF algo- rithm. The predictive performance of these options in 5- fold cross validation is discussed in detail in the Results section. 
4.2.2 Leave-one-subject-out

Due to the increased computational complexity of a leave-one-subject-out experiment compared to k-fold cross-validation, the data was additionally sub-sampled by taking 500 uniformly selected cycles from each patient's data.

During the initial attempt, a regression model was trained in each iteration on all the subjects except the one left out. It was ensured that no instances from the testing subject appeared in the training set. This yielded poor results. Notable improvements can be made by using a small amount of each patient's data for training, most likely due to each patient having a subtly unique cardiovascular dynamic and relation between PPG and BP. This was additionally confirmed by cycle morphology analysis, during which it was established that similar cycle shapes do not necessarily signify similar BP values. Due to the mentioned factors, personalization of the trained models was considered in an attempt to improve the predictive performance of the models.

In the second attempt, the regression models were again trained using all the subjects except the one left out. This time, however, the models were further personalized by using some instances from the left-out subject. The instances of the left-out subject were grouped by their BP values. These groups were then sorted from lowest to highest BP. Afterwards, every n-th group (n = 2, 3, 4, 5, 6) of instances was taken from the testing data and used in training in order to personalize the model to the current patient. This ensures personalization with different BP values, as taking just a single group of instances gives little information, since the BP will be constant within this group. Given that the MIMIC data consists of roughly 5x the number of patients compared to the everyday-life data, the personalization data for it was multiplied 5 times, making it noticeable within the large amount of training data from the remaining patients.

During both attempts, several regression algorithms were once again considered, as given in Tables 3 and 4. The MAE was used as the evaluation metric. All models were compared with a dummy regressor, which always predicted the mean BP value of the same combination of general and personalization data as the other models used for training. Finally, the regressor with the lowest MAE was chosen as best.

For successful personalization, the user should measure their PPG continuously and also make a few periodic measurements of their BP using a reliable commercial device. This allows the model to personalize to the user, learning from a small sample of their labeled data, thus improving its predictive performance.

4.3 Results

Using the personalization approach, notable improvements have been made over the dummy regressor in both experiments. The results are discussed in detail in the following sections.

4.3.1 K-fold cross validation results

MAE with corresponding standard deviations in the 5-fold cross validation experiment for the MIMIC data are given in Table 3, while the results for the everyday-life data are given in Table 4.

Algorithm                    MAE_SBP [mmHg]    MAE_DBP [mmHg]
Dummy (predicts mean)        19.70 ± 16.07      8.73 ± 6.77
Linear regression            18.47 ± 15.91      8.14 ± 7.98
Ensemble (all feat.)          5.83 ±  7.74      2.92 ± 4.09
Ensemble (relevant feat.)     4.90 ±  6.59      2.21 ± 3.70

Table 3: MAE of different algorithms for SBP and DBP estimation in 5-fold cross validation using the MIMIC hospital dataset.

Algorithm                    MAE_SBP [mmHg]    MAE_DBP [mmHg]
Dummy (predicts mean)        11.46 ± 7.51       5.01 ± 3.99
Linear regression            11.21 ± 8.00       5.01 ± 8.00
Ensemble (all feat.)          9.12 ± 7.90       4.38 ± 3.74
Ensemble (relevant feat.)     7.87 ± 7.47       3.84 ± 3.63

Table 4: Mean absolute errors of different algorithms for SBP and DBP estimation in 5-fold cross validation using the JSI-collected everyday-life dataset.

The ensemble of shallow regression trees has shown the best predictive performance in the 5-fold cross validation for both SBP and DBP using both datasets. We also notice a slightly better performance when only the relevant features, as given by RReliefF, are used in comparison to the default feature set.

As the ensemble of regression trees has shown the best performance, its hyperparameters were optimized using Bayesian optimization. All the available hyperparameters were optimized using the MATLAB built-in Bayesian Optimization Workflow [16]. It optimizes both the hyperparameters of the ensemble as well as the hyperparameters of the weak learners, which are chosen to be shallow regression trees. The optimization is run for 30 iterations, trying to minimize the objective cross-validation loss function. Bootstrap aggregation was chosen as superior to the gradient boosting strategy, and the optimal number of weak learners was determined to be 77. The maximum number of splits in the weak learner was determined to be 1, meaning that the regression trees are in fact regression stumps.

4.3.2 Leave-one-subject-out results

[Figure 6: MAE for SBP and DBP for the MIMIC dataset, at different amounts of personalization.]

The lowest error using the MIMIC data was again achieved using the hyperparameter-tuned ensemble of regression trees with the RReliefF-selected subset of features. The highest amount of personalization (50%) gave the best results. 50% personalization corresponds to 10 BP measurements conducted by the subject, given that 20 segments with 20 different BP values were taken. Obtaining 10 BP measurements by the subject, in order to personalize the model, seems like a reasonable requirement.

[Figure 7: MAE for SBP and DBP for the everyday-life dataset, at different amounts of personalization.]

The JSI-collected everyday-life data has proven to be more problematic, as there were only a few different BP values recorded in the first phase of data collection. Furthermore, due to the high amount of movement artefacts, a lot of data was removed by the cleaning algorithm, leaving a very small amount of usable data with a very low variation in BP. This further enhanced the performance of the dummy regressor, which achieved a much lower MAE compared to the MIMIC dataset; however, improvements were again achieved by using personalization, as shown in Figure 7.

5 Conclusion

We have developed a system for BP estimation using only the PPG signal, and have evaluated its performance on two distinct datasets using two experimental setups. The first module of the system deals with signal pre-processing, removing most movement artefacts and anomalies from the PPG signal.
It then detects PPG cycles corresponding to heart beats and feeds them to the second module, which computes a number of features describing each cycle. This is followed by feature subset selection using the RReliefF algorithm, and finally the features are fed into several regression algorithms. Predictive models were created and evaluated on a hospital MIMIC dataset as well as an everyday-life dataset collected at JSI. The lowest MAEs achieved for the MIMIC hospital dataset in 5-fold cross validation were 4.90 ± 6.59 mmHg for SBP and 2.21 ± 3.70 mmHg for DBP. The best performing algorithm was an ensemble of shallow regression trees. Its hyperparameters were optimized using Bayesian optimization. Finally, the same models were evaluated on the same dataset using leave-one-subject-out validation, achieving the lowest MAEs of 4.47 ± 5.85 mmHg for SBP and 2.02 ± 2.94 mmHg for DBP, again using the same hyperparameter-tuned ensemble and the subset of features selected by the RReliefF algorithm. These results were achieved using the maximum, 50%, personalization. Similar trends can be observed for the everyday-life JSI-collected dataset. The lowest MAEs in 5-fold cross validation were 7.87 ± 7.47 mmHg for SBP and 3.84 ± 3.63 mmHg for DBP. The ensemble of shallow regression trees with optimized parameters prevailed again. In LOSO validation, the lowest MAEs of 8.57 ± 7.93 mmHg for SBP and 4.42 ± 3.61 mmHg for DBP were achieved.

5.1 Interpretation of results

Comparing the results of the 5-fold cross-validation to those of the LOSO evaluation, we first notice that the best performing algorithm is the same. In each fold of the 5-fold cross validation, 80% of the randomly shuffled instances were taken for training, which translates to 80% personalization for each subject. This is the reason behind the lower MAE in the 5-fold cross-validation; however, similar MAE was also achieved with higher amounts of personalization in the LOSO experiment. The developed system shows promising results and could be used by both regular people and hypertensive patients during their everyday routine, by wearing an unobtrusive wristband. It could inform them of their current medical condition regarding BP. Further testing with more field-collected data is required to more accurately determine its performance; however, it already achieves a low MAE when personalization is considered.

5.2 Future work

We plan to expand our data collection experiment at JSI, which will give us more data and more variety within the collected BP data. Once enough data is collected, we plan to upgrade the machine learning part of our pipeline using deep-learning algorithms. These are well suited for problems dealing with signal analysis and have represented the state-of-the-art approach in signal processing in recent years, making them a suitable candidate for our domain.

Acknowledgement

The HeartMan project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 689660. Project partners are Jožef Stefan Institute, Sapienza University, Ghent University, National Research Council, ATOS Spain SA, SenLab, KU Leuven, MEGA Electronics Ltd and European Heart Network.

References

[1] The World Health Organization. “The top 10 causes of death”, 2015.
[2] Mayo Foundation for Medical Education and Research (MFMER). “Blood pressure chart: What your reading means”. Accessed online: 2nd March, 2018.
[3] Teng et al.
“Continuous and noninvasive estima- tion of arterial blood pressure using a photoplet- hysmographic approach”, Proceedings of the 25th Annual International Conference of the IEEE Engi- neering in Medicine and Biology Society (IEEE Cat. No.03CH37439), 2003. [4] Frese et al. “Blood Pressure Measurement Guidelines for Physical Therapists”, Cardiopulmonary Physical Therapy Journal, 2011. [5] Shelley et al. “Pulse Oximeter Waveform: Photoelec- tric Plethysmography”, Clinical Monitoring: Practi- cal applications for anesthesia and critical care, 2001. [6] Tamura et al. “Wearable Photoplethysmographic Sen- sorsPast and Present”, Electronics, 2014. [7] Goldberger et al. “PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals”, Circulation, 2000. [8] Moody et al. “A Database to Support Development and Evaluation of Intelligent Intensive Care Monito- ring”, Computers in Cardiology, 1996. [9] Lamonaca et al. “A neural network-based method for continuous blood pressure estimation from a PPG sig- nal”, IEEE International Congress I2MTC, 2013. [10] Lamonaca et al. “Application of the Artificial Neural Network for blood pressure evaluation with smartp- hones”, 2013 IEEE 7th International Conference on Intelligent Data Acquisition and Advanced Compu- ting Systems (IDAACS), 2013. [11] Najarian et al. “Biomedical Signal and Image Proces- sing, 2nd Edition”, CRC Press, 2012. [12] Robnik-Šikonja et al. “Theoretical and Empirical Analysis of ReliefF and RReliefF”, Machine Lear- ning, 2003. [13] Xing et al. “Optical Blood Pressure Estimation with Photoplethysmography and FFT-Based Neural Net- works”, Biomedical Optics Express, 2016. [14] Lzaro et al. “Pulse Rate Variability Analysis for Dis- crimination of Sleep-Apnea-Related Decreases in the Amplitude Fluctuations of Pulse Photoplethysmo- graphic Signal in Children”, IEEE Journal of Biome- dical and Health Informatics, 2014. [15] Li et al. “Dynamic time warping and machine lear- ning for signal quality assessment of pulsatile sig- nals”, Physiological Measurement, 2012. [16] The MathWorks Inc., Natick, Massachusetts, Uni- ted States. “MATLAB 2017a Optimization Toolbox”, 2017. [17] Coleman et al. “Validation of the Omron M7 (HEM- 780-E) oscillometric blood pressure monitoring de- vice according to the British Hypertension Society protocol”, Blood Pressure Monitoring, 2008. Informatica 42 (2018) 43–52 43 Quantitative Score for Assessing the Quality of Feature Rankings Ivica Slavkov, Matej Petković, Dragi Kocev and Sašo Džeroski Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia E-mail: saso.dzeroski@ijs.si Keywords: feature ranking, feature selection, evaluation methodology, high-dimensional data Received: January 18, 2018 Feature ranking is a machine learning task that is related to estimating the relevance (importance) of indi- vidual features in a dataset. Relevance estimates can be used to induce an ordering of the features from a dataset, also called a feature ranking. In this paper, we consider the problem of the evaluation of dif- ferent feature rankings. For that purpose, we propose an intuitive evaluation method, based on iterative construction of feature sets and their evaluation by learning predictive models. By plotting the obtained predictive performance of the models, we obtain error curves for each feature ranking. 
We then propose a scoring function to quantitatively assess the quality of the feature ranking. To evaluate the proposed met- hod, we first define a synthetic setting in which we analyse the method and investigate its properties. By using the proposed method, we next perform an empirical comparison of several feature ranking methods on datasets from different domains. The results demonstrate that the proposed method is both appropriate and useful for comparing feature rankings of varying quality. Povzetek: Rangiranje značilk je naloga strojnega učenja, povezana z ocenjevanjem pomembnosti značilk v podatkih. Značilke lahko nato uredimo glede na dobljene ocene in tako dobimo ureditev, ki ji prav tako pravimo rangiranje značilk. V tem delu obravnavamo problem evalvacije različnih metod za urejanje značilk. Predlagamo postopek, ki temelji na iterativni konstrukciji množic značilk ter njihovi evalvaciji s pomočjo napovednih modelov. Če dobljene ocene natančnosti modelov narišemo na graf, dobimo krivulje natančnosti za vsako rangiranje značilk. Te krivulje s predlaganim postopkom pretvorimo v kazalec, ki poda kakovost rangiranja številsko. Metodo najprej evalviramo na sintetičnih podatkih, nato pa jo preiz- kusimo še na realnih podatkih iz različnih domen. Rezultati pokažejo, da je predlagana metoda uporabna za razločevanje rangiranj značilk različnih kvalitet. 1 Introduction In many application domains, such as bioinformatics and computer vision, supervised learning methods are beco- ming more frequently applied to high-dimensional pro- blems. In such problems, one typically expects only a re- latively small proportion of all input features to be relevant for predicting the output. Also, all relevant features are not equally important. In many practical applications, the pro- blem of discovering the relevant features and/or qualitati- vely assessing their relative importance can be the main ob- jective of the application of machine learning techniques, even taking precedence over the need to obtain the best pos- sible predictive model. In bioinformatics, for example, the main objective of the analysis of microarray datasets is to identify genes whose expression is, individually or jointly, indicative of some biological state of interest (e.g., a dise- ase), with the goal of improving the understanding of this biological state. There are two machine learning tasks related to the ana- lysis of feature relevance, namely feature selection (FS) and feature ranking (FR) [9]. The purpose of feature se- lection is typically to solve the so-called minimal-optimal problem [15], i.e., to find a minimal subset of features that best explain the output [8]. Feature ranking, on the other hand, solves the so-called all-relevant problem [15], i.e., providing an ordered list of the features from the most to the least important according to a given notion of relevance. Feature ranking methods range from univariate techniques, that assess the relevance of each feature independently of the others (e.g., using mutual information or p-values deri- ved from some statistical test), to multivariate techniques, that derive more complex feature importance scores taking into account potential interactions among the features (e.g., ReliefF [18] or Random forests [2]). These two problems of feature selection and feature ranking are linked: A solu- tion to the feature selection problem can be found by setting a cut-off point on a feature ranking. 
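As a purely illustrative aside, the link between the two tasks can be made concrete with a few lines of code: a univariate feature ranking (here via mutual information, one of the examples mentioned above) turned into a feature selection by placing a cut-off point. The array names X, y and the cut-off k are assumptions, not taken from the paper.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y):
    """Return feature indices ordered from most to least relevant."""
    relevance = mutual_info_classif(X, y)      # one relevance score per feature
    return np.argsort(relevance)[::-1]         # descending order defines the ranking

def select_by_cutoff(ranking, k):
    """Turn a ranking into a feature selection by cutting it at position k."""
    return ranking[:k]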
The present paper focuses on feature ranking, and more specifically addresses the challenging problem of the eva- luation of the output of feature ranking algorithms. Fea- ture selection as stated above is a well-defined optimiza- tion problem and as a consequence, the output of two dif- ferent feature selection methods can be directly compared according to the predictive performance of a model trained from the selected features and/or according to to the size 44 Informatica 42 (2018) 43–52 I. Slavkov et al. of the selected subsets. The problem of feature ranking, on the other hand, can not be as easily formulated as an op- timisation problem, mainly because there is no commonly accepted notion of feature relevance. Actually, feature ran- king methods typically correspond to different definitions of relevance or assumptions of dependence (e.g., univa- riate versus multivariate, linear versus non-linear). As a consequence, when run on the same dataset, different met- hods will typically provide different rankings of the featu- res. Determining the best ranking among several ones for a problem at hand is thus a practically very relevant ques- tion. For specific problems, this question can be addressed by using domain specific knowledge. In the general case, however, this is an unsolved problem that we would like to address in this paper. The remainder of this paper is organized as follows. We start by discussing related work in Section 2. We then propose a new algorithmic procedure for evaluating fea- ture rankings that does not require any prior knowledge and can thus be applied on real problems. Following previous works, this method is based on the evaluation of the pre- dictive performance of models trained from nested feature subsets derived from the rankings (described in Section 3). More precisely, two error curves are constructed: the for- ward feature addition curve (FFA) and the reverse feature addition curve (RFA). They depict the performance of mo- dels built on nested feature subsets obtained by taking fea- tures from either the top or the bottom of the ranking. Next, we propose a score based on the differences between the FFA and RFA curves as a way to compare different feature ranking methods. We investigate the performance of the proposed method on a wide range of datasets. We start by experiments on the synthetic datasets (Section 4) and pro- ceed with a description of its use on real-world benchmark datasets (Section 5). Section 6 concludes with a summary of our contributions and an outline of possible directions for further work. 2 Related work The evaluation of feature ranking methods has been typi- cally performed on artificial problems, where the relevant and irrelevant features are known by construction. In such a setting, feature ranking algorithms can be evaluated ba- sed on their capability to delineate relevant from irrelevant features. This capability can be measured, for example, through a ROC curve showing the trade-off achieved by the algorithm between assigning high ranks to relevant fe- atures and low ranks to irrelevant ones [11]. In the context of the ReliefF algorithm [18] the concepts of separability and usability are defined to evaluate feature rankings. Se- parability measures how well the algorithm separates the relevant from the irrelevant features by the difference bet- ween the lowest estimated relevance of the relevant features and the highest relevance of the irrelevant features. 
Usabi- lity, on the other hand is defined as the difference between the highest estimated relevance of the relevant features and the highest estimated relevance of the irrelevant features. When a ground truth ranking of the features is known (and not only which features are relevant/irrelevant), finer mea- sures can be used to compare a learnt feature ranking to the ground truth, such as the Spearman’s rank correlation. Evaluating feature ranking methods on artificial pro- blems is very useful to assess a newly proposed ranking algorithm or to highlight overall differences between met- hods. In practice however, the best method is expected to be problem dependent. This stresses the need for methods to quantitatively assess feature ranking methods in real- world scenarios, where it is not known a-priory which fe- atures are relevant and which are irrelevant. In such set- tings, the performance of feature ranking algorithms has been mostly evaluated from the point of view of their pre- dictive performance associated to feature subsets derived from the rankings. A way to assess feature rankings is to estimate the pre- dictive performance obtained by using subsets of feature derived from these rankings. For example, for a given num- ber of features k, a ranking A could be considered better than a ranking B if a model trained from the top-k features of rankingA is more accurate than a model trained from the top-k features of ranking B. Variations of this evaluation procedure are given in [9, 7, 16, 21] where the predictive models are compared for different numbers of top-k featu- res. 3 Evaluation method for feature rankings In general, the purpose of feature ranking algorithms is to solve the all-relevant feature selection problem [15]. Ho- wever, besides delineating relevant from irrelevant features, a feature ranking algorithm should also provide a proper ordering of features according to their relevance to some target concept. An ideal feature ranking algorithm should produce the ground truth ranking. In reality however, the ranking methods provide only an approximation of it. A good feature ranking method would produce a ranking that is well ordered. This means that the more relevant fe- atures would have a higher rank, i.e., all of them are con- centrated at the beginning of the feature ranking. In con- trast, a bad feature ranking method is not necessarily the one that produces an inverse ground truth ranking. Namely, we consider as a worst-case scenario if the feature ranking produces a random ranking. In this case the relevant featu- res are uniformly distributed in the ranked list. Estimating and comparing this distribution of relevant features across a ranking is the intuition on which we base our evaluation approach. Quantitative Score for Assessing the Quality of. . . Informatica 42 (2018) 43–52 45 3.1 Evaluation method definition Formally, we would like to evaluate a feature ranking al- gorithm r(·). The input to the algorithm is a dataset D, consisting of a set of n input features F and a target Ft, while the output is a feature ranking R = r(D), i.e., a list whose i-th component gives us the rank of i-th feature. For an arbitrary feature subset S ⊆ F , we can assess if it contains relevant features by constructing and evaluating predictive modelsM(S, Ft). We evaluate them, obtain the value of error measure err(M(S, Ft)), and decide whether the set S contains important features or not. 
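The evaluation step just described, estimating err(M(S, F_t)) for an arbitrary feature subset S, is the basic building block of the method. A minimal sketch in Python is given below; the choice of classifier (an SVM with a quadratic kernel, as used later in the paper's experiments) and of classification error as the error measure are assumptions for illustration.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def subset_error(X, y, subset, cv=10):
    """Estimate err(M(S, F_t)) for a feature subset S given as a list of column indices."""
    model = SVC(kernel="poly", degree=2)             # polynomial (quadratic) kernel
    accuracy = cross_val_score(model, X[:, subset], y, cv=cv).mean()
    return 1.0 - accuracy                            # error = 1 - cross-validated accuracy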
The error estimates should provide insight into the correctness of the feature ranking and constitute an evaluation thereof; thus we construct feature subsets of two types. The sets of the first type, denoted by S_i, contain the top i ranked features, 1 ≤ i ≤ n. The sets of the second type, denoted by S^i, contain the bottom i features. Note that in the special case where i = n, i.e., the number of features, we have S_n = S^n.

For each constructed feature subset S, S = S_i or S = S^i, we build predictive models M(S, F_t) and evaluate their prediction errors. In that way, we obtain two curves: the forward feature addition (FFA) curve consists of the points (i, FFA(i)) = (i, M(S_i, F_t)) (see Fig. 1a), while the reverse feature addition (RFA) curve consists of the points (i, RFA(i)) = (i, M(S^i, F_t)) (see Fig. 1b).

In practical scenarios, if the number of features n is high, running the algorithm might be costly. One simple way to speed it up would be to avoid forming all the feature subsets, and instead add ∆i > 1 features to the set S_i to obtain S_{i+∆i}. The rationale behind this is that in real-world scenarios involving high-dimensional data, only a small portion of the features are relevant. Therefore, the values of FFA(i) would not change much when adding more features to a relatively large number of features i. Also, the number of features added can be dependent on i. In the same manner, we can form the set S^{i+∆i} from the set S^i.

3.2 Quantitative comparison of two rankings

A visual inspection of the curves can only provide a qualitative intuition about which ranking method is better. For quantification purposes, it would be necessary to have a function which provides a cumulative assessment of the differences between the error estimates at different points of the curves. In the most general case, this would be an aggregation function agg : R^n → R, which would take a sequence of the weighted point-wise differences between two curves as its argument. For the FFA curve, we would have

\mathrm{FFA}_\delta(r_A, r_B) = \operatorname{agg}_{i}\, w_i \bigl(\mathrm{FFA}_{r_A}(i) - \mathrm{FFA}_{r_B}(i)\bigr), \quad (1)

while for the RFA curve we would have

\mathrm{RFA}_\delta(r_A, r_B) = \operatorname{agg}_{i}\, w_i \bigl(\mathrm{RFA}_{r_A}(i) - \mathrm{RFA}_{r_B}(i)\bigr). \quad (2)

[Figure 1: Comparison of different ranking methods r_A and r_B: (a) comparison of FFA curves, (b) comparison of RFA curves; accuracy is plotted against the position in the ranking.]

There are several sensible choices for instantiations of the aggregation function agg. The choice depends on the specific task at hand. Considering that we are comparing feature rankings, two aspects are important. The first is the position of most of the relevant features in the ranking. The second relates to the position of the "most" relevant features. In a comparative sense, the first aspect relates to the position of the FFA/RFA curve differences, while the second relates to the magnitude of these differences. Differences between the FFA/RFA curves of two ranking methods at the beginning of the curves are more important than differences at the end of the curves. Namely, if two FFA curves are different at the beginning, this means that one of the ranking methods is not putting the most relevant features at the top of the ranking.
Correspondingly, for the RFA curves, differences at the beginning of the curve (at the bottom of the ranking) mean that one of the feature ranking methods is giving low ranks to features which are relevant. The second aspect is related more to the magnitude of the differences between the FFA/RFA curves than to their position. The intuition is that if "more" relevant features are misranked, then this is worse than "fewer" relevant features being misranked.

From a technical perspective, in order to emphasise the importance of position, the weighting function from Eqs. 1 and 2 should be a function of the position i, namely w_i = f(i). In the same manner, in order to emphasise the importance of magnitude, the weighting function should depend on the size of the difference, namely w_i = f(δ_i), with δ_i the difference between the two compared curves at i. In addition, it is also possible to construct a weighting function that takes into account both position and magnitude, w_i = f(i, δ_i). To this end, we define four instantiations of Eq. 1 and Eq. 2, which we use to calculate the difference between the FFA/RFA curves from Fig. 1. We consider the following weighting functions:
– w_i = 1, equal weight for all differences;
– w_i = f(i) = 1/|S_i|, weight inverse to the feature subset size;
– w_i = f(δ_i) = |δ_i|, weight proportional to the difference magnitude;
– w_i = f(i, δ_i) = |δ_i|/|S_i|, weight which includes both position and magnitude.

The aggregation function used for summarising the differences (in all of the four instantiations) is the weighted average:

\operatorname{agg}_{i}\, w_i \delta_i = \frac{\sum_{i=1}^{n} w_i \delta_i}{\sum_{i=1}^{n} w_i}. \quad (3)

The obtained values are given in Table 1. They are calculated for the FFA/RFA examples in Fig. 1a and Fig. 1b. The difference is calculated for r_A with respect to r_B. As seen in Table 1, the values for the FFA curves are positive, which can be interpreted as "r_A is better than r_B". The values for the RFA curves are negative, but the interpretation is the same: "ranking method r_A is better than ranking method r_B".

In order to obtain a single number that quantifies the difference between two feature ranking algorithms, we can combine both values into a single value by calculating the so-called error curve average (ECA):

\mathrm{ECA}_\delta(r_A, r_B) = \frac{\mathrm{FFA}_\delta(r_A, r_B) - \mathrm{RFA}_\delta(r_A, r_B)}{2}. \quad (4)

Note that the minus sign in the equation is due to the inverse interpretation of negative values for the RFA curve. Namely, if r_A is better than r_B, then the differences of the RFA curves should be negative. This places the overall interpretation of the ECA_δ value on the positive scale. Namely, if r_A is better than r_B, then the overall score should be positive.

w_i:        1        1/|S_i|    |δ_i|     |δ_i|/|S_i|
FFA_δ     0.018      0.019      0.032      0.03
RFA_δ    -0.042     -0.054     -0.08      -0.077

Table 1: Different quantitative comparisons of error curves
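To make Eqs. 1-4 concrete, the sketch below builds the FFA and RFA curves of a ranking from nested feature subsets and aggregates the point-wise differences of two rankings into the ECA score. Subsets are scored here by cross-validated accuracy (the quantity plotted in Fig. 1), which keeps the sign convention just described: a positive ECA means ranking A is better than ranking B. The classifier choice and all names are assumptions for illustration, not the authors' implementation.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def curve_point(X, y, feature_subset):
    model = SVC(kernel="poly", degree=2)       # quadratic kernel, as in the paper's experiments
    return cross_val_score(model, X[:, feature_subset], y, cv=10).mean()

def ffa_rfa_curves(X, y, ranking):
    """ranking: list of feature indices ordered from most to least relevant."""
    n = len(ranking)
    ffa = np.array([curve_point(X, y, ranking[:i]) for i in range(1, n + 1)])      # top-i sets S_i
    rfa = np.array([curve_point(X, y, ranking[n - i:]) for i in range(1, n + 1)])  # bottom-i sets S^i
    return ffa, rfa

def weighted_difference(curve_a, curve_b, weighting="uniform"):
    """Eqs. 1-3: weighted average of the point-wise curve differences."""
    delta = curve_a - curve_b
    size = np.arange(1, len(delta) + 1)        # |S_i| = i
    w = {"uniform": np.ones_like(delta),       # w_i = 1
         "position": 1.0 / size,               # w_i = 1/|S_i|
         "magnitude": np.abs(delta),           # w_i = |delta_i|
         "both": np.abs(delta) / size}[weighting]
    return float(np.sum(w * delta) / np.sum(w))

def eca(curves_a, curves_b, weighting="uniform"):
    """Eq. 4: ECA_delta(r_A, r_B) from the (FFA, RFA) curve pairs of the two rankings."""
    ffa_d = weighted_difference(curves_a[0], curves_b[0], weighting)
    rfa_d = weighted_difference(curves_a[1], curves_b[1], weighting)
    return (ffa_d - rfa_d) / 2.0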
At the point i, the expected value of the FFA/RFA curve, which belong to the uniformly random ranking RRND, pro- duced by the algorithm rR, is dependent solely on the i and properties of the dataset under consideration. Moreover, it is the same for both the FFA and the RFA curve. For sim- plicity reasons, we refer to these curves as expected curves. The expected value of the error measure err, is the average of the error estimations of all possible subsets S ⊆ F , whose cardinality equals i, i.e., E[err(M(S, Ft))] = 1( n i ) ∑ S⊆F |S|=i err(M(S, Ft)) (5) Calculating the expected curves by following Eq. 5 to the letter is intractable, especially for high-dimensional spaces, as we have to consider an exponentially high number of feature subsets. However, for practical purposes, this pro- blem can be circumvented by sampling the space of possi- ble feature subsets for each i. Suppose we have somehow calculated or approximated the expected FFA/RFA curve. If we have a ranking algo- rithm r that produces a good (mostly correct) ranking, its FFA curve would be above the expected FFA curve. For the RFA curve, the opposite would apply and the algorithm’s curve would be below the expected RFA curve. The score ECA δ(r, rR) between the FFA/RFA curves of this ranking versus the expected curves can thus be used as an abso- lute quantitative measure of the quality of this ranking. It should be noted that when calculating ECA δ(r, rR) by using wi = 1, it is not necessary to compute the ex- pected curve in order to calculate thisECAδ score. Indeed, ECA δ can be simply computed as the sum over all positi- ons of the difference between the FFA and RFA curves we want to evaluate: ECA δ(r, rR) = (FFA δ(r, rR)− RFA δ(r, rR)) 2 = 1 2 ( n∑ i=1 FFAr(i)− RFAr(i) n ) , since FFArR(i) = RFArR(i). 4 Evaluation on synthetic data The goal of the experiments presented in this section is to demonstrate the usefulness of our feature ranking evalu- ation method. As previously mentioned, feature ranking Quantitative Score for Assessing the Quality of. . . Informatica 42 (2018) 43–52 47 methods provide an approximation of the ground truth ran- king that can be viewed as a noisy ground truth. A noisier ranking is more distant from the ground truth ranking and therefore of worse quality. An evaluation method should be sensitive to the amount of noise and should provide a corresponding quality esti- mate of the feature ranking. For that purpose, we design experiments to demonstrate that our evaluation method is sensitive to the addition of noise to the ground truth ran- king. We first generate noisy feature rankings and then construct FFA/RFA curves from them. 4.1 Generating synthetic data We first perform an empirical evaluation of the proposed notion of FFA/RFA curves in a controlled setting by using synthetic datasets. The main advantage of using synthetic data is the possibility of defining a good baseline ranking that allows us to assess our proposed feature ranking eva- luation method. The complete statistics of the generated datasets and their feature interaction sets are summarized in Table 2. All of the datasets consisted of 1000 instances and 100 features in total. Among the 100 features, the “single” dataset has 9 relevant features, the “pair” dataset contains 18 relevant features and the “combined” dataset contains 27 relevant features. In all three datasets, every set Fint of relevant fea- tures has two additional redundant copies. Irrelevant featu- res are realized independently of each other. 
More details on the generation of the datasets are available in [20].

For each dataset, we would like to define a good baseline ranking against which to compare feature ranking methods. We define this ranking from feature relevance scores rel(F_i, F_t) for each feature F_i, calculated directly from the specified feature interaction structure, by using the following equation:

\mathrm{rel}(F_i, F_t) = \frac{I(F_{int}; F_t)}{|F_{int}|},

where F_int is the (unique) interaction set that contains F_i and I(F_int; F_t) is the mutual information between the features in F_int and the target F_t. By dividing the mutual information by the number of features, we distribute the information equally between all features in an interaction set. As a consequence, features that bring information about the target F_t individually are considered more relevant than features that bring the same amount of information about the target only in conjunction with other features.

Note that this baseline ranking is not guaranteed to be optimal in terms of the FFA and RFA curves for a given learning algorithm, but is nevertheless expected to be close to optimal. In our experiments, we will consider this ranking as a ground truth ranking, denoted R_GT, against which we will compare other rankings.

(a) "single" dataset
 n    |F_int|   f(F_int)         P
 3       1      F_i             0.8
 3       1      F_i             0.7
 3       1      F_i             0.6
91       1      F_i             0.5

(b) "pair" dataset
 n    |F_int|   f(F_int)         P
 3       2      XOR(F_i, F_j)   0.8
 3       2      XOR(F_i, F_j)   0.7
 3       2      XOR(F_i, F_j)   0.6
82       1      F_i             0.5

(c) "combined" dataset
 n    |F_int|   f(F_int)         P
 3       2      XOR(F_i, F_j)   0.8
 3       2      XOR(F_i, F_j)   0.7
 3       2      XOR(F_i, F_j)   0.6
 3       1      F_i             0.8
 3       1      F_i             0.7
 3       1      F_i             0.6
73       1      F_i             0.5

Table 2: Synthetic dataset statistics: the feature interaction sets (F_int) contained in each dataset; the interaction function for the feature sets (f(F_int)); the values P(f(F_int) = F_t) are denoted by P. The value of n in the last row of each table corresponds to the number of irrelevant features in a dataset. In the other rows, n denotes the number of copies of each interaction set, which are identically defined but independently realized (and differ in the random component).

4.2 Adding noise to the ground truth ranking

The noise is introduced into the ranking by randomly selecting a proportion θ of the features. For these features, random relevance values are assigned, while the remaining features preserve their ground truth relevance. By considering these partially changed relevance values, a new noisy feature ranking, R_θ, is defined.

As the random relevance values can be distributed differently throughout the ranking, different FFA/RFA curves can be constructed for the same amount of noise. We estimate the expected error values by sampling the space of possible FFA/RFA curves for a given θ. First, we generate N different noisy feature rankings and then construct N FFA/RFA curves based on them. The expected values of the FFA/RFA curve are estimated by averaging the N individual curves, namely

E[\mathrm{FFA}]_\theta = \frac{1}{N} \sum_{i=1}^{N} \mathrm{FFA}_{\theta,i}, \qquad E[\mathrm{RFA}]_\theta = \frac{1}{N} \sum_{i=1}^{N} \mathrm{RFA}_{\theta,i},

for a specified N and θ.

For estimating the error values of the FFA/RFA curves, SVMs with a polynomial (quadratic) kernel were used and a 10-fold cross validation was performed on the dataset under consideration. The epsilon parameter of the SVMs was set to 1.0E-12, while the complexity parameter was set to 0.1. For our experiments, we consider several different amounts of noise θ, namely 5%, 10%, 15%, 20%, 30% and 50%, as well as the completely random ranking (100% noise).
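The following sketch illustrates the noise-injection procedure just described and the averaging of N noisy curves into the expected curves E[FFA]_θ and E[RFA]_θ. It assumes a vector of ground-truth relevance scores and reuses the FFA/RFA construction sketched in Section 3.2; the random relevance values drawn in [0, 1) and all names are our own assumptions.

import numpy as np

def noisy_ranking(gt_relevance, theta, rng):
    relevance = np.array(gt_relevance, dtype=float)
    n = len(relevance)
    noisy = rng.choice(n, size=int(round(theta * n)), replace=False)  # randomly selected features
    relevance[noisy] = rng.random(len(noisy))                         # assign random relevance values
    return np.argsort(relevance)[::-1]                                # the noisy ranking R_theta

def expected_curves(X, y, gt_relevance, theta, curves_fn, n_rankings=100, seed=0):
    """Average the FFA/RFA curves of n_rankings noisy rankings for a given theta.
    curves_fn(X, y, ranking) is the FFA/RFA construction from Section 3.2."""
    rng = np.random.default_rng(seed)
    ffa_sum = rfa_sum = 0.0
    for _ in range(n_rankings):
        ffa, rfa = curves_fn(X, y, noisy_ranking(gt_relevance, theta, rng))
        ffa_sum, rfa_sum = ffa_sum + ffa, rfa_sum + rfa
    return ffa_sum / n_rankings, rfa_sum / n_rankings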
Each noisy error curve was produced by summarizing the errors of 100 noisy rankings produced for a given θ. We additionally constructed error curves based on the ground truth ranking. The experiments were performed on the three synthetic datasets described in Section 4.1, each with its corresponding ground truth ranking R_GT.

4.3 Results on synthetic data

The results of our experiments are first plotted as graphs containing error curves. In Fig. 2, we only show the curves obtained on the "combined" dataset. These curves are representative of the curves obtained on the other datasets. The FFA/RFA curves plotted on each graph are for rankings with different noise levels θ, as well as for the ground truth R_GT and random rankings. In both Figs. 2a and 2b, the FFA and the RFA curves seem to be sensitive to the addition of noise. To begin with, the FFA/RFA curves of all the noisy rankings are located between the ground truth ranking FFA/RFA curve and the random ranking FFA/RFA curve. As noise is added to the ground truth ranking, the FFA/RFA estimates slowly move away from the ground truth FFA/RFA curve and towards the random ranking FFA/RFA curve.

Next, for performing quantitative analysis of the feature rankings, we begin by summarising the differences of the noisy rankings' error curves w.r.t. the ground truth error curve. Additionally, some kind of baseline is required for comparison. As the ground truth ranking is known, the distance between the ground truth ranking and the noisy rankings can serve as a baseline.

For summarising the differences between the noisy rankings' FFA/RFA curves we use the ECA difference, calculated by using Eq. 4 from Section 3.2. For comparative purposes, when calculating the ECA differences, we use the different weighting functions as discussed in Section 3.2. For calculating the baseline values, i.e., the distance between the ground truth ranking R_GT and the noisy rankings R_θ,i, we use the average Spearman rank correlation coefficient ρ between the vectors R_GT and R_θ,i. The i-th component of such a vector gives the rank of the i-th feature in the dataset. The distance between rankings is then computed as

\mathrm{dist}(R_{GT}, R_\theta) = 1 - \bar{\rho}_{GT,\theta} = 1 - \frac{1}{N} \sum_{i=1}^{N} \rho(R_{GT}, R_{\theta,i}),

where N is the number of different noisy rankings considered for a given θ.

We obtain the results for all three synthetic datasets. Since there are no major differences among them, we show summarised results only for the "combined" dataset. Table 3 contains values calculated with respect to the ground truth ranking. The first row of the table refers to the distance dist(R_GT, R_θ). The other rows are the ECA differences between the FFA/RFA curves of the GT ranking and the FFA/RFA curves of the noisy rankings. Each row containing the ECA differences refers to a different weighting function. All columns, except the last one, refer to different levels of noise θ.
The final column gives the correlation between dist(R_GT, R_θ) (row one) and the FFA/RFA curve distances (rows 2 to 5), across the different noise levels θ. The final column thus gives an indication of how well the ECA differences relate to the distance between the ground truth ranking and the noisy rankings. As can be seen, the curve distances correlate very well with the rank distances, regardless of which weighting function is used.

[Figure 2: Plots comparing the FFA (a, left) and RFA (b, right) curves for the "combined" dataset. Each panel contains error curves (accuracy against the position in the feature ranking) for the ground truth ranking, rankings with noise levels θ of 5%, 10%, 15%, 20%, 30% and 50%, and the random ranking.]

            θ = 0.05   θ = 0.1   θ = 0.15   θ = 0.2   θ = 0.3   θ = 0.5   θ = 1    corr.
dist          0.1       0.171     0.252      0.32      0.432     0.652     1.048
w = 1         0.009     0.02      0.027      0.037     0.061     0.117     0.223    0.992
w = 1/r       0.018     0.042     0.047      0.064     0.084     0.132     0.178    0.991
w = |δ|       0.029     0.061     0.070      0.09      0.115     0.174     0.263    0.998
w = |δ|/r     0.044     0.091     0.095      0.121     0.142     0.199     0.254    0.982

Table 3: Comparison of different ECA values obtained by different weighting functions w. The ECA values are compared with the distance between the noisy rankings R_θ and the GT ranking R_GT. The final column, "corr.", is the value of the correlation coefficient calculated between the ranking distance (first row) and each of the ECA difference rows.

From this quantitative analysis, it can be concluded that the ECA difference derived from the error curves has the same sensitivity to noise as the actual distance between the ground truth and the noisy rankings. This implies that our method can be used in practical scenarios not just to qualitatively distinguish between different rankings, but also to quantify the difference between them. As for the specific weights used for calculating the ECA differences, it can be concluded that any of the considered weighting schemes can be used to properly compare the error curves.

5 Evaluation on real data

Thus far, our analysis only involved artificially generated problems. In this section, we want to illustrate the use of our feature ranking evaluation method on datasets originating from various real-life domains. The purpose of the experiments is to examine the quality of the feature rankings produced by several feature ranking methods on data with different characteristics. The analysis is primarily a comparative one, performed solely by calculating the numeric scores derived from the FFA and RFA error curves.
The datasets we consider are quite diverse, with unknown interaction structure and therefore unknown ground truth ranking. However, for each dataset, it is possible to generate the expected error curves of random rankings. These expected curves are used as a baseline for comparing the different feature ranking methods.

5.1 Datasets description

For our experiments, 28 diverse classification datasets with a single target class were selected. Most of them originate from the UCI data repository [14] and are from various domains. Of the remaining 3 datasets, one is from a medical study of acute abdominal pain in children (aapc) [4], while the remaining two ("water" and "diversity") are from an ecological study of river water quality [5]. Besides covering different domains (including biology, medicine, ecology, etc.), these datasets have a wide range of different properties, including the number/type of features and the number of instances. The main characteristics of each dataset are summarised in Table 4.

Dataset          #Inst.   #Feat.   #Cl.
aapc               335       84      3
amlPrognosis        54    12625      2
arrhythmia         452      279     16
australian         690       14      2
bladderCancer       40     5724      3
breast-cancer      286        9      2
breast-w           699        9      2
breastCancer        24    12625      2
car               1728        6      4
chess             3196       36      2
childhoodAll       110     8280      2
cmlTreatment        28    12625      2
colon               62     2000      2
diversity          292       86      5
dlbcl               77     7070      2
german            1000       20      2
heart              270       13      2
heart-c            303       13      2
heart-h            294       13      2
ionosphere         351       34      2
leukemia            72     5147      2
mll                 72    12533      3
prostate           102    12533      2
sonar              208       60      2
srbct               83     2308      4
tic-tac-toe        958        9      2
water              292       80      5
waveform          5000       21      3

Table 4: Statistics for the benchmark datasets

5.2 Experimental setup

Four feature ranking methods were applied to each dataset:
– Information gain, calculating the information gain of each feature F_i as IG(F_t, F_i) = H(F_t) − H(F_t|F_i). This does not require any specific parameter setting.
– SVM-RFE, the recursive feature elimination (RFE) procedure that employs an SVM to evaluate the feature weights at each iteration. A linear SVM was employed [9], with the epsilon parameter set to 1.0E-12 and the complexity parameter set to 0.1.
– The ReliefF algorithm, as proposed in [18]. The number of neighbours was set to 10 and all of the instances were used for estimating the relevance values.
This gives an indication of how much each feature ranking method is better than a random ranking generator, but also allows for comparison between the quality of the feature rankings of the different methods. A positive value of an ECA difference indicates that a feature ranking method performs better than the random ranking generator. The negative values, however, do not necessarily indicate that it performs worse than random, but that it provides a non-random ranking that is inverse to the correct one. A value close to zero means the feature ranking method provides rankings that are more (or less) random. An intial inspection of the results in Table 5 reveals that random forests often have negative ECA values. The FFA and RFA curves of random forests, for these particular da- tasets, are below/over the expected FFA/RFA curves of random rankings. Upon closer inspection of their feature rankings (results not shown here due to space limitations) we find that they are inverse to those of the other feature ranking methods. In order to summarise the results from Table 5 and to draw meaningful conclusions about the performance of the different ranking methods across the different datasets, we use statistical tests. We adopt the recommendations of Demšar [3] and use the Friedman [6] test for statistical significance with the correction by Iman [10]. If the null hypothesis H0 that all ranking methods perform equally well, can be rejected, we use the Nemenyi post-hoc test [13] and additionally check between which feature ranking methods the statistically significant differences appear. The level of significance p = 0.05 was used. When comparing the four feature ranking methods, sta- tistically significant differences occur. We present the re- sults with a critical distance diagram [3] in Fig. 3. In the 1 2 3 4 ReliefF IG RF SVM-RFE crit ical distance: 0.8864 Figure 3: Critical distance diagrams representing the sta- tistical comparison of the ECA differences of three ranking methods on the 28 datasets. The critical distance is calcula- ted for a p value of 0.05 and is represented by a horizontal line. If the feature ranking methods are connected by a line, then their performance is not statistically significantly different. diagram, the feature ranking methods are ordered accor- ding to which one is better on average (across all datasets). A method is better if it is positioned closer to the value one on the axis. It can be observed that ReliefF, Info Gain and SVM-RFE significantly outperform Random Forests, while not differing significantly among each other. 
Dataset          IG       RF       ReliefF   SVM-RFE
aapc            0.269    0.299     0.316     0.297
amlPrognosis    0.056    0.007     0.027     0.043
arrhythmia      0.041    0.041     0.057     0.053
australian      0.277    0.260     0.266     0.209
bladderCancer   0.125    0.059     0.167     0.161
breast-cancer   0.025    0.013     0.012    −0.003
breast-w        0.246    0.206     0.190     0.194
breastCancer    0.050    0.037     0.128     0.110
car             0.085   −0.081     0.079     0.066
chess           0.279   −0.056     0.283     0.248
childhoodAll    0.083    0.040     0.033     0.154
cmlTreatment    0.028   −0.009    −0.026     0.004
colon           0.099    0.049     0.163     0.116
diversity       0.167    0.192     0.215     0.149
dlbcl           0.032    0.008     0.067     0.086
german          0.023   −0.002     0.013     0.022
heart           0.159    0.039     0.150     0.130
heart-c         0.178    0.057     0.163     0.163
heart-h         0.146    0.058     0.110     0.147
ionosphere      0.116    0.088     0.041     0.136
leukemia        0.140    0.056     0.175     0.164
mll             0.118    0.045     0.355     0.281
prostate        0.212    0.067     0.236     0.232
sonar           0.066    0.060     0.096     0.070
srbct           0.142    0.084     0.292     0.261
tic-tac-toe     0.072   −0.052     0.082     0.069
water           0.193    0.181     0.217     0.144
waveform        0.180   −0.190     0.188     0.210

Table 5: ECA differences between the FFA/RFA curves of four feature ranking methods w.r.t. the curves of a random ranking. The missing values are due to SVM-RFE's inability to handle multi-valued discrete/nominal attributes. Boldfaced values are the largest ECA differences in each row.

6 Conclusions

In this paper, we focus on the problem of evaluating the output of feature ranking algorithms. We define and formalize an intuitive evaluation method for quantitative comparison of feature rankings. The method is based on iterative construction and evaluation of predictive models, resulting in so-called error curves: the forward feature addition (FFA) curve, starting from the top of a feature ranking, and the reverse feature addition (RFA) curve, starting from the bottom of a ranking. From these two curves, we calculate the error curve average (ECA) difference, which we propose as a numerical indicator for comparing different feature rankings.

We first test our method in a controlled environment on synthetic data. We compare feature rankings with different amounts of added noise, starting from the known ground truth ranking and ending with completely random rankings. By comparing the different ECA values obtained for the different noise levels, we show that our method is sensitive to changes in the quality of the feature ranking.

In order to demonstrate the practical application of our evaluation method, we consider a collection of classification datasets from various domains with different properties. We compare the performance of four feature ranking methods across these different datasets and evaluate their outputs by using our proposed method. The analysis of the comparative evaluation shows that the best algorithm is often domain dependent and that simple approaches such as info gain can often be used to produce a proper feature ranking.

Several directions of work can be taken to further develop the proposed evaluation methodology. The first is to directly use the feature relevance values produced by the ranking algorithm when inducing predictive models. This can be easily done in feature-weighted classifiers, such as weighted kNN. The second concerns feature ranking stability, another important aspect of the feature ranking process. Although we have not considered it explicitly in this work, we would like to include it in the feature ranking evaluation process, in a manner similar to that of [19].
Also, as structured data [1] are becoming increasingly common, we would like to adapt and investigate our method for diffe- rent types of structured targets. To this end, we need to use a feature ranking method for structured targets and couple it with a predictive model for structured outputs [12, 17]. Acknowledgements IS would like to gratefully acknowledge the financial sup- port of The Ad Futura Slovene Human Resources Develop- ment and Scholarship Fund. SD, DK, and MP have been supported by the Slovenian Research Agency through the program P2-0103, the project L2-7509, and a young rese- archer grant, respectively. The work has also been suppor- ted by the European Commission through the H2020 grant number 720270 (HBP SGA1). References [1] Gökhan H. Bakır, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan, editors. Predicting Structu- red Data. The MIT Press, Cambridge, Massachusetts, 2007. [2] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001. [3] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006. [4] Saso Džeroski, George Potamias, Vassilis Moustakis, and Giorgos Charissis. Automated revision of expert rules for treating acute abdominal pain in children. In Artificial intelligence in medecine - AIME, LNCS 1211, pages 98–109, 1997. [5] Sašo Džeroski, Damjan Demšar, and Jasna Grbović. Predicting chemical parameters of river water quality from bioindicator data. Applied Intelligence, 13:7–17, 2000. [6] Milton Friedman. A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11:86–92, 1940. [7] Cesare Furlanello, Maria Serafini, Stefano Merler, and Giuseppe Jurman. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics, 4:54, 2003. [8] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003. [9] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classifi- cation using support vector machines. Machine Lear- ning, 46:389–422, March 2002. [10] Ronald Iman and James Davenport. Approximati- ons of the critical region of the Friedman statistic. Communications in Statistics - Theory and Methods, 9:571–595, 1980. [11] Kees Jong, Jérémie Mary, Antoine Cornuéjols, Elena Marchiori, and Michèle Sebag. Ensemble feature ran- king. In PKDD - LNCS 2302, pages 267–278, 2004. [12] Dragi Kocev, Ivica Slavkov, and Sašo Džeroski. More is better: ranking with multiple targets for biomarker discovery. In Proc. Second International Workshop on Machine Learning in Systems Biology, page 133, University of Liege, Belgium, 2008. [13] Peter Bjorn Nemenyi. Distribution-free Multiple Comparisons. PhD thesis, Princeton University, Prin- ceton, NY, USA, 1963. 52 Informatica 42 (2018) 43–52 I. Slavkov et al. [14] C.L. Blake D.J. Newman and C.J. Merz. UCI repository of machine learning databases, 1998. https://archive.ics.uci.edu/ml/datasets.html. Accessed on: 2015-12-13. [15] Roland Nilsson, José M. Peña, Johan Björkegren, and Jesper Tegnér. Consistent feature selection for pattern recognition in polynomial time. Journal of Machine Learning Research, 8:589–612, December 2007. [16] Silvano Paoli, Giuseppe Jurman, Davide Albanese, Stefano Merler, and Cesare Furlanello. 
Semisupervi- sed profiling of gene expressions and clinical data. In Proc. Sixth International Conference on Fuzzy Logic and Applications, pages 284–289, 2005. [17] Matej Petković, Sašo Džeroski, and Dragi Kocev. Fe- ature ranking for multi-target regression with tree en- semble methods. In Discovery Science, pages 171– 185, 2017. [18] Marko Robnik-Šikonja and Igor Kononenko. Theore- tical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53:23–69, 2003. [19] Yvan Saeys, Thomas Abeel, and Yves Van de Peer. Robust feature selection using ensemble feature se- lection techniques. In ECML/PKDD, LNCS 5212, pa- ges 313–325, 2008. [20] Ivica Slavkov. An Evaluation Method for Feature Rankings. PhD thesis, IPS Jožef Stefan, Ljubljana, Slovenia, 2012. [21] Antanas Verikas, Adas Gelzinis, and Marija Bacaus- kiene. Mining data with random forests: A survey and results of new tests. Pattern Recognition, 44:330– 349, 2011. Informatica 42 (2018) 53–59 53 Arguments in Interactive Machine Learning Martin Možina Faculty of Computer and Information Science, University of Ljubljana, Slovenia E-mail: martin.mozina@fri.uni-lj.si Keywords: argumentation, interactive machine learning, argument-based machine learning Received: November 7, 2017 In most applications of machine learning, domain experts provide domain specific knowledge. From pre- vious experience it is known that domain experts are unable to provide all relevant knowledge in advance, but need to see some results of machine learning first. Interactive machine learning, where experts and ma- chine learning algorithm improve the model in turns, seems to solve this problem. In this position paper, we propose to use arguments in interaction between machine learning and experts. Since using and under- standing arguments is a practical skill that humans learn in everyday life, we believe that arguments will help experts to better understand the models, facilitate easier elicitation of new knowledge from experts, and can be intuitively integrated in machine learning. We describe an argument-based dialogue, which is based on a series of steps such as questions and arguments, that can help obtain from a domain expert exactly that knowledge which is missing in the current model. Povzetek: V strojnem učenju je pridobivanje domenskega znanja pogosto prvi korak, ključen za defini- cijo učnih primerov, njihovih opisov in cilja učenja. Težava je, da eksperti večinoma niso sposobni dobro izraziti svojega znanja. Lažje je, če jim najprej pokažemo preliminarne, čeprav napačne rezultate strojnega učenja, saj eksperti tako lažje uvidijo, katero domensko znanje strojno učenje potrebuje. Postopek, kjer strojno učenje in ekspert izmenjaje izboljšujeta naučeni model, se imenuje interaktivno strojno učenje. V tem članku predlagamo uporabo argumentov v komunikaciji med računalnikom in ekspertom. Ljudje se argumentiranja naučimo zgodaj in ga veliko uporabljamo. Če bi računalniki znali svoje znanje predstaviti s pomočjo argumentov ter znali upoštevati človeške argumente pri svojem učenju, bi to vodilo do lažje komunikacije in posledično do bolj točnih in bolj razumljivih računalniških modelov. V članku pokažemo, kako vključiti argumentacijo v strojno učenje in opišemo ključna vprašanja ter odgovore v dialogu med strojnim učenjem in ekspertom, ki vodijo do tistega domenskega znanja, ki naučenemu modelu manjka. 1 Introduction Domain experts are often involved in the development of a machine learning application. 
They help define the ma- chine learning problem, provide learning examples, labels, and attributes of these examples. In some cases, they are even able to provide prior knowledge that is then incorpo- rated into machine learning algorithms, which often results in more accurate and comprehensible models. Acquiring domain knowledge is therefore one of the key tasks in machine learning, unfortunately a very difficult one, known also as the Feigenbaum knowledge acquisi- tion bottleneck [4]. Domingos [2] identified several rea- sons why combining machine learning and expert know- ledge often fails and how it should be approached. One of the reasons is that the results of machine learning are ra- rely optimal on the first attempt. An iterative improvement, where experts and computer improve the model in turns is needed. Furthermore, some knowledge is hard to make ex- plicit. It turns out that humans are much better at explaining particular cases than eliciting general knowledge. There are more and more machine learning studies using iterative improvements. Fails et al. [3] used the term inte- ractive machine learning to describe an iterative system for correcting errors of an image segmentation system. Since then, researchers have presented many advantages of sy- stems that allow users to interact with machine learning. Beside having better final performance, such as accuracy score, these works report that users also gain trust and un- derstanding of their systems. A particularly interesting one was introduced by Stumpf [18], where a user can comment on automatically generated explanations provided by a le- arned model. These comments are then used as constraints in the system when relearning the model. Kulesza [9] called such an interaction explanatory debugging, because users identify “bugs” in a system by inspecting explanati- ons and then explain necessary corrections back to the sy- stem. We propose a similar approach that targets domain ex- perts instead of end users. Explanatory debugging aims at building flexible applications, which can easily conform to the preferences of a user. In a spam filtering application, for example, an explanation might include the words that contributed to the prediction. When a user disagrees with the prediction, she can select some of these words and mark them as not being indicative of spam. The system must then 54 Informatica 42 (2018) 53–59 M. Možina reduce the influence of these words in the future. In our case, we focus on enabling the domain experts to elicit their knowledge in the development of a machine learning application. Our approach is less constrained, be- cause experts can use general arguments to explain and to provide feedback back to machine learning. It seems that argument is the right tool for this problem, as humans have a lot of experience with arguments. We are using them every day to convince, negotiate, express and explain our opinions. An argument in its simplest form is expressed as a set of premises that support a conclusion. In most cases, the link between premises and conclusion is not deductive, but pre- sumptive. Consider, for example, the following argument: Premise 1: Raising taxes increases government revenues. Premise 2: Government needs money. Conclusion: Government should raise taxes. This argument is plausible, because raising taxes can in- crease revenues. It is also possible that it does not, if the taxes are already too high. 
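As a small illustration of this structure, the argument above can be written down as a set of premises attached to a conclusion whose support is presumptive rather than deductive. The following Python sketch is purely illustrative and is not part of any system described in this paper.

# An illustrative representation of the argument above: premises that
# presumptively (not deductively) support a conclusion.
from dataclasses import dataclass
from typing import List

@dataclass
class Argument:
    premises: List[str]       # reasons put forward in support of the claim
    conclusion: str           # the claim being supported
    presumptive: bool = True  # the link may be defeated by counter-arguments

tax_argument = Argument(
    premises=["Raising taxes increases government revenues.",
              "Government needs money."],
    conclusion="Government should raise taxes.")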
However, when such an argu- ment is put forward, involved parties understand that the conclusion might not be correct. If domain experts used ar- guments to express their knowledge, it would not be abso- lutely correct, however they would be able to express their domain knowledge more easily and in a natural way. In this position paper we do not present any machine le- arning algorithms, experiments or results. Instead, we mo- tivate the use of arguments in interactions between dom- ain experts and machine learning. The motivation is ba- sed on the following two reasons: a) humans are already well practiced in argumentation and b) with some changes machine learning algorithms can communicate using argu- ments. In this paper, we focus on the second reason, since human argumentation is already well covered in the litera- ture [19]. We identify what modifications of machine lear- ning algorithms are needed to enable the use of arguments and which questions should we ask the domain experts to receive the most relevant information. The main contribution of this work are instructions how to enable a general machine learning algorithm to use ar- guments. This includes presenting explanations in terms of arguments and the definition of the constraint that argu- ments impose on learning. In the previous work [14] we presented an actual implementation of learning rules from arguments. In this paper, this idea is generalized. Another contribution is a description of the refinement loop (a list of steps) for obtaining the most relevant knowledge from the domain expert. We have already presented several versions of this loop in our previous publications (see [14, 8]). Here, we unify these versions and provide more detailed explana- tions of the steps with practical examples. Finally, this pa- per motivates the use of arguments in interactive machine learning and supports this motivation with arguments. 2 Explaining classifications and arguments Explaining decision or actions of intelligent systems to end users has many benefits [11]. It can positively affect the systems use, enable better understanding of the system and result in making people trust it. Some machine learning models have the inherent capa- bility of generating explanations, such as decision trees or classification rules [5]. Similarly, additive models, such as naı̈ve Bayes or logistic regression, can use weights given to features to provide explanations of their decisions [15]. However, most of the contemporary machine learning research focuses on optimizing some abstract evaluation measure, such as classification accuracy or root mean squa- red error, and does not consider explanations at all. There have been some attempts to explain the decisions of of such methods. For example, Štrumbelj and Kononenko [17] suggested an algorithm for generating explanations of in- comprehensible methods. They evaluate prediction impor- tance of each feature by computing the difference between the classifier’s prediction of an example and the prediction of the same example when this feature is omitted. This dif- ference is then used in the explanation of the classifier’s decision for this particular example. We shall now define the relation between explanations in machine learning and arguments. We mentioned in the introduction that an argument contains a set of premises to support a conclusion. 
Since classification is the conclusion of the machine learning system, and the explanation con- tains the main reasons for this conclusion, it seems that an explained classification is already an argument. Explanations of classifications rarely contain one argu- ment only. Usually, an explanation provides reasons for and against the predicted class. For example, in the case of classification rules, we can present all rules covering the classifying example. Similarly, in nomograms [15] or in the general feature-based explanation framework [17], in- fluences of features can be either positive or negative. Sho- wing reasons for and against predicted class is beneficial to a domain expert, since it shows all relevant information that the underlying system used to infer a decision, which increases expert’s understanding of the system [9]. An explanation thus contains arguments for and argu- ments against the predicted class value, without explaining the actual details of the algorithm for inferring the final de- cision. Knowing positive and negative factors is often suffi- cient for human understanding, as it is similar to how argu- ments are used in a human conversation. In a dialogue bet- ween two persons, arguments supporting one side are often challenged with the opposing arguments. It is not rare that the same set of arguments will lead to different conclusi- ons of the participants in the dialogue, because they have different viewpoints and employ different internal reaso- ning mechanisms. Yet, knowing the opposing arguments is still beneficial, because they increase our understanding of the opposing viewpoints and therefore deepen our under- Arguments in Interactive Machine Learning Informatica 42 (2018) 53–59 55 standing of the issue. By analogy, it is more important for experts to understand which factors, both positive and ne- gative, influenced the machine learning decision, and less how it was derived. 3 Argument-based machine learning Argument-based machine learning (ABML) is a special case of learning from data and prior knowledge, where prior knowledge is represented with arguments [14]. A spe- cific property of arguments is that they relate to a single example only and are not general as prior knowledge usu- ally is. Several reviews and comparisons of different appli- cations of prior knowledge are available, see for example [6, 12, 20]. The problem with domain knowledge occurs when ex- perts are asked to provide general knowledge. Consider, for example, asking a physician to write down general ru- les for diagnosing pneumonia. A very difficult task. On the other hand, this physician can easily diagnose a cer- tain patient and explain why he has pneumonia. For this reason we suggest using arguments to elicit and represent background knowledge. While asking experts to provide general background knowledge can be a difficult task, as- king them to articulate their knowledge through arguments has proved to be much more efficient [2, 8]. In ABML, arguments are used to enhance learning ex- amples. Each argument is attached to a single learning example, while one example can have several arguments. There are two types of arguments: positive arguments are used to explain (or argue) why a certain learning example is in the class as given, and negative arguments are used to explain why it should not be in the class as given. Examples with attached arguments are called argumented examples. 
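A minimal sketch of how such argumented examples could be represented is given below. The attribute names and values are illustrative only (loosely based on the pneumonia example discussed later in the text), and this is not the data format of any particular ABML implementation.

# Illustrative representation of an argumented example: a single learning
# example with attached positive and negative arguments, each argument
# being a list of reasons for or against the given class value.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ArgumentedExample:
    attributes: Dict[str, object]                  # attribute-value description
    label: str                                     # the class value as given
    positive_arguments: List[List[str]] = field(default_factory=list)
    negative_arguments: List[List[str]] = field(default_factory=list)

# Hypothetical patient, argued to be in the class "pneumonia" because of
# high temperature (a positive argument attached only to this example).
patient = ArgumentedExample(
    attributes={"sex": "male", "coughing": True, "temperature": "high"},
    label="pneumonia",
    positive_arguments=[["temperature = high"]])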
An ABML method needs to induce a model that will ex- plain the classification of an example using the arguments provided by the expert. An ABML method, therefore, needs to be able to explain its decisions with arguments for and against, as we described in the previous section. More- over, it needs to be able to accept input arguments and use these arguments in explanation, which allows the expert to immediately see the impact. The reasons from the posi- tive arguments should become a part of the arguments for the class value in the explanation, and the reasons from the negative arguments should be mentioned within the against arguments in the explanation. Such explanation is therefore more comprehensible from the expert’s perspective, since it uses the same terms as the expert [8]. For example, a diagnostic machine learning system might argue that a patient probably has pneumonia, be- cause he is a male and he is coughing. A medical expert could then counter argue that this person has pneumonia, because he has high temperature. Then, ABML should in- duce a new model for automatic diagnosis, which would state high temperature (among others) as the reason for this particular patient with pneumonia. Such instance-based constraints are different from how constraints are usually implemented in machine learning, because they relate to one example only. The system does not need to mention temperature in explanations of other examples, in fact, it could even mention low temperature in explanations of other patients with pneumonia, and that would still not violate the constraint. Arguments are presumptive by nature and that is the main reason why arguments can not be applied generally, but to specific examples only. When a medical doctor ex- plains a diagnosis of a patient, his or her argument contains many unstated premises that seemed unimportant at a time or were simply forgotten. Maybe fever is typical only for a certain type of pneumonia or only for a certain part of the population. To implement an argument-based variant of a machine learning algorithm, one needs to take care that the argu- ments from experts are mentioned in explanations. This is easier achieved with models that are a composition of se- veral parts. For example, an argument-based random forest could simply select only trees that are consistent with argu- ments. In our research group we implemented the ABCN21 algorithm [14], an extension of the CN2 algortihm [1], which learns classification rules from argumented exam- ples. The main difference between the original CN2 and ABCN2 algorithms is in the definition of the covering re- lation. In the standard definition, a rule covers an example if the condition part is true for this example. In ABCN2, a rule covers an argumented example if the condition part is true and rule is consistent with positive arguments and not consistent with negative arguments. 4 A dialogue between a domain expert and a knowledge engineer Although interactive machine learning assumes that end users (domain experts) directly interact with machine le- arning algorithms, we shall assume that a knowledge engi- neer acts as an intermediate between a domain expert and the algorithm, since some of the suggested steps in this section would be difficult to implement automatically. Having a machine learning algorithm that can generate arguments and can accept expert’s arguments, we will now define how a domain expert and a knowledge engineer can interact with arguments. 
We propose a series of moves that defines a dialogue between a domain expert and a know- ledge engineer. A dialogue is a goal-directed conversa- tion between two parties, in which parties are taking turns. In each turn, a participant makes a move that responds to the previous move. In this information-seeking dialogue, a knowledge engineer is trying to elicit relevant information from a domain expert by selecting relevant examples, using explanations of these examples, and asking the right ques- tions. In our previous previous applications of ABML, we 1The latest version of ABCN2 can be found at https://github. com/martinmozina/orange3-abml. 56 Informatica 42 (2018) 53–59 M. Možina called this the ABML refinement loop [8]. In this paper, we unify several versions of this loop and present it in the context of the ideas from the previous two sections. Furt- hermore, the descriptions of the steps cover many different situations where the dialogue might lead us to. Given that arguments always relate to a single example, an engineer and an expert talk about one example at a time. As it is unlikely that experts will have time to discuss all learning examples, selecting relevant examples is impor- tant. We call these examples critical examples. A discus- sion about a single critical example has the following seven steps. Step 1: Selecting a critical example Critical examples are those learning examples that would have a considerable positive influence on the quality of the model if some arguments were provided. Initially, we took misclassified examples with the highest predictive error as critical examples [8]. However, in our recent experiments, we discovered that it is better to select prototypical mis- classified examples, as examples with the highest error are more likely to be outliers and are therefore hard to explain. There are various algorithms available for obtaining pro- totypical examples, one option is to use clustering and take centers of these clusters [16]. It should be noted that this procedure misses a whole group of potentially critical examples: examples that are correctly classified, however the model produces incorrect or unacceptable explanations. Until now we have not yet found a good criteria for selecting such examples. Step 2: Presenting the critical example to the expert In this step, a critical example with explanation from the machine learning model is presented to the domain expert. As critical examples are misclassified by the current model, the current explanation is likely wrong. Then, the domain expert is asked the following question: ”Why is this ex- ample in the class as given?” The answer to this question should not contain the reasons that are mentioned in the current machine’s exlanation. Example. In one of the first applications of ABML, the goal was to distinguish between a good and a bad bishop in a chess position [13]. The learning data contained chess positions with one bishop only. Instances were described with attributes that are typically used in chess evaluation functions and each instance was classified as bad bishop or good bishop. One of the descriptive attributes was mobi- lity, which counted the number of possible moves for the bishop. The algorithm initially learned that good bishops have high mobility. The first critical example was a po- sition with a good bishop, which was blocked by a knight and was therefore not able to move (had low mobility). 
The expert was thus asked: ”Why is the black bishop in this po- sition good if it has low mobility?” Step 3: The expert provides arguments for the critical example The domain expert needs to provide at least one argument (a set of reasons) why the example’s class value is as given. The argument must contain at least one reason, which was not in the original explanation provided by the machine le- arning method, otherwise this argument will not influence learning. If the expert can not give such an argument, we have to return to step 1 and provide another critical exam- ple. In our previous experiments, a domain expert was unable to provide an argument due to the following two reasons. In the first case, an expert might find the example an outlier, because he or she cannot explain why the example is in this class. We can then remove the example from the data set, or, if not, prevent this example to become a critical example again. In the other case, which is also quite common, the expert discovers an error in the data. For example, it might turn out that the label of the example is wrong or that there is an error in the value of one of the descriptive attributes. Then we have to correct the error and start with another critical example. Example. We used ABML to learn a diagnostic model for distinguishing between different types of tremor in pa- tients with a neurological disease [7]. The patients were classified as essential tremor or parkinsonian tremor or mixed tremor (having both). In most cases, the expert (a physician) was able to explain critical examples. In one of the critical cases, however, the expert realized that some strong symptoms were overlooked at the time of diagnosis and, after a careful deliberation, decided to change the class value. In another critical case, the expert could not provide an argument, because the value of the attribute containing qualitative assessment of a physician had a incorrect value. After the value was corrected, the example was not critical anymore. Step 4: Adding arguments to the critical example A domain expert usually expresses arguments in natural language without considering domain description of the le- arning data set. The knowledge engineer then needs to rephrase provided arguments using domain description lan- guage (attributes). However, expert’s reasons in arguments are sometimes not covered with the current set of attributes. A domain ex- pert often implicitly refers to an attribute missing from the current set of attributes. A knowledge engineer then needs to implement the new attribute, or change the definition of an old one. When the expert refers to unavailable attribu- tes, which can not be added into the domain, we need to continue with another critical example. Explaining examples with arguments has shown to be an effective tool for suggesting new attributes, since domain experts do not need to explicitly propose a set of relevant attributes, but can implicitly suggest new attributes in ex- planations. Arguments in Interactive Machine Learning Informatica 42 (2018) 53–59 57 Example. In the case with the bishop, the expert respon- ded that the bishop’s mobility is not limited, because the blocking knight can easily move to another square. We the- refore had to redesign the mobility attribute by considering only pawns as blocking pieces. In the tremor application, the expert also suggested several new attributes. 
When a patient with essential tremor was selected as critical, the expert mentioned the presence of harmonics (a certain pat- tern in drawings of patients) as a clear signal of essential tremor. There was no attribute in the domain that would explicitly define the presence of harmonics. However we could compute a new boolean attribute (from four existing attributes) representing whether the harmonics were pre- sent or not. Step 5: Discovering counter examples After arguments are added to the critical example, ABML relearns the model. Arguments often apply to many ot- her examples, and not just to the critical example, therefore these arguments will be mentioned in explanations of other examples. When these examples come from the same class as the critical example, such behavior is not problematic, it is in fact favorable, since more examples are now explained using the expert terms. On the other hand, if the model uses positive arguments of the critical example to explain examples from the oppo- site class, we should check the validity of these explanation with the expert. A counter example is an example from the opposite class that is consistent with the positive argument provided by the expert, the induced model mentions this positive argument in the explanation of this counter exam- ple, and the inclusion of this argument in the data resulted in a higher prediction error for this example (e.g. the exam- ple is now misclassified or has a higher probabilistic error). Example. After attaching the above argument to the bis- hop critical example, it turned out that high mobility is not enough, as a position with a bad and highly mobile bishop turned out as the counter example. The provided argument was consistent with the counter example (the mobility of the bishop was high), however it was from the opposite class (the bishop was bad). Step 6: Refining arguments using counter examples When a counter example was found, the expert needs to revise the initial arguments with respect to the counter ex- ample. The expert is now asked ”Why is critical example in one class and why counter example in the other?” The expert may now revise the original argument and explain the difference between these two examples. The procedure then returns to the previous step and seeks for more counter examples. Example. Comparing critical and counter positions in the chess example, the expert decided that the counter ex- ample had a noticeably worse pawn structure. This reason was added to the original argument of the critical exam- ple. Therefore, the initial argument (high mobility) was ex- tended with an argument specifying good pawn structure. Afterwards, no counter examples were found. Step 7: Pruning arguments with similar examples In argumentation, to make an argument stronger and less susceptible to counter-arguments, humans often provide more reasons that are actually needed. In ABML, howe- ver, too many reasons will result in poor generalization. As the last step in the discussion of the particular cri- tical example, we should evaluate reasons in the provided argument whether they are necessary. A reason is unne- cessary when its removal a) does not negatively affect the prediction accuracy of the critical example, b) does not in- troduce new counter-examples, and c) generalizes the argu- ment to similar examples. Given a reason and an argument, a critical example is similar to another example, when they are from the same class and the argument would also apply to the similar example if it was removed. 
A single similar example is then shown to the expert, who needs to decide whether the same argument could be used for both exam- ples. Example. Although we encountered too specific argu- ments in almost every application of ABML, we have not yet used pruning. For example, when we tried to classify student Prolog programs as correct or incorrect [10] and the expert was asked to provide arguments for a correct critical program, he would often mention many syntactical patterns that are indicative of a correct program. After evaluating the rules that were learned from these arguments, we found out that many of the mentioned reasons were redundant. Therefore, in that application pruning would lead to a sim- pler and less fragmented model. The above seven steps are repeated until the system can not find any new critical examples or some goal (such as accuracy or comprehensibility of the model) is achieved. 5 Conclusion When a knowledge engineer is faced with a machine le- arning problem, she usually needs to first sit down with a domain expert and try to define the problem. As mentioned in the paper, this process is not trivial, since experts usually can not give us all the answers in advance, but an interactive process is preferred. Even with an interactive process, the communication can still be difficult, when domain experts do not understand machine learning, and knowledge engi- neers do not understand the domain. In this paper, we proposed to use arguments as a commu- nication method for bridging the gap between domain ex- perts and machine learning. Argumentation is a skill used in everyday communication that everyone learns to a cer- tain extent. Therefore, if machine learning results and dom- ain experts’ knowledge were represented as arguments, it would facilitate smoother communication. We first showed how machine learning can interact with domain experts by explaining its decisions using arguments 58 Informatica 42 (2018) 53–59 M. Možina for and against. Such explanations resemble argumentative reasoning and should thus be good enough for experts. Af- terwards, we demonstrated how experts can express their knowledge by explaining particular learning examples with positive and negative arguments. The learning algorithm then uses these arguments to guide learning towards a mo- del that is consistent with data and provided arguments. Fi- nally, to close the loop, we described a dialogue between a domain expert and a knowledge engineer designed to drive the expert to provide useful knowledge. When we first presented the ABML idea [14], the argu- ments were only used to explain learning examples. In one of the following experiments [13], we defined the ABML refinement loop, where arguments turned out to be an ef- fective tool for elicitation of new attributes. This refine- ment loop was then further revised through many appli- cations [8]. In this paper, we presented an extended ver- sion of the ABML refinement loop, where communication between a domain expert and a domain engineer compri- ses of several questions and arguments. This involves ma- chine generated arguments, asking expert to give counter- arguments to machine learning arguments, and refining ar- guments given counter examples and similar examples. Acknowledgement This work was partly supported by the Slovene Agency for Research and Development (ARRS). 
We would also like to thank the two anonymous reviewers for valuable suggesti- ons on this paper and colleagues from the Artificial Intelli- gence Laboratory, who contributed a lot in the past in the development of the ABML idea. References [1] Peter Clark and Robin Boswell. Rule induction with CN2: Some recent improvements. In Machine Le- arning - Proceeding of the Fifth Europen Conference (EWSL-91), pages 151–163, Berlin, 1991. [2] Pedro Domingos. Toward knowledge-rich data mi- ning. Journal of Data Mining and Knowledge Disco- very, 15:21–28, 2007. [3] Jerry Alan Fails and Dan R. Olsen, Jr. Interactive machine learning. In Proceedings of the 8th Interna- tional Conference on Intelligent User Interfaces, IUI ’03, pages 39–45, 2003. [4] Edward A. Feigenbaum. Knowledge engineering: the applied side of artificial intelligence. In Proc. of a symposium on Computer culture: the scientific, intel- lectual, and social impact of the computer, pages 91– 107, New York, NY, USA, 1984. New York Academy of Sciences. [5] Alex A. Freitas. Comprehensible classification mo- dels - a position paper. SIGKDD Explorations New- sletter, 15(1):1–10, 2014. [6] Valerio Grossi, Andrea Romei, and Franco Turini. Survey on using constraints in data mining. Data Mining and Knowledge Discovery, 31(2):424–464, 2017. [7] Vida Groznik, Matej Guid, Aleksander Sadikov, Mar- tin Možina, Dejan Georgiev, Veronika Kragelj, Samo Ribari, Zvezdan Pirtoek, and Ivan Bratko. Elicita- tion of neurological knowledge with argument-based machine learning. Artificial intelligence in medicine, 57(2):133–144, 2013. [8] Matej Guid, Martin Možina, Vida Groznik, Aleksan- der Sadikov, Dejan Georgijev, Zvezdan Pirtoek, and Ivan Bratko. Abml knowledge refinement loop: A case study. In Proceedings of the 2012 IEEE 20th In- ternational Symposium (ISMIS 2012), pages 41–50, 2012. [9] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. Principles of explanatory debug- ging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, IUI ’15, pages 126–137, 2015. [10] Timotej Lazar, Martin Možina, and Ivan Bratko. Au- tomatic extraction of ast patterns for debugging stu- dent programs. In International Conference on Artifi- cial Intelligence in Education, pages 162–174. Sprin- ger, 2017. [11] Brian Y. Lim, Anind K. Dey, and Daniel Avrahami. Why and why not explanations improve the intelligi- bility of context-aware intelligent systems. In Procee- dings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’09, pages 2119–2128, 2009. [12] Violeta Mirchevska, Mitja Lustrek, and Matjaz Gams. Combining domain knowledge and machine learning for robust fall detection. Expert Systems, 31:163–175, 2014. [13] Martin Možina, Matej Guid, Jana Krivec, Aleksan- der Sadikov, and Ivan Bratko. Fighting knowledge acquisition bottleneck with argument based machine learning. In Proceedings of the 2008 Conference on ECAI 2008: 18th European Conference on Artificial Intelligence, pages 234–238, 2008. [14] Martin Možina, Jure Žabkar, and Ivan Bratko. Argument-based machine learning. Artificial Intelli- gence, 171(10/15):922–937, 2007. [15] Martin Možina, Janez Demšar, Michael Kattan, and Blaž Zupan. Nomograms for visualization of naive bayesian classifier. 
In Jean-François Boulicaut, Flo- riana Esposito, Fosca Giannotti, and Dino Pedreschi, editors, Knowledge Discovery in Databases: PKDD Arguments in Interactive Machine Learning Informatica 42 (2018) 53–59 59 2004: 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy, September 20-24, 2004. Proceedings, pages 337–348, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. [16] J. Arturo Olvera-López, J. Ariel Carrasco-Ochoa, J. Francisco Martı́nez-Trinidad, and Josef Kittler. A review of instance selection methods. Artif. Intell. Rev., 34(2):133–143, 2010. [17] Erik Štrumbelj and Igor Kononenko. An efficient ex- planation of individual classifications using game the- ory. J. Mach. Learn. Res., 11:1–18, March 2010. [18] Simone Stumpf, Vidya Rajaram, Lida Li, Weng-Keen Wong, Margaret Burnett, Thomas Dietterich, Erin Sullivan, and Jonathan Herlocker. Interacting mea- ningfully with machine learning systems: Three ex- periments. Int. J. Hum.-Comput. Stud., 67(8):639– 662, 2009. [19] Douglas Walton. Foundamentals of Critical Argu- mentation; 1st edition. Cambridge University Press, 2005. [20] Ting Yu. Incorporating prior domain knowledge into inductive machine learning: its implementation in contemporary capital markets. PhD thesis, Univer- sity of Technology, Sydney. Faculty of Information Technology., 2007. 60 Informatica 42 (2018) 53–59 M. Možina Informatica 42 (2018) 61–68 61 An Inter-domain Study for Arousal Recognition from Physiological Signals Martin Gjoreski1,2, Mitja Luštrek1 and Matjaž Gams1,2 1Department of Intelligent Systems, Jožef Stefan Institute 2Jožef Stefan International Postgraduate School Ljubljana, Slovenia E-mail: martin.gjoreski@ijs.si Blagoj Mitrevski Faculty of Computer Science and Engineering Skopje, R. Macedonia Keywords: arousal recognition, GSR, R-R, machine learning, emotion recognition, health Received: October 27, 2017 Arousal recognition from physiological signals is a task with many challenge remaining, especially when performed in several different domains. However, the need for emotional intelligent machines increases day by day, starting with timely detection and improved management of mental disorders in mobile health, all the way to enhancing user experience in human-computer interaction (HCI). One of the open research questions, which we analyze in this paper, is which machine-learning (ML) methods and which input is most suitable for arousal recognition. We present an inter-domain study for arousal recognition on six different datasets. The datasets are processed and translated into a common spectro- temporal space of R-R intervals and Galvanic Skin Response (GSR) data, from which features are extracted and fed into ML algorithms. We present a comparison between dataset-specific models, “flat” models build on the overall data, and a novel stacking scheme, developed to utilize knowledge from all six datasets. When one model is built for each dataset, it turns out that whether the R-R, GSR, or merged features yield the best results is domain (dataset) dependent. When all datasets are merged into one and used to train and evaluate the models, the stacking scheme improved upon the results of the “flat” models. Povzetek: Zaznavanje psihološkega vzburjenja iz fizioloških signalov je težka naloga, posebej če se je želimo lotiti na enoten način za več različnih domen. 
Vendar je potreba po inteligentnih strojih, ki so zmožni razumeti tudi čustva, vedno večja: uporabljajo se za različne probleme, od obvladovanja duševnih motenj z rešitvami mobilnega zdravstva do izboljševanja uporabniške izkušnje pri interakciji človeka z računalnikom. Odprto raziskovalno vprašanje, s katerim se ukvarja ta članek, je, katere metode strojnega učenja in kakšni vhodni podatki so primerni za zaznavanje vzburjenja. Članek opisuje več-domensko študijo zaznavanja vzburjenja na šestih različnih zbirkah podatkov. Zbirke so pretvorjene v enoten spektralno-časovni prostor intervalov R-R in galvanskega odziva kože, iz katerih izluščimo značilke in jih uporabimo kot vhod v algoritme strojnega učenja. Primerjamo modele, prilagojene posamičnim zbirkam podatkov, modele, zgrajene iz združenih podatkov vse zbirk, in inovativen ansambel modelov, ki takisto uporablja vseh šest zbirk. Izkaže se, da če zgradimo po en model za vsako zbirko podatkov, je od zbirke odvisno, ali se najbolje obnesejo značilke, izluščene iz intervalov R-R, galvanskega odziva kože ali obojega. Če zbirke podatkov združimo, pa se ansambel obnese bolje od navadnega modela. 1 Introduction In 1897, Wundt [1] set the basis for modeling affective states by identifying the two emotional dimensions of calm-excitement and relaxation-tension. Almost a century later, in 1997, the field of affective computing [2] has been introduced, which aims for computational modeling of the affective states. Besides the maturity of the field of affective computing, modeling affective states has still remained a challenging task. Its importance is mainly reflected in the domain of human- computer interaction (HCI) and mobile health. In the HCI, it enables a natural and emotionally intelligent interaction. In the mobile health, it is used for timely detection and management of emotional and mental disorders such as depression, bipolar disorders and posttraumatic stress disorder. For example, the cost of work-related depression in Europe was estimated to €617 billion annually in 2013. The total was made up of costs resulting from absenteeism and presenteeism (€272 billion), loss of productivity (€242 billion), health care 62 Informatica 42 (2018) 61–68 M. Gjoreski et al. costs of €63 billion and social welfare costs in the form of disability benefit payments (€39 billion) [3]. Affective states are complex states that results in psychological and physiological changes that influence behaving and thinking [5]. These psycho-physiological changes can be captured by a wearable device equipped with galvanic skin response (GSR – measures sweating rate), Electrocardiography (ECG – measures heart electrical activity) or blood volume pulse (BVP – measures cardiovascular dynamics) sensors. For example, the affective state of excitement usually initiates changes in heartbeat, breathing, sweating, and muscle tension, which can be captured using wearable sensors. There are several approaches for modeling emotions, including discrete, continuous, and appraisal-driven approach. For the apprise-driven approach, context information is needed to model people’s relationship to the environment that elicits their emotional response [4]. However, in computer science studies, the required context information is usually not available. In the discrete approach, the affect (emotion) is represented as discrete and distinct state, i.e., anger, fear, sadness, happiness, boredom, disgust and neutral. 
In the continuous approach, the emotions are represented in 2D (see Figure 1) or 3D space of activeness, valance and dominance [5]. Unlike the discrete approach, this model does not suffer from vague definitions and fuzzy boundaries, and has been widely used in affective studies [6] [7] [8]. The use of the same annotating model allows for an inter-study analysis. Figure 1: Circumplex model of affect. The model maps affective states in a 2D space of Arousal and Valence [5]. In this study we examine arousal recognition from GSR and heart–related physiological data, captured via: chest-worn ECG and GSR sensors, finger-worn BVP sensor, and wrist-worn GSR sensor BVP sensor. The data belongs to six publicly available datasets for affect recognition, in which there are 191 different subjects (70 females) and nearly 150 hours of arousal-labelled data. All of this introduces the problem of inter-domain learning, to which ML techniques are sensitive. To overcome this problem, we propose a preprocessing technique and a novel ML stacking scheme. The preprocessing technique translates the datasets into a common spectro-temporal space of R-R and GSR data. After the preprocessing, R-R and GSR features are extracted, which can be fed into ML algorithms to build models for arousal recognition. The novel ML stacking scheme builds dataset-specific ML models and uses a meta-learner to build general models. The novelties of this study are: (1) First study in affect recognition that analyzes data from six different datasets (see Section 3 Data). (2) Methodology for translating physiological data into a common spectro-temporal space of R-R and GSR data (see Section 4.1 Pre-processing and feature extraction). (3) Novel ML stacking scheme that generalizes from dataset-specific to general ML model for arousal recognition (see Section 4.2 Machine learning). 2 Related work Affect recognition is an established computer-science field, but one with many remaining challenges. Many studies confirmed that affect recognition can be performed using speech analysis [10], video analysis [11], or physiological sensors in combination with ML [12]. The majority of the methods that use physiological signals use data from ECG, electroencephalogram (EEG), functional magnetic resonance imaging (fMRI), GSR, electrooculography (EOG) and/or BVP sensors. In general, the methods based on EEG data outperform the methods based on other data [6] [7], probably due to the fact the EEG provides a more direct channel to one’s mind. However, even though EEG achieves the best results, it is not applicable in normal everyday life. In contrast, affect recognition from R-R intervals or GSR data, is much more unobtrusive since this data can be extracted from ECG sensors, BVP sensors, or GSR sensors, most of which can be found in a wrist device (e.g., Empatica [13] and Microsoft Band [14]). Our methodology is tailored towards this type of data. Regarding the typical ML approaches for affect recognition, Iacoviello et al. have combined discrete wavelet transformation, principal component analysis and support vector machine (SVM) to build a hybrid classification framework using EEG [15]. Khezri et al. used EEG combined with GSR to recognize six basic emotions via K-nearest neighbors (KNN) classifiers [16]. Mehmood and Lee used independent component analysis to extract emotional indicators from EEG, EMG, GSR, ECG and effective refractory period (ERP) [17]. Mikuckas et al. 
[18] presented a HCI system for emotional state recognition that uses spectro-temporal analysis only on R-R signals. More specifically, they focused on recognizing stressful states by means of the heart rate variability (HRV) analysis. An Inter-domain Study for Arousal Recognition … Informatica 42 (2018) 61–68 63 Regarding the more advanced ML approaches, Yin et al. [20] used an ensemble of deep classifiers for recognizing affective states using EEG, electromyography (EMG), ECG, GSR, and EOG. Using the same data, Verma et al. [19] developed an ensemble of shallow classifiers. Similarly, Kuncheva et al. [21] introduced AMBER - Advanced Multi-modal Biometric Emotion Recognition approach which uses data from EEG, EDA and HR sensor. In contrast with the related work, which analyzes only one dataset, we perform experiments with six different datasets (domains), we analyze which ML algorithms in combination with which data type (either R-R intervals or GSR) yields best performance across all six different dataset for arousal recognition, and we propose a novel stacking method for learning from all six different domains. Finally, the work presented here is related to our previous conference paper [39]. Here we present more details regarding the data pre-processing and feature extraction, we present the novel stacking scheme and new experimental results. 3 Data The data belongs to six publicly available datasets for affect recognition: Ascertain [6], Deap [7], Driving workload dataset [26], Cognitive load dataset [27], Mahnob [29], and Amigos [30]. Overall, nearly 150 hours of arousal-labelled data that belong to 191 subjects. Table 1 presents the data summary, which contains: number of subjects per dataset, the mean age, number of trials per subject, mean duration of each trial, duration of data per subject ̶ in seconds, and overall duration. Our goal was to recognize the arouse. Four datasets, Ascertain, Deap, Mahnob and Amigos, were already labelled with the subjective arousal level. One difference between these datasets was the arousal scale used for annotating. For example, the Ascertain dataset used a 7- point arousal scale, whereas the Deap dataset used a 9- point arousal scale (1 is very low, and 9 is very high, and the mean value is 5). Since the problem of arousal recognition is difficult, we decided to formulate it as a binary classification problem. From both scales, we thus split the labels in two classes using the mean value with respect the original scales. This is the same split used in the original studies. A similar step was performed for the Mahnob dataset. Two datasets, Driving workload and Cognitive load, did not contain labels for subjective arousal level. The Driving workload dataset was labelled with subjective ratings for a workload during a driving session. For this dataset, we presume that increased workload corresponds to increased arousal. Thus, we used the workload ratings as arousal ratings. The threshold for high arousal was put on 50%. Similarly, the Cognitive load dataset was labelled for subjective stress level during stress inducing cognitive load tasks (mathematical equations). The subjective scale was from 0 to 4 (no stress, low, medium and high stress). We put the threshold for high arousal on 2 (medium stress). 4 Methods 4.1 Pre-processing and feature extraction 4.1.1 R-R data The preprocessing is essential, since it allows merging of the six different datasets. 
For the heart-related data, it translates the physiological signals (ECG or BVP) to R-R intervals and performs temporal and spectral analysis. First, a peak detection algorithm is applied to detect the R-R peaks. Figure 2 presents an example for ECG signal and the detected R-R peaks. On the x-axis is the sample of the data window, on the y-axis is the output of the ECG sensor (voltage) and the detected peaks are marked with red. Next, is temporal analysis, i.e., calculating the time distance between the detected peaks. Once the R-R intervals are detected they can be analyzed as a time series. Figure 3 is an example of an R-R time series. On the y-axis is the duration of the R-R interval, and on the x-axis is the time (in seconds) in which the R-R interval has occurred. After the detection of R-R intervals, the R-R signal is processed. First, each R-R signal is filtered using a median filter which removes the R-R intervals that are outside of the interval [0.7*median, 1.3*median]. These parameters were determined experimentally. Table 1: Experimental data summary [39]. Dataset Duration per Subjects Females Mean age Trials trial [s] subject [min] dataset [h] Ascertain 58 21 31 36 80 48.0 46.4 DEAP 32 16 26.9 40 60 40.0 21.3 Driving 10 3 35.6 1 1800 30.0 5.0 Cognitive 21 0 28 2 2400 80.0 28.0 Mahnob 30 17 26 40 80 53.3 26.7 Amigos 40 13 28 16 86 22.9 15.3 Overall 191 70 29.25 135 884.0 251.3 142.7 64 Informatica 42 (2018) 61–68 M. Gjoreski et al. Figure 2: ECG signal and detected R-R peaks (red color). ASCERTAIN dataset t, Subject 1, Video 29 [6]. Figure 3: Example R-R signal as a time-series. ASCERTAIN dataset, Subject 1, Video 29 [6]. After the median filter, person specific winsorization is performed with the threshold parameter of 3 to remove outlier R-R intervals. From the filtered R-R signals, periodogram is calculated using the Lomb-Scargle algorithm [9]. The Lomb-Scargle algorithm allows efficient computation of a Fourier-like power spectrum estimator from unequally spaced data (as are the R-R intervals). Figure 4 presents an example Lomb-Scargle periodogram. The red color represent the low frequencies and the yellow color represents the high frequencies. Finally, based on the related work [36], the following HRV features were calculated from the time and spectral representation of the R-R signals: the mean heart rate (meanHR), the mean of the R-R intervals (meanRR), the standard deviation of the R-R intervals (sdnn), the standard deviation of the differences between adjacent R- R intervals (sdsd), the square root of the mean of the squares of the successive differences between adjacent R-R intervals (rmssd), the percentage of the differences between adjacent R-R intervals that are greater than 20 ms, the percentage of the differences between adjacent R-R intervals that are greater than 50 ms, Poincaré plot indicies (SD1 and SD2), total spectral power of all R-R samples in power between 0.003 and 0.04 Hz (lf - low frequencies), between 0.15 and 0.4 Hz (hf ̶ high frequencies), and the ratio of low to high frequency power. 4.1.2 GSR data To merge the GSR data from the six datasets, several problems were addressed. Each dataset is recorded with different GSR hardware, thus the data can be presented in different units and different scales. To address this problem, each GSR signal was converted to µS (micro Siemens). Next, the GSR signal was filtered using a lowpass filter with a cut-off frequency of 1 Hz. Figure 5 presents an example filtered GSR signal. 
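As a concrete illustration of two of the filtering steps described in this section, the sketch below shows the median-based R-R interval filter and the 1 Hz low-pass filtering of the GSR signal. It assumes scipy, a known sampling frequency fs, a Butterworth filter of order 4 and zero-phase filtering, none of which are specified in the paper beyond the 1 Hz cut-off; it is not the authors' code.

# Minimal sketches of the R-R interval filter and the 1 Hz GSR low-pass
# filter described above (filter order and zero-phase filtering are
# assumptions).
import numpy as np
from scipy.signal import butter, filtfilt

def filter_rr(rr_intervals):
    """Keep only R-R intervals within [0.7 * median, 1.3 * median]."""
    rr = np.asarray(rr_intervals, dtype=float)
    med = np.median(rr)
    return rr[(rr >= 0.7 * med) & (rr <= 1.3 * med)]

def lowpass_gsr(gsr_microsiemens, fs, cutoff_hz=1.0, order=4):
    """Zero-phase low-pass filtering of a GSR signal converted to microsiemens."""
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, np.asarray(gsr_microsiemens, dtype=float))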
To address the inter-participant variability of the signal, person-specific min-max normalization was performed, i.e., each signal was scaled to [0, 1] using person specific winsorized minimum and maximum values. The winsorization parameter was set to 3. Figure 4: Normalized Lomb-Scargle periodogram calculated from R-R signal. ASCERTAIN dataset, Subject 1, Video 29 [6]. Figure 5: Filtered GSR signal. ASCERTAIN dataset, person 1, Clip 1 [6]. Figure 6: GSR signal decomposition (green – tonic driver, slow acting component; red – GSR responses, fast acting component). ASCERTAIN dataset, person 1, Clip 1 [6]. An Inter-domain Study for Arousal Recognition … Informatica 42 (2018) 61–68 65 Finally, the fast acting component (GSR responses) and the slow acting component (tonic component) were determined in the signal using the “peakutils.baseline” function from the Python’s PeakUtils library. The function is used with the default parameters. It iteratively performs a polynomial fitting in the data to detect its baseline. For example, in Figure 6, the GSR responses are marked with red and the tonic component (baseline) is marked with green. Based on the related work [30], the preprocessed GSR signal was used to calculate GSR features: mean, standard deviation, 1st and 3rd quartile (25th and 75th percentile), quartile deviation, derivative of the signal, sum of the signal, number of responses in the signal, rate of responses in the signal, sum of the responses, sum of positive derivative, proportion of positive derivative, derivative of the tonic component of the signal, difference between the tonic component and the overall signal. 4.2 Machine learning 4.2.1 Flat machine learning After the feature extraction, the data is in a format which can be input for typical ML algorithms. Models were built using seven different ML algorithms: Random Forest, Support Vector Machine, Gradient Boosting Classifier, AdaBoost Classifier (with a Decision Tree as a base classifier), KNN Classifier, Gaussian Naive Bayes and Decision Tree Classifier. The algorithms were used as implemented in the Scikitlearn, the Python ML library [37]. For each algorithm, a randomized search on hyper parameters was performed on the training data using 2- fold cross-validation. 4.2.2 Stacking The novel stacking scheme, depicted in Figure 7, was designed to train a meta-learner which would utilize the knowledge from all six datasets. In the example scenario, we used the 7 ML algorithms mentioned in the previous section. Thus, there are 42 base models (6 datasets x 7 ML algorithms). The outputs of the base models, which are probabilities for the class “high arousal”, are used as input to a meta-learner. The meta-learner can be any ML algorithm previously mentioned. We experimentally chose Random Forest to be our meta-learner. The meta- learner is trained using a 10 fold-cross validation on the training data. That is, the base learners are trained on 90% of the data, then predictions are provided on the rest 10% of the data, and this procedure is repeated ten times. Finally, the meta-learner is trained on the cross-validated predictions of the base learners. In the test phase, the test instances are provided as input to all of the 42 base models, their output is summed up in a 42 dimensional vector (in Figure 7 marked as p11, p12,…,p67 – six datasets and seven base models) as input to the meta-learner, which provides the final prediction for the test instance. 
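The stacking scheme can be summarized with a short sketch. The code below is a simplified illustration built on scikit-learn rather than the authors' implementation: it assumes binary labels encoded consistently across datasets and the same feature columns everywhere, takes out-of-fold probabilities for a model's own dataset via cross_val_predict, and simply applies the models trained on the other datasets.

# Simplified sketch of the stacking scheme: 6 datasets x 7 algorithms = 42
# base models whose class probabilities become the meta-features of a
# Random Forest meta-learner.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def train_stacking(datasets, base_factories, cv=10):
    # datasets: list of (X, y) pairs; base_factories: callables returning
    # unfitted classifiers, e.g., [lambda: RandomForestClassifier(), ...].
    fitted = [[make().fit(X, y) for make in base_factories] for X, y in datasets]
    meta_X, meta_y = [], []
    for i, (X, y) in enumerate(datasets):
        cols = []
        for j in range(len(datasets)):
            for k, make in enumerate(base_factories):
                if i == j:
                    # own dataset: out-of-fold probability of the positive class
                    p = cross_val_predict(make(), X, y, cv=cv,
                                          method="predict_proba")[:, 1]
                else:
                    # other datasets: apply the already fitted model
                    p = fitted[j][k].predict_proba(X)[:, 1]
                cols.append(p)
        meta_X.append(np.column_stack(cols))
        meta_y.append(np.asarray(y))
    meta = RandomForestClassifier(random_state=0).fit(np.vstack(meta_X),
                                                      np.concatenate(meta_y))
    return fitted, meta

def predict_stacking(fitted, meta, X_new):
    # Column order matches training: datasets outer, algorithms inner.
    cols = [m.predict_proba(X_new)[:, 1] for models in fitted for m in models]
    return meta.predict(np.column_stack(cols))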
5 Experimental results
Two types of experiments were performed: dataset-specific experiments and experiments on merged datasets. The dataset-specific experiments were used to identify the ML algorithm and the input that yield the best performance per dataset. The experiments on the merged datasets were used to build general, dataset-independent ML models. This evaluation simulates a scenario where the source (dataset) is unknown, i.e., we do not know whether the subject is watching an affective video (e.g., the DEAP dataset), is driving a car (e.g., the Driving workload dataset) or is working on a cognitively demanding task (e.g., the Cognitive load dataset). The evaluation was performed using trial-specific 10-fold cross-validation, i.e., the data segments that belong to one trial (e.g., one affective stimulus) can belong either only to the training set or only to the test set, so there was no overlap between the training and test data.
5.1 Dataset specific
The results for the dataset-specific experiments are presented in Table 2. The first column represents the ML algorithm, the second column represents the features used as input to the algorithm (R-R, GSR or Merged - M), and the remaining columns represent the dataset used for training and evaluation with the trial-specific 10-fold cross-validation. We report the mean accuracy ± the standard deviation over the 10 folds. For each dataset, the best performing model(s) is (are) marked with green.
Figure 7: The novel stacking scheme for training a meta-learner that utilizes knowledge from all six datasets.
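For completeness, the trial-specific cross-validation described above can be reproduced with grouped folds, so that all segments of one trial end up on the same side of each split. The sketch below assumes scikit-learn's GroupKFold, a Random Forest as a placeholder classifier, and trial identifiers supplied per segment; it is an illustration, not the authors' evaluation code.

# Minimal sketch of trial-specific cross-validation: segments sharing a
# trial ID never appear in both the training and the test fold of a split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

def trial_specific_scores(X, y, trial_ids, n_splits=10):
    cv = GroupKFold(n_splits=n_splits)
    clf = RandomForestClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=cv, groups=trial_ids)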
The evaluation is performed using the same trial-specific 10-fold cross-validation. The results are presented in Figure 8. It can be seen that all of the "flat" algorithms achieved an accuracy below or equal to 60%. The meta-learning approach slightly improved the results by achieving an accuracy of 63%.
Table 2: Dataset-specific experimental results. Mean accuracy ± standard deviation for trial-specific 10-fold cross-validation. The best performing models per dataset are marked with green [39].
Algorithm | Features | Ascertain | Deap | D. Workload | Cog. Load | Mahnob | Amigos
RF | R-R | 0.655 ± 0.07 | 0.556 ± 0.03 | 0.785 ± 0.24 | 0.739 ± 0.13 | 0.580 ± 0.11 | 0.536 ± 0.06
RF | GSR | 0.638 ± 0.06 | 0.503 ± 0.04 | 0.780 ± 0.24 | 0.763 ± 0.12 | 0.611 ± 0.07 | 0.473 ± 0.11
RF | M | 0.653 ± 0.05 | 0.540 ± 0.04 | 0.785 ± 0.25 | 0.755 ± 0.13 | 0.611 ± 0.10 | 0.559 ± 0.10
SVM | R-R | 0.664 ± 0.07 | 0.536 ± 0.05 | 0.795 ± 0.26 | 0.717 ± 0.21 | 0.623 ± 0.15 | 0.521 ± 0.24
SVM | GSR | 0.664 ± 0.07 | 0.525 ± 0.05 | 0.795 ± 0.26 | 0.712 ± 0.20 | 0.588 ± 0.10 | 0.470 ± 0.12
SVM | M | 0.664 ± 0.07 | 0.513 ± 0.03 | 0.795 ± 0.26 | 0.691 ± 0.18 | 0.623 ± 0.15 | 0.506 ± 0.13
GB | R-R | 0.649 ± 0.07 | 0.554 ± 0.03 | 0.785 ± 0.20 | 0.736 ± 0.15 | 0.578 ± 0.11 | 0.543 ± 0.06
GB | GSR | 0.642 ± 0.05 | 0.500 ± 0.04 | 0.800 ± 0.21 | 0.743 ± 0.12 | 0.609 ± 0.08 | 0.527 ± 0.09
GB | M | 0.644 ± 0.05 | 0.533 ± 0.03 | 0.755 ± 0.23 | 0.761 ± 0.15 | 0.609 ± 0.11 | 0.542 ± 0.09
AdaB | R-R | 0.658 ± 0.06 | 0.532 ± 0.02 | 0.750 ± 0.23 | 0.718 ± 0.13 | 0.580 ± 0.09 | 0.531 ± 0.07
AdaB | GSR | 0.633 ± 0.05 | 0.485 ± 0.03 | 0.750 ± 0.22 | 0.740 ± 0.13 | 0.589 ± 0.08 | 0.514 ± 0.09
AdaB | M | 0.623 ± 0.05 | 0.526 ± 0.03 | 0.755 ± 0.22 | 0.766 ± 0.16 | 0.610 ± 0.08 | 0.560 ± 0.08
KNN | R-R | 0.625 ± 0.05 | 0.509 ± 0.02 | 0.710 ± 0.19 | 0.715 ± 0.13 | 0.582 ± 0.07 | 0.509 ± 0.05
KNN | GSR | 0.590 ± 0.06 | 0.496 ± 0.04 | 0.795 ± 0.26 | 0.772 ± 0.09 | 0.605 ± 0.06 | 0.533 ± 0.08
KNN | M | 0.600 ± 0.05 | 0.490 ± 0.02 | 0.750 ± 0.23 | 0.770 ± 0.13 | 0.601 ± 0.09 | 0.533 ± 0.06
NB | R-R | 0.654 ± 0.07 | 0.537 ± 0.04 | 0.735 ± 0.15 | 0.748 ± 0.15 | 0.574 ± 0.06 | 0.485 ± 0.09
NB | GSR | 0.602 ± 0.04 | 0.537 ± 0.05 | 0.540 ± 0.22 | 0.803 ± 0.09 | 0.624 ± 0.07 | 0.454 ± 0.10
NB | M | 0.591 ± 0.04 | 0.535 ± 0.06 | 0.665 ± 0.17 | 0.804 ± 0.12 | 0.592 ± 0.06 | 0.486 ± 0.09
DT | R-R | 0.664 ± 0.07 | 0.519 ± 0.05 | 0.685 ± 0.17 | 0.736 ± 0.15 | 0.597 ± 0.09 | 0.505 ± 0.06
DT | GSR | 0.640 ± 0.05 | 0.542 ± 0.05 | 0.765 ± 0.22 | 0.734 ± 0.08 | 0.583 ± 0.09 | 0.483 ± 0.11
DT | M | 0.650 ± 0.05 | 0.524 ± 0.04 | 0.615 ± 0.22 | 0.704 ± 0.09 | 0.581 ± 0.13 | 0.551 ± 0.09
Majority | – | 0.664 | 0.536 | 0.795 | 0.717 | 0.623 | 0.521
Figure 8: Accuracy of the meta-learner and the "flat" approaches for the merged-datasets experiments (RF 0.58, SVM 0.60, GB 0.59, AdaB 0.58, KNN 0.56, NB 0.60, DT 0.59, Meta 0.63, Majority 0.51).
6 Conclusion and discussion We presented an inter-domain study for arousal recognition on six different datasets, recorded with twelve different hardware sensors. We experimented with dataset-specific models, general models built on the overall (merged) data and general models built using the novel stacking scheme. For the dataset-specific models, we compared the results of seven different ML algorithms, using three different feature inputs (R-R, GSR or Merged – M features). For the models built on the overall (merged) data, we compared the results of the novel stacking scheme and "flat" ML models. The results of the dataset-specific setup showed that, out of the seven ML algorithms tested, none yields the best performance on all datasets.
In addition, no clear conclusion can be drawn about whether the R-R, the GSR or the Merged features yield the best results – this is domain (dataset) dependent. In the merged-datasets experiments, the novel stacking scheme slightly outperformed the "flat" models. This was expected, since the stacking scheme utilizes seven different ML models built on the six different datasets, thus 42 different models (views). However, the experimental results show that there is room for improvement regarding the accuracy achieved in both types of experiments. In the future, we plan to investigate more advanced techniques such as deep neural networks and transfer learning, which might be able to learn more accurate models that generalize across different domains. Finally, once we find the best performing scenario, we will generalize the method for arousal recognition into methods for valence recognition and for discrete emotion recognition. 7 References [1] W. Wundt. Outlines of Psychology (C. H. Judd, Trans.). Oxford, UK: Engelman, 1897. [2] R. Picard. Affective Computing. Cambridge, MA: MIT Press, 1997. [3] Depression cost: http://ec.europa.eu/health/sites/health/files/mental_health/docs/matrix_economic_analysis_mh_promotion_en.pdf [Accessed 27.03.2017]. [4] S. Marsella, J. Gratch. Computationally modeling human emotion. Commun. ACM, 57(12), pp. 56-67, 2014. [5] J. A. Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 1980. [6] R. Subramanian, J. Wache, M. Abadi, R. Vieriu, S. Winkler, N. Sebe. ASCERTAIN: Emotion and Personality Recognition Using Commercial Sensors. IEEE Transactions on Affective Computing, 2016. [7] S. Koelstra, C. Muehl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, I. Patras. DEAP: A Database for Emotion Analysis Using Physiological Signals. IEEE Transactions on Affective Computing, 2012. [8] M. K. Abadi, R. Subramanian, S. M. Kia, P. Avesani, I. Patras, N. Sebe. DECAF: MEG-Based Multimodal Database for Decoding Affective Physiological Responses. IEEE Transactions on Affective Computing, 2015. [9] N. R. Lomb. Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science, vol. 39, pp. 447-462, 1976. [10] Z. Zeng, M. Pantic, G. I. Roisman, T. S. Huang. A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009. [11] I. Abdic, L. Fridman, D. McDuff, E. Marchi, B. Reimer, B. Schuller. Driver Frustration Detection from Audio and Video. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'16), 2016. [12] S. Jerritta, M. Murugappan, R. Nagarajan, K. Khairunizam. Physiological Signals Based Human Emotion Recognition: A Review. International Colloquium on Signal Processing and its Applications, 2011. [13] M. Garbarino, M. Lai, D. Bender, R. W. Picard, S. Tognetti. Empatica E3 - A wearable wireless multi-sensor device for real-time computerized biofeedback and data acquisition. 4th International Conference on Wireless Mobile Communication and Healthcare, pp. 3-6, 2014. [14] Microsoft Band. https://www.microsoft.com/microsoft-band/en-us [15] D. Iacoviello, A. Petracca, M. Spezialetti, G. Placidi. A real-time classification algorithm for EEG-based BCI driven by self-induced emotions. Computer Methods and Programs in Biomedicine, 2015. [16] M. Khezri, M. Firoozabadi, A. R. Sharafat.
Reliable emotion recognition system based on dynamic adaptive fusion of forehead biopotentials and physiological signals. [17] R. M. Mehmood, H. J. Lee. A novel feature extraction method based on late positive potential for emotion recognition in human brain signal patterns. Computers & Electrical Engineering, 2016. [18] A. Mikuckas, I. Mikuckiene, A. Venckauskas, E. Kazanavicius, R. Lukas, I. Plauska. Emotion Recognition in Human Computer Interaction Systems. Elektronika ir Elektrotechnika, Research Journal, Kaunas University of Technology, 2014. [19] G. K. Verma, U. S. Tiwary. Multimodal fusion framework: A multiresolution approach for emotion classification and recognition from physiological signals. NeuroImage, 2014. [20] Z. Yin, M. Zhao, Y. Wang, J. Yang, J. Zhang. Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Computer Methods and Programs in Biomedicine, pp. 93-110, 2017. [21] L. I. Kuncheva, T. Christy, I. Pierce, S. P. Mansoor. Multi-modal Biometric Emotion Recognition Using Classifier Ensembles. Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, 2011. [22] W. Liu, W.-L. Zheng, B.-L. Lu. Multimodal Emotion Recognition Using Multimodal Deep Learning. Online. Available at: https://arxiv.org/abs/1602.08225, 2016. [23] W.-L. Zheng, B.-L. Lu. A multimodal approach to estimating vigilance using EEG and forehead EOG. Journal of Neural Engineering, 2017. [24] Z. Yin, M. Zhao, Y. Wang, J. Yang, J. Zhang. Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Computer Methods and Programs in Biomedicine, 2017. [25] K. Weiss, T. M. Khoshgoftaar, D. Wang. A survey of transfer learning. Journal of Big Data, 2016. [26] S. Schneegass, B. Pfleging, N. Broy, A. Schmidt, F. Heinrich. A Data Set of Real World Driving to Assess Driver Workload. 5th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, 2013. [27] M. Gjoreski, M. Luštrek, M. Gams, H. Gjoreski. Monitoring stress with a wrist device using context. Journal of Biomedical Informatics, 2017, in press. [28] M. Gjoreski, H. Gjoreski, M. Luštrek, M. Gams. Continuous stress detection using a wrist device: in laboratory and real life. ACM Conference on Ubiquitous Computing, Workshop on Mental Health, pp. 1185-1193, 2016. [29] M. Soleymani, T. Pun. A Multimodal Database for Affect Recognition and Implicit Tagging. IEEE Transactions on Affective Computing, 2012. [30] J. A. Miranda-Correa, M. Khomami Abadi, N. Sebe, I. Patras. AMIGOS: A Dataset for Affect, Personality and Mood Research on Individuals and Groups. IEEE Transactions on Affective Computing, 2017. [31] L. H. Negri. Peak detection algorithm. Python implementation. Online. Available at: http://pythonhosted.org/PeakUtils/. [32] M. Wu. Trimmed and Winsorized Estimators. PhD thesis, Michigan State University, 2006. [33] J. D. Scargle. Studies in astronomical time series analysis. II - Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal, vol. 263, pp. 835-853, 1982. [34] D. P. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980, 2014. [35] Tensorflow. Online. Available at: https://www.tensorflow.org/ [36] R. Castaldo, P. Melillo, U. Bracale, M. Caserta, M. Triassi, L. Pecchia.
Acute mental stress assessment via short term HRV analysis in healthy adults: A systematic review with meta- analysis. Biomedical Signal Processing and Control. 2015. [37] Scikit-learn, Python machine-learning library http://scikit-learn.org/dev/_downloads/scikit-learn- docs.pdf [38] L.J.P, van der Maaten., G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research. 9: 2579–2605, 2008. [39] M. Gjoreski, B. Mitrevski, Mitja Luštrek, Matjaž Gams. R-R vs GSR – An inter-domain study for arousal recognition, Multiconference Information Society, Ljubljana, 2017. [40] Python library for signal analysis: http://pythonhosted.org/PeakUtils/ Informatica 42 (2018) 69–76 69 Computational Creativity in Slovenia Senja Pollak1, Geraint A. Wiggins2,3, Martin Žnidaršič and Nada Lavrač1,4 1 Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Computational Creativity Lab, Queen Mary University of London, London E1 4NS, UK 3 AI Lab, Free University of Brussels, Brussels 1050, Belgium 4 University of Nova Gorica, Nova Gorica, Slovenia E-mail: {senja.pollak,martin.znidarsic,nada.lavrac}@ijs.si; geraint.wiggins@qmul.ac.uk Keywords: computational creativity, bisociative reasoning, computational creativity platform Received: November 6, 2017 Computational Creativity is a field of Artificial Intelligence that addresses processes that would be deemed creative if performed by a human. The field has been very active since 1999, and is now an established research field with its own International Conference on Computational Creativity (ICCC) conference series founded in 2010. This paper briefly surveys the field of Computational Creativity (CC) that is based on the analysis of ICCC conference papers, followed by a more detailed presentation of projects and selected contributions of Slovenian researchers to the field. Povzetek: Računalniška ustvarjalnost je področje umetne inteligence, ki obravnava procese, ki bi jih oce- nili kot kreativne, če bi jih izvajal človek. Področje računalniške ustvarjalnosti se je razmahnilo po letu 1999, kot veja znanosti pa se je uveljavilo leta 2010 z ustanovitvijo serije letnih konferenc z imenom In- ternational Conference on Computational Creativity (ICCC). V članku podamo kratek pregled področja računalniške ustvarjalnosti, ki temelji na analizi ICCC konferenčnih člankov, posebno pozornost pa name- nimo predstavitvi projektov in izbranih dosežkov slovenskih raziskovalcev. 1 Introduction As a sub-field of Artificial Intelligence (AI) research, Com- putational Creativity1 (CC) is concerned with machines that exhibit behaviours that might reasonably be deemed creative [49; 11]. Slovenian researchers have made impor- tant contributions to CC. This paper aims to provide an ob- jective snapshot of the field of computational creativity as a whole, and to give a brief summary of the particular con- tribution of Slovenian researchers to it. In the next section, we summarise an analysis of the re- search field, that we conducted in 2016, using it to structure a brief introduction to the field for unfamiliar readers. We then summarise the contributions of Slovenian researchers to CC. 2 Domain understanding We here summarise the results of a study of the research field of Computational Creativity [36], which was based on the analysis of papers published in the Proceedings of the International Conference on Computational Creativity (ICCC)2. The aim of the study was to objectively identify areas of interest in this research field. 
Here, we use its conclusions to motivate our subsequent outline of CC research. 1 http://computationalcreativity.net 2 http://computationalcreativity.net/home/conferences/ In that study, Pollak et al. [36] used the semi-automatic topic ontology generation tool OntoGen [16] to explore the texts of the complete conference proceedings of the International Conference on Computational Creativity to date. This allowed them to make an objective, explainable bottom-up analysis of the field. The input to the OntoGen tool is a set of documents, which are the texts of individual articles from the proceedings. After manual text cleaning and removal of the papers' reference sections, OntoGen performs stemming and stop word removal, followed by the construction of Bag-of-Words (BoW) feature vector representations of documents, where the features are weighted by the TF·IDF heuristic [41] and used for clustering. The user may explore the results, and identify hierarchies of significant terms and clusters of documents. The keywords are identified by OntoGen in two ways: descriptive keywords are extracted from document centroid vectors, while distinctive keywords are extracted from the SVM classification model distinguishing the documents in the given topic (document cluster) from the documents in the neighbouring clusters [16]. Other functionalities used were the expert's manual moving of documents between clusters to reduce inappropriate paper categorisation, and active learning of selected concepts/categories. Several outputs were presented by Pollak et al. [36], including an understanding of the field of Computational Creativity based on its topics, which is also of interest to this study. A final corpus-based categorisation of the field of computational creativity is presented in Figure 1. Figure 1: Semi-automatically generated conceptualization of the CC domain, with CC concept naming and sub-concept creation. The main sub-domains of computational creativity identified by our method were: Musical, Visual and Linguistic creativity, Games and creativity, Conceptual creativity, as well as the newly created category of Evaluation. For several domains, subcategories were also detected at lower levels, including Narratives, Poetry, Recipes and Lexical creativity as subdomains of Linguistic creativity. Each category can be further characterised through the descriptive keywords listed in Table 1, as extracted from cluster centroid vectors. 3 Brief review of computational creativity We now present a brief overview of Computational Creativity, as represented by the domains identified by Pollak et al. [36]. We have added an additional category, scientific creativity, on the grounds that important work in this area was performed prior to the inception of the ICCC conference, and was therefore not represented in the conferences. In this position paper we do not present a detailed review of the field but explain the key issues and cite some successful exemplars of CC research. A recurring general theme of ICCC is the attempt to better understand what is meant by the term "creativity." Early on, it was recognised that we must move away from Romantic notions of "great" creativity, if we are to make progress. So ICCC is interested in the creative process more than in creative output, and there is no acceptance of the notion of "inspiration", understood as mystical intervention by some agency extrinsic to the creator.
Of course, in a paper such as this, one cannot discuss process without reference to outputs, without being interminably dull. For this reason, we include examples where possible. Boden [1, 2] first formally raised the question of creativity in AI, but there have been significant precursors of the CC field in several domains that are also mentioned here. 3.1 Visual creativity Most work on visual creativity is conceptualised in terms of painting or drawing. In this domain, there tends to be a focus on painting technique and on the objects produced. The clear forerunner of CC in this domain was Harold Cohen, a successful artist in his own right, who built a robot painter, AARON3, programmed in a rule-based style. Its development began in the 1970s, with developments right up to the artist's death in 2016 [26; 2]. Cohen viewed AARON as a part of his art, and therefore did not always disclose the methods used to make it work, though he did write several papers on some aspects of the system [e.g., 7; 6; 5]. Figure 2(a) shows a well-known painting by AARON. Simon Colton's The Painting Fool4 deconstructs painting from subject composition (for example, collage based on stories from The Guardian newspaper) right down to brush stroke [9]. Figure 2(b) shows an example. DARCI5 [27], unusually, is multi-modal and can explain itself: it combines image processing with language comprehension, so as to focus the system on the extraction and generation of meaning. DARCI produced the image in Figure 2(c), explaining it as follows (there is not space here for the intermediate images): "I looked at this picture, [an elephant walking across a verdant African plain] and it reminded me of this image that I've seen before, [a standing stone] which is a picture of a stone. The picture also seemed gloomy and brooding. So I created this initial sketch, [black and white graphic drawing] and then rendered it in a style related to stone, gloomy, and brooding, which resulted in this image. [intermediate image] It turned out more like a bucket or a cauldron, and it seems creepy, but I'm happy with it." 3 www.aaronshome.come 4 http://www.thepaintingfool.com 5 http://darci.cs.byu.edu
Table 1: Categories and keywords of the first layer of the semi-automatically constructed CC ontology.
Category | Automatically extracted keywords
Musical | music, chord, improvisation, melodies, harmonize, composition, accompaniment, pitch, emotions, beat
Visual | image, painting, darci, artifacts, collage, adjectives, associations, rendered, colored, artists
Linguistic | story, poems, actions, character, words, agents, narrative, artefacts, poetry, evaluating
Games | games, design, player, games design, angelina, agents, code, jam, filter, gameplay
Conceptual | analogy, blending, mapping, conceptual, objective, associations, team, graphs, concepts, domain
Evaluation | music, poems, improvisation, evaluating, interactive, poetry, creativity system, musician, participants, behavioural
Comp. creativity | music, image, story, games, agents, words, actions, poems, character, blending
3.2 Creative game design Computational creativity has many applications in games, perhaps most obviously in the area of game level generation, where the landscape and structure of a game are created live. However, probably the most unexpected and interesting example of CC in games is Yavalath6 [3], ranked in the top 100 board games ever invented by the BoardGameGeek website.
It is highly novel in that the board is hexagonal. Another success has been Angelina7 [12], a long-term project aiming for completely computational creativity in digital games. 3.3 Linguistic creativity Creativity in language covers a broad area, including poetry and story-telling. Two systems that demonstrate different approaches are MEXICA [31] and Propper [17]. MEXICA uses a general creative method, the Engagement-Reflection model, to model a two-phase, cyclic approach to creativity. Propper takes a contrasting approach, using heuristics from literary theory [38] to guide exploratory creativity. A third successful approach is that of Tony Veale [46]. Veale's lab specialises in the development of elegant methods of extracting data from linguistic corpora, and then using that data for creative text generation, often in TwitterBots—see @MetaphorMagnet [e.g., 47]. 6 http://www.cameronius.com/games/yavalath/ 7 http://www.gamesbyangelina.org 3.4 Musical creativity Musical creativity had important precursors too. David Cope's EMI [13; 50] produced many compositions, but none of the reports on the work made it clear what the system actually did, and how much was due to its author. A clearer early contribution, with full scientific reporting, was by [14], which produced musical harmony in the style of J. S. Bach. This is a remarkable contribution, and still stands today as an excellent piece of work; its fault is that its harmonisations sound too much like Bach—the system does not reflect on its overall balance, but applies Bachian compositional tricks everywhere. Perhaps the first attempt at automated composition really to situate itself in CC was the work of [28]. Melodies were generated from a learned model of style, and evaluated in detail by expert musicians [29]. François Pachet's team has produced the most thorough CC music systems to date, working from chords and melodies right through to studio production [40]. 3.5 Scientific creativity It is often forgotten that human creativity is evident in science and engineering, as well as in the arts and humanities. One of the earliest successes in CC was the HR system of Colton [8]. This was an exploratory creativity system, which invented new integer sequences with properties that mathematicians find interesting; 17 of the sequences it discovered were novel and interesting enough to be included in the Journal of Integer Sequences, which records these structures and acts as an encyclopedia of them. It also made conjectures about some of these sequences that were subsequently proven correct. Another successful project in scientific creativity was funded by the EU FP7 programme: BISON studied the application of bisociative reasoning [20] to medical text analysis (see Section 4.1). 3.6 Concept creation Concept creation arises as a separate category in the objective analysis because it is central to all creative domains. Figure 2: Three computationally created images. (a) Untitled from AARON's middle period output. (b) The Painting Fool's Uneasy. (c) DARCI's Always Be A Gloomy Cauldron, Even in Creepy Stone. There are too many approaches to survey here; however, a recurring theme is conceptual blending [44], which has been carried forward with some success. An example is the Divago system [30], a computational model that uses conceptual blending.
The key idea here is somewhat similar to Koestler's bisociation [20]: new concepts are created from combinations of features of existing and/or imagined ones. A recent EU FP7 project, ConCreTe, focused on Concept Creation Technology (see Section 4.3). 3.7 Creative systems evaluation Evaluation is a particularly difficult problem in computational creativity, which attracts commensurate attention in the literature. There are two distinct ways that computationally creative systems involve evaluation: first, in the conventional scientific sense, where the correctness and value of work is assessed; and, second, in the sense of reflection within the system, that allows it to make intelligent creative decisions. Quite often, but certainly not always, these two aims coincide. The value of a creative act is a function of four aspects [51]: Context, Observer, Creator and Artefact, forming the acronym COCA. But this does not give detail of how creativity might actually be assessed. Ritchie [39] gives a detailed set of criteria that can be used to assess the creativity of a computer program, which have been used in several projects. Jordanous [e.g., 18] and van der Velde et al. [e.g., 45] have made substantial contributions in this area. 4 Computational creativity in Slovenia To the best of our knowledge, the only Computational Creativity research in Slovenia has been performed by the members of the Department of Knowledge Technologies at the Jožef Stefan Institute (JSI) in Ljubljana. Most of the research, including the work summarised in Section 2, has taken place within three distinct EU-funded projects and the PROSECCO networking action, all supported by the European FP7 funding programme. We summarise this work, with a special focus on Slovenian contributions. 4.1 Bisociation networks for creative information discovery (BISON) BISON8 was a research project from the field of scientific creativity, which dealt with bisociation-based scientific knowledge discovery. Arthur Koestler [20] argued that the essence of creativity lies in "perceiving of a situation or idea . . . in two self-consistent but habitually incompatible frames of reference", and introduced the expression bisociation to characterise this creative act. The key vision of the BISON project was to develop a fundamentally new ICT paradigm for bisociative information discovery. JSI's main contributions were related to scientific literature mining aimed at creatively forming new hypotheses based on yet uncovered relations between knowledge from different, relatively isolated fields of specialization. We developed CrossBee9, a literature-based discovery support tool [19], in which different elementary and ensemble heuristics are implemented for bisociative bridging term (b-term) discovery. 8 http://cordis.europa.eu/project/rcn/86374_en.html 9 http://crossbee.ijs.si/ The heuristics are defined as functions that numerically evaluate term quality by assigning a bisociation score (measuring the potential that a term is actually a b-term). Other methodologies developed for cross-domain literature-based discovery focus on the exploration of outlier documents [34; 42]. JSI's methods were tested on standard datasets (e.g., the migraine-magnesium pair studied in early research by [43]), but also actually led to new hypotheses in understanding autism [23] and Alzheimer's disease [4].
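To illustrate what such a scoring function can look like, the toy heuristic below rewards terms that occur in documents of both literatures; it is only an illustrative example under assumed inputs, not one of the heuristics actually implemented in CrossBee.

```python
# Toy bisociation-score heuristic (illustrative only, not CrossBee's implementation):
# a candidate bridging term (b-term) should appear in documents of *both* domains,
# so we score it by the smaller of its two relative document frequencies.
def bisociation_score(term, docs_a, docs_b):
    """docs_a, docs_b: lists of token lists from the two domains (e.g., the
    migraine and the magnesium literatures)."""
    freq_a = sum(term in doc for doc in docs_a) / len(docs_a)
    freq_b = sum(term in doc for doc in docs_b) / len(docs_b)
    return min(freq_a, freq_b)

docs_a = [["migraine", "serotonin", "stress"], ["migraine", "calcium", "channel"]]
docs_b = [["magnesium", "calcium", "deficiency"], ["magnesium", "stress", "diet"]]
for term in ["calcium", "stress", "serotonin"]:
    print(term, bisociation_score(term, docs_a, docs_b))
# "calcium" and "stress" score 0.5 (present in both domains); "serotonin" scores 0.0.
```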
4.2 The What-If Machine (WHIM) The WHIM project was concerned with the automated generation, understanding and evaluation of fictional ideas. Fictional ideas are propositions of situations that are unrealistic or commonly considered implausible, such as "What if there was a little fish who couldn't swim?", and they are a central part of various creative works and products. Artificial production of What-if ideas is creative work that is inherently hard to automate, but there are now some generators available (e.g., [22]). In the generation process, there is usually a trade-off between a template-driven process (with relatively narrow coverage of the fictional ideation space) and a more open and autonomous generative process (producing more interesting and valuable ideas, but also a larger amount of lower-quality results). The What-if Machine was also the inspiration for a real musical show, Beyond The Fence, billed as "the world's first computer-generated musical", which was performed in London in 2016. In this artistic project—comprising the musical and a documentary—several computational creativity research prototypes were combined and used in the artistic process [10]. JSI's main role in the WHIM project was in the automated modelling of human evaluations. The main tasks included the design of a large crowd-sourcing data-gathering exercise, which resulted in more than 10,000 evaluated fictional ideas, and the construction of data mining models that differentiate between sentences appreciated by human evaluators as good/creative (regarding their novelty and narrative potential) and those judged as bad. We also tested an alternative approach for gathering human evaluations through interaction with the robot Nao [35]. Other contributions of Slovenian researchers to the WHIM project included bisociative generation of fictional ideation [32] and the RoboChair10 system for enhancing scientific creativity by generating questions regarding decisions made by authors when writing scientific articles [37]. 10 http://kt-robochair.ijs.si/ 4.3 Concept creation technology (ConCreTe) The ConCreTe11 project focused on AI technology for concept construction, identification, and evaluation. ConCreTe addressed several forms of conceptual blending (CB), a basic cognitive mechanism by which two or more mental spaces are integrated to produce new concepts [15]. 11 http://www.conceptcreationtechnology.eu Optimality principles (OPs), a key element in the CB framework, are responsible for guiding the integration process towards good blends. The role of OPs was studied from the point of view of computational systems [24], as well as within a study of human perception of visual animal blends12 [25], performed with the aim of a better understanding of the reception of creative artefacts. The main contribution of JSI to ConCreTe was the ConCreTeFlows platform13 [48] for collective construction of CC workflows. It is a platform built on top of the existing ClowdFlows infrastructure [21], but it is specialised in supporting (primarily text-based) computational creativity tasks, such as conceptual blending and poetry generation. It currently contains more than 35 native widgets supporting creativity, contributed by developers from five different institutions participating in ConCreTe. The asset of a web-based system is that it integrates creative software written in a large variety of programming languages (e.g., components written in Python, C#, Java, PROLOG).
An interesting example of multimodal conceptual blending [48] is available as an interactive workflow.14 4.4 Other projects and activities We have described the main projects from the field of CC in which we were actively involved. Other projects were closely related to computational creativity. For example, within the EU project MUSE15, the question of interactive story-telling was addressed. Our main role was the integration of the developed components in the online workflow environment [33]. The PROSECCO16 networking action had a crucial role in building the European CC community, with a number of events including the organisation of summer schools, code camps, etc. Computational Creativity has become an important research topic in Slovenia. A large number of activities were also organised by Slovenian researchers and took place in Ljubljana, including the 5th edition of the ICCC conference17, with material available through VideoLectures18, and the Symposium on Computational Creativity19. We have also organised the Computational Creativity art exhibition entitled You/Me/It.20 Since 2016, a Computational Creativity course has been offered at the International Postgraduate School Jožef Stefan21. 12 http://animals.janez.me/ 13 http://concreteflows.ijs.si 14 http://concreteflows.ijs.si/workflow/137/ 15 http://www.muse-project.eu/ 16 http://prosecco-network.eu/ 17 http://procsecco-network.eu 18 http://videolectures.net/iccc2014_ljubljana/ 19 http://videolectures.net/ktsymposium2013_ljubljana/ 20 http://computationalcreativity.net/iccc2014/you-me-it-art-exhibition/ As a CC-related outreach activity, a large number of events for children and youth were organised for science promotion by means of a Nao robot, for which the main developer Vid Podpečan received the Slovenian "Prometej znanosti" (Prometheus of Science) science dissemination award. 5 Conclusion This paper presented a brief review of historic and current activity in Computational Creativity, an exciting and relatively new sub-field of Artificial Intelligence. In particular, we have highlighted contributions from Slovenian researchers. Computational Creativity is in some sense a final frontier for AI [11], because it pulls the field away from comfortably defined problem-solving activity such as classification, into areas that are more challenging to formulate. Much of the work in this developing field is focused not so much on "What is the answer?" but rather on "What is the question?", and this makes for exciting prospects for the future, both in Slovenia and elsewhere. In 2008, the Association for Computational Creativity22 (ACC) was founded to manage the ICCC conferences and support the CC community into the future. Acknowledgements We acknowledge the support of the Slovenian Research Agency (core funding no. P2-0103), the European projects Prosecco (grant no. 600653) and ConCreTe (grant no. 611733). GW is very grateful to the International Postgraduate School Jožef Stefan internationalisation grant for funding a sabbatical visit in Autumn 2017, which enabled his contribution to this paper. Literature [1] Boden, M. (1977). Artificial Intelligence and Natural Man. Harvester Press. [2] Boden, M. A. (2004). The Creative Mind: Myths and Mechanisms (2nd ed.). Routledge. [3] Browne, C. (2008). Automatic Generation and Evaluation of Recombination Games. Ph. D. thesis, Queensland University of Technology. [4] Cestnik, B., E. Fabbretti, D. Gubiani, T. Urbančič, and N.
Lavrač (2017). Reducing the search space in literature-based discovery by exploring outlier docu- ments: A case study in finding links between gut mi- crobiome and alzheimers disease. Genomics and Com- putational Biology 3(3), 58. 21https://www.mps.si/splet/studij.asp?lang=eng& main=1&left=4&id=721&m=4 22http://computationalcreativity.net [5] Cohen, H. (1979). What is an image? In Proceedings of the 1979 International Joint Conference on Artificial Intelligence. [6] Cohen, H. (1988). How to draw three people in a bo- tanical garden. In Proceedings of the 1988 Conference of the American Association for Artificial Intelligence (AAAI-88). [7] Cohen, H. (1999). Colouring without seeing: A pro- blem in machine creativity. AISB Quarterly 102, 26–35. [8] Colton, S. (2012a). Automated Theory Formation in Pure Mathematics. Distinguished Dissertations. Sprin- ger London. [9] Colton, S. (2012b). The painting fool: Stories from building an automated artist. In J. McCormack and M. d’Inverno (Eds.), Computers and Creativity. Springer-Verlag. [10] Colton, S., M. T. Llano, R. Hepworth, J. W. Charn- ley, C. V. Gale, A. Baron, F. Pachet, P. Roy, P. Gervás, N. Collins, B. L. Sturm, T. Weyde, D. Wolff, and J. R. Lloyd (2016). The Beyond the Fence musical and Com- puter Says Show documentary. In Proceedings of the Se- venth International Conference on Computational Cre- ativity, UPMC, Paris, France, June 27 - July 1, 2016., pp. 311–321. [11] Colton, S. and G. A. Wiggins (2012). Computational creativity: The final frontier? In de Raedt L. et al. (Ed.), Proceedings of ECAI Frontiers. [12] Cook, M., S. Colton, A. Raad, and J. Gow (2013). Mechanic miner: Reflection-driven game mechanic dis- covery and level design. In A. I. Esparcia-Alcázar (Ed.), Applications of Evolutionary Computation: 16th Euro- pean Conference, Proceedings, pp. 284–293. Springer. [13] Cope, D. (1992). Computer modelling of musical in- telligence in EMI. Computer Music Journal 16(2), 69– 83. [14] Ebcioğlu, K. (1988). An expert system for harmoni- zing four-part chorales. Computer Music Journal 12(3), 43–51. [15] Fauconnier, G. and M. Turner (2002). The Way We Think. New York: Basic Books. [16] Fortuna, B., D. Mladenič, and M. Grobelnik (2006). Semi-automatic construction of topic ontologies. In Se- mantics, Web and Mining: Joint International Works- hops, EWMF 2005 and KDO 2005, Revised Selected Papers, pp. 121–131. Springer. [17] Gervás, P. (2015). Computational drafting of plot structures for Russian folk tales. Cognitive Computa- tion. Computational Creativity in Slovenia Informatica 42 (2018) 69–76 75 [18] Jordanous, A. (2012). A standardised procedure for evaluating creative systems: Computational creativity evaluation based on what it is to be creative. Cognitive Computation 4(3), 246–279. [19] Juršič, M., B. Cestnik, T. Urbančič, and N. Lavrač (2012, may). Cross-domain literature mining: Finding bridging concepts with crossbee. In Proceedings of the Third International Conference on Computational Cre- ativity, Dublin, Ireland, pp. 33–40. [20] Koestler, A. (1976). The Act of Creation. London, UK: Hutchinson. [21] Kranjc, J., V. Podpečan, and N. Lavrač (2012). Clo- wdFlows: A cloud based scientific workflow platform. In Machine Learning and Knowledge Discovery in Da- tabases - European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part II, pp. 816–819. [22] Llano, M. T., S. Colton, R. Hepworth, and J. Gow (2016). Automated fictional ideation via knowledge base manipulation. 
Cognitive Computation 8(2), 153– 174. [23] Macedoni-Lukšič, M., T. Urbančič, I. Petrič, and B. Cestnik (2016). Autism research dynamic through ontology-based text mining. Advances in Autism 2(3), 131–139. [24] Martins, P., S. Pollak, T. Urbančič, and A. Cardoso (2016). Optimality principles in computational appro- aches to conceptual blending: Do we need them (at) all? In Proceedings of the Seventh International Confe- rence on Computational Creativity (ICCC 2016), Paris, France. Sony CSL: Sony CSL. [25] Martins, P., T. Urbančič, S. Pollak, N. Lavrač, and A. Cardoso (2015). The good, the bad, and the aha! blends. In Proceedings of ICCC, pp. 166–173. compu- tationalcreativity.net. [26] McCorduck, P. (1991). AARON’S CODE: Meta-Art, Artificial Intelligence and the Work of Harold Cohen’S CODE: Meta-Art, Artificial Intelligence and the Work of Harold Cohen. Freeman. [27] Norton, D., D. Heath, and D. Ventura (2013). Fin- ding creativity in an artificial artist. Journal of Creative Behavior 47(2), 106–124. [28] Pearce, M. T. (2005). The Construction and Evalua- tion of Statistical Models of Melodic Structure in Music Perception and Composition. Ph. D. thesis, Department of Computing, City University, London, London,UK. [29] Pearce, M. T. and G. Wiggins (2007). Evaluating cog- nitive models of musical composition. In A. Cardoso and G. Wiggins (Eds.), Proceedings of the 4th Internati- onal Joint Workshop on Computational Creativity, Lon- don, pp. 73–80. Goldsmiths, University of London. [30] Pereira, F. C. (2007). Creativity and Artificial Intelli- gence: A Conceptual Blending Approach. Berlin: Mou- ton de Gruyter. [31] Pérez y Pérez, R. and M. Sharples (2001). Mexica: A computer model of a cognitive account of creative wri- ting. Journal of Experimental & Theoretical Artificial Intelligence 13(2), 119–139. [32] Perovšek, M., B. Cestnik, T. Urbančič, S. Colton, and N. Lavrač (2013). Towards narrative ideation via cross- context link discovery using banded matrices. In IDA, Volume 8207 of Lecture Notes in Computer Science, pp. 333–344. Springer. [33] Perovšek, M., V. Podpečan, J. Kranjc, T. Erjavec, S. Pollak, N. Q. Do Thi, X. Liu, C. Smith, M. Cavazza, and N. Lavrač (2015). Text mining platform for NLP workflow design, replication and reuse. In Proceedings of IJCAI Workshop on Replicability and Reusability in Natural Language Processing: From Data to Software Sharing, Buenos Aires, Argentina, 26 July 2015. [34] Petrič, I., B. Cestnik, N. Lavrač, and T. Urbančič (2012, January). Outlier detection in cross-context link discovery for creative literature mining. Comput. J. 55(1), 47–61. [35] Podpečan, V. (2015). The What-If machine robot in- terface (WHIMBOT). In Show, tell imagine: A day to explore computational creativity together, pp. 17. Queen Mary, Univ. of London. [36] Pollak, S., B. M. Boshkoska, D. Miljkovic, G. Wig- gins, and N. Lavrač (2016). Computational creativity conceptualisation grounded on iccc papers. In V. C. F. a. G. François Pachet, Amilcar Cardoso (Ed.), Procee- dings of ICCC 2016, pp. 123–130. Association for Com- putaitonal Creativity. [37] Pollak, S., B. Lesjak, J. Kranjc, V. Podpečan, M. Žnidaršič, and N. Lavrač (2015). RoboCHAIR: Cre- ative assistant for question generation and ranking. In Proceedings of SSCI, pp. 1468–1475. IEEE. [38] Propp, V. (1968). Morphology of the folktale. Austin: University of Texas Press. [39] Ritchie, G. (2007). Some empirical criteria for at- tributing creativity to a computer program. Minds and Machines 17(1), 67–99. [40] Sakellariou, J., F. 
Tria, V. Loreto, and F. Pachet (2017). Maximum entropy models capture melodic sty- les. Scientific Reports 7(9172). [41] Salton, G. and C. Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Pro- cessing & Management 24(5), 513–523. 76 Informatica 42 (2018) 69–76 S. Pollak et al. [42] Sluban, B., M. Juršič, B. Cestnik, and N. Lavrač (2012). Exploring the Power of Outliers for Cross- Domain Literature Mining, pp. 325–337. Berlin, Hei- delberg: Springer Berlin Heidelberg. [43] Swanson, D. R., N. R. Smalheiser, and V. I. Torvik (2006). Ranking indirect connections in literature-based discovery: The role of medical subject headings. Jour- nal of the American Society for Information Science and Technology 57(11), 1427–1439. [44] Turner, M. and G. Fauconnier (1995). Conceptual in- tegration and formal expression. Metaphor and Symbo- lic Activity 10(3), 183–203. [45] van der Velde, F., R. Wolf, M. Schmettow, and D. Na- zareth (2015, 6). A semantic map for evaluating creati- vity. In H. Toivonen, S. Colton, M. Cook, and D. Ven- tura (Eds.), Proceedings of the Sixth International Con- ference on Computational Creativity (ICCC 2015), pp. 94–101. WordPress. Open access. [46] Veale, T. (2012). Exploding the Creativity Myth. New York, NY: Bloomsbury Academic. [47] Veale, T. and G. Li (2016, Apr). Distributed divergent creativity: Computational creative agents at web scale. Cognitive Computation 8(2), 175–186. [48] Žnidaršič, M., A. Cardoso, P. Gervás, P. Martins, R. Hervás, A. O. Alves, H. G. Oliveira, P. Xiao, S. Linkola, H. Toivonen, J. Kranjc, and N. Lavrač (2016). Computational creativity infrastructure for on- line software composition: A conceptual blending use case. In Proceedings of the Seventh International Con- ference on Computational Creativity, UPMC, Paris, France, June 27 - July 1, 2016., pp. 371–379. [49] Wiggins, G. (2006). A preliminary framework for description, analysis and comparison of creative sys- tems. Journal of Knowledge Based Systems 19(7), 449– 458. [50] Wiggins, G. (2007). Models of musical similarity. Musicae Scientiae 11, 315–338. [51] Wiggins, G., P. Tyack, C. Scharff, and M. Rohr- meier (2015). The evolutionary roots of creativity: me- chanisms and motivations. Philosophical Transactions of the Royal Society of London B: Biological Scien- ces 370(1664). Informatica 42 (2018) 77–84 77 Towards Creative Software Blending: Computational Infrastructure and Use Cases Matej Martinc1,2, Martin Žnidaršič1, Nada Lavrač1,3 and Senja Pollak1 1 Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia 2 Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia 3 University of Nova Gorica, Nova Gorica, Slovenia E-mail: matej.martinc@ijs.si, nada.lavrac@ijs.si, martin.znidarsic@ijs.si, senja.pollak@ijs.si Keywords: computational creativity, software blending, visual programming platforms Received: October 31, 2017 Numerous visual programming platforms support the generation, execution and reuse of constructed scien- tific workflows. However, there has been little effort devoted to building creative software blending sys- tems, capable of composing novel workflows by autonomously combining individual software components or even entire workflows originally designed for solving tasks in different research fields. Based on the review of relevant computational creativity research and of contemporary web platforms for workflow construction, this paper defines the desired functionality of a software blending system. 
Considering the required autonomy of the system and the workflow complexity limitations, we investigate the necessary conditions for the implementation of a creative blending system within the existing visual programming platforms. Povzetek: Številne platforme za vizualno programiranje podpirajo gradnjo, izvajanje in ponovno uporabo zgrajenih znanstvenih delotokov. Dosedanje raziskave niso posvečale pozornosti izdelavi kreativnih siste- mov za spajanje programske opreme, ki bi bili sposobni avtonomnega sestavljanja posameznih program- skih komponent ali celo celotnih delotokov, prvotno izdelanih za reševanje nalog na različnih znanstvenih področjih. Na podlagi pregleda raziskav s področja računalniške ustvarjalnosti in obstoječih spletnih plat- form za gradnjo delotokov v tem članku definiramo želeno funkcionalnost sistema za kreativno spajanje programske opreme. Upoštevaje zahteve po avtonomnosti sistema in dovoljeno kompleksnost delotokov preučimo tudi pogoje za implementacijo takega sistema v obstoječih platformah za vizualno programiranje. 1 Introduction Creativity was defined by M. Boden [3] as “the ability to come up with ideas or artefacts that are new, surprising, and valuable”. It is considered as an aspect of human intel- ligence, grounded in everyday abilities such as conceptual thinking, perception, memory and reflective self-criticism. Software is usually not considered creative because it follows explicit instructions of the programmer [4]. Ho- wever, writing software is considered to be a creative task. If a program could define its own instructions, this would clearly mean that the program has some level of creativity. A subfield of artificial intelligence has recently emerged, in which one of the main goals is the creation of software that is able to model, simulate or replicate human creativity. This field, called computational creativity, has been defi- ned by S. Colton and G. Wiggins [6] as “the philosophy, science and engineering of computational systems which, by taking on particular responsibilities, exhibit behaviours that unbiased observers would deem to be creative.” Note that the field of computational creativity should not be confused with the field of creative computing. Alt- hough these two research areas partly overlap, creative computing differs from computational creativity by gene- rally not being considered as a subfield of artificial intel- ligence, since it mostly addresses the task of creative de- velopment of computing products and with how to write software that would better serve the needs of the creative community [13]. Infrastructures supporting computational creativity and the generation of creative systems are scarce, although some recent research attempts has tried to fill this gap. One of the recent developments is FloWr [4], a system for im- plementing creative systems as scripts over processes and manipulated visually as flowcharts. Another is the Con- CreTeFlows infrastructure [27], which was developed to enable the construction, sharing and execution of compu- tational creativity (CC) workflows, composed of software ingredients of different partners of European project Con- CreTe1. Both of these infrastructures use different types of resources (e.g., musical, pictorial and textual inputs) in order to support the development of some typical CC task such as poetry generation, metaphor creation, generation of narratives, creation of fictional ideas and conceptual blen- ding. 
1http://conceptcreationtechnology.eu 78 Informatica 42 (2018) 77–84 M. Martinc et al. These platforms, which enable the user to build procedu- res capable of producing a variety of different creative arte- facts, could hardly be called creative systems, since they do not exhibit creative behavior in terms of automated work- flow development. The arguably most creative system for automated workflow construction, optimization and altera- tion, which is implemented in the FloWr platform, requires a lot of manual user input and could only be called creative with some major reservations. To fill the identified gap, this paper addresses the task of developing an infrastructure capable of autonomously composing novel scientific workflows by creatively combi- ning individual software components or even entire work- flows originally designed for specific tasks in different rese- arch fields. We consider the process of autonomous work- flow composition—which we name creative software blen- ding in this paper—to be an important first step towards a long term goal of creating software that could write code directly. The proposed system would be able to bridge dif- ferent scientific fields by combining methods from specific fields into novel interdisciplinary workflows. It would ide- ally also be capable of automated interdisciplinary research by autonomously discovering novel scientific procedures. This paper presents the design principles underlying a creative system described above. Section 2 introduces the research topic and presents the infrastructures suitable for the implementation of a creative software blending system. Section 3 motivates this research by presenting two exis- ting hand-blended workflows. Section 4 presents the re- lated software blending and computational creativity rese- arch, followed by an outline of the desired system functi- onality, investigating the necessary conditions for the im- plementation of a creative system for autonomous creative workflow generation. The paper concludes by presenting plans for future work. 2 Research background and infrastructures As background to our creative software blending research, this section first outlines some creativity support tools, fol- lowed by a brief description of a selection of easy-to-use workflow management systems that allow the user to com- pose complex computational pipelines in a modular visual programming manner. 2.1 Creative software As Colton’s and Wiggins’ definition of computational cre- ativity [6] is hardly operational for measuring creativity of a program, G. Ritchie [23] proposed some empirical crite- ria for attributing creativity to a computer program. The main idea is to use empirically observable and comparable factors, such as the properties of the generated output of the creative system, when trying to assess the creativity of a system. These observable factors can be judged by two quantifiable and essential criteria: Novelty of an output determines to what extent is the produced item dissimilar to existing examples of its genre. Quality of an output determines to what extent is the pro- duced item a high quality example of its genre. Using these criteria, we can say that the system for cre- ative software blending is creative if it outputs novel and high quality scientific workflows. Another relevant question is what types of creative beha- viors exist and how can they be computationally modeled. 
Boden [3] distinguishes three basic types of creativity: Combinational creativity involves making unfamiliar combinations of familiar ideas. Exploratory creativity involves exploration of a concep- tual space, which is characterized as a structured style of thought, and coming up with a new idea or artefact within that thinking style. Transformational creativity refers to the modification of the conceptual space so that new kinds of ideas and artefacts can be generated. Combinational creativity is the easiest one to be modeled on a computer. However, created combinations should be meaningful and interesting, which usually requires a solid background knowledge and the ability to form and evaluate relations of many different types. Several programs exist that can explore a given space and invent new artefacts with a certain style, for example, a program for automatic music generation [19] or a program for generating game designs [7]. Some programs can even transform their conceptual space by altering their own rules; for example, evolutionary algorithms can make random changes in their current rules and by this evolve new structures. Another important distinction made by Boden [3] is a distinction between psychological creativity (P-creativity) and historical creativity (H-creativity). P-creativity rela- tes to creation of surprising, valuable ideas and artefacts that are new to the person who comes up with it. Howe- ver, if an artefact or idea has arisen for the first time in human history and (so far as we know) nobody else has had it before, then we are talking about H-creativity. We anticipate that if the targeted creative software blending sy- stem is to be an active participant in scientific discovery or artefact creation, it should ideally be H-creative, although even a P-creative system can play a very useful supporting role in scientific research and its development is therefore a worthy research goal. 2.2 Infrastructures A system for creative software blending would best be im- plemented inside an already existing infrastructure ena- bling interdisciplinary and creative scientific workflow Towards Creative Software Blending. . . Informatica 42 (2018) 77–84 79 composition. In this section we present the ClowdFlows and ConCreTeFlows platforms that host the two motivatio- nal use cases, but other platforms, such as FloWr [4], Ra- pid Miner [18], KNIME [2], ORANGE [8] are also worth exploring as potential infrastructures for creative software blending. ClowdFlows [16] is a cloud-based web application2 for composition, execution and sharing of interactive data mining workflows. It has a web based user interface for building workflows, runs in all major browsers and requires no installation. It contains a large set of work- flow components called widgets, which can be con- nected in a specific meaningful order to create a work- flow. ClowdFlows enables visual programming and has a graphical user interface which consists of a wid- get repository and a workflow canvas. ConCreTeFlows [27] is a platform3 built on top of the ClowdFlows infrastructure. It is specialized in com- putational creativity tasks, including conceptual blen- ding based on textual or visual input or text generation tasks, such as poetry generation. 
The specialization of ConCreTeFlows in computational (and especially text-based) creativity, as well as a smal- ler number of implemented widgets, makes it less appro- priate for the implementation of the proposed system for creative software blending, but it is appropriate to show- case the creative blending process. On the other hand, ClowdFlows is not specialized in a single specific rese- arch field and contains widgets from the fields of text mi- ning, machine learning and NLP, which makes it appro- priate for the implementation of a creative software blen- ding system since combining tools from different research fields would most likely increase the chance of the sy- stem to be H-creative. As a basis of automated software composition, ClowdFlows already includes a—somewhat loosely defined—ontology of its components (named wid- gets), which should be enhanced and elaborated in further work, to enable ClowdFlows to actually become a useful infrastructure for software blending. 3 Motivational use cases This section presents two hand-crafted motivational work- flows, which illustrate the usefulness of blending software from different scientific fields in order to develop new in- novative scientific methods. In this sense, they represent the type of workflows that a system for creative software blending would be capable to produce. 3.1 Wordification use case: Blending data mining and text mining in ClowdFlows Propositionalization [15] is an approach to inductive lo- gic programming (ILP) and relational data mining (RDM), which offers a way to transform a relational database into a propositional single-table format. Consequently, learning with propositionalization techniques is divided into two self-contained phases: (1) transformation of relational data into a single-table format and (2) selecting and applying a propositional learner to the transformed data set. As an ad- vantage, propositionalization is not limited to specific data mining tasks such as classification, which is usually the case with ILP and RDM methods that directly induce pre- dictive models from relational data. This section motivates creative software blending by outlining the Wordification workflow [22], implemented in ClowdFlows, which per- forms propositionalization by combining data mining and text mining techniques. In the Wordification workflow, shown in Figure 1, given a MySQL relational database as input, the user selects the target table from the initial relational database, which will later represent the main table in the Wordification compo- nent of the workflow. The user is able to discretize each of the tables using one of the available discretization techni- ques. These discretized tables are used by the Wordifica- tion widget, where the transformation from the relational tables to a ‘corpus of documents’ is performed. Several elements of blending data mining and text mi- ning techniques are incorporated in the Wordification: i.e. transforming attribute values into bags of word-like items, using TF-IDF weighting of items, and the possibility of using n-grams of items where n-gram construction is per- formed by taking every combination of length n of items from the set of all items corresponding to the given indivi- dual. 
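As a toy illustration of this transformation (not the actual Wordification implementation from [22]; the table and attribute names are invented), a relational example can be turned into a bag of word-like items as follows:

```python
# Toy sketch of the wordification idea: every attribute value of the target
# individual and of its related rows becomes a word-like item of the form
# "table_attribute_value", so the relational example becomes a "document";
# TF-IDF weighting and n-gram construction would follow.
from collections import Counter

def wordify(individual, related_rows):
    items = [f"{table}_{attr}_{val}"
             for table, row in [individual, *related_rows]
             for attr, val in row.items()]
    return Counter(items)

doc = wordify(("customer", {"age": "30-40", "gender": "f"}),
              [("order", {"product": "book"}), ("order", {"product": "cd"})])
print(doc)
# Counter({'customer_age_30-40': 1, 'customer_gender_f': 1,
#          'order_product_book': 1, 'order_product_cd': 1})
```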
Nevertheless, the element of the workflow that most clearly illustrates the software blending potential is the inclusion of a word cloud visualization (an approach developed in text mining research) together with decision tree construction and visualization (an approach developed in data mining research).

3.2 Conceptual blending use case: computational creativity in ConCreTeFlows

The elements of conceptual blending theory [12], described in more detail in Section 4, are an inspiration to many algorithms and methodologies in the field of computational creativity. In brief, according to this theory, two different concepts for which we can define (find) a similarity can be blended into a new concept in the context of the knowledge that is necessary to represent and generalize the two concepts.

Figure 1: ClowdFlows Wordification workflow with additional analyses after the wordification process, available at http://clowdflows.org/workflow/1455/.

Figure 2: Workflow implementation of multimodal blending in ConCreTeFlows, available at http://concreteflows.ijs.si/workflow/137/.

Let us present a conceptual blending CC workflow [27], implemented in the ConCreTeFlows platform by different partners of the ConCreTe project. Its process components are implemented as internal functions, wrapped standalone programs, or Web services. The publicly available workflow, presented in Figure 2, can be executed, changed and extended with additional functionality.

The workflow demonstrates conceptual blending by constructing conceptual graphs from textual input and representing the results (blends) as graphs, natural language descriptions and visual representations. Two textual inputs are transformed into conceptual graphs by a series of widgets: Download web page, which obtains the Web page source from a given URL (in the example, these are the Wikipedia pages for two animals, the hamster and the zebra), Boilerplate removal, and Text2Graph, which transforms the textual content into conceptual graphs (output g). The outputs of the Text2Graph widgets enter Blender basic, which blends the two graphs together and outputs a combined blended graph (output bg). This graph is passed to the Textifier widget, which produces a textual description of the blend; its output is presented by a standard Display String widget. The two main entities from the Text2Graph widgets also enter the Vismantic2 visual blending widget [28], which either changes the texture of one input space to the texture of the other (see Figure 3a) or puts one in the usual surroundings of the other (Figure 3b). Its outcome is shown in an output similar to the ones shown in Figure 3.
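The Blender basic step described above can be illustrated with a very rough sketch that merges two concept graphs, represented as sets of (subject, relation, object) triples, onto a single blended node. The function and the toy triples are purely illustrative and are not taken from the ConCreTeFlows implementation.

```python
def blend_graphs(g1, g2, pair):
    """Very rough stand-in for the 'Blender basic' step described above:
    map one concept from each input graph onto a single blended node and
    project both sets of relations onto it. The real widget is more involved."""
    a, b = pair                      # e.g. ("hamster", "zebra")
    blended = f"{a}-{b}"

    def rename(triples, concept):
        return {(blended if s == concept else s, r, blended if o == concept else o)
                for (s, r, o) in triples}

    return rename(g1, a) | rename(g2, b)

hamster = {("hamster", "is_a", "rodent"), ("hamster", "lives_in", "cage")}
zebra   = {("zebra", "is_a", "equid"), ("zebra", "has", "stripes")}
print(blend_graphs(hamster, zebra, ("hamster", "zebra")))
```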
4 Towards design principles for creative software blending

There are two major paradigms in artificial intelligence research: problem solving and artefact generation [5]. While the problem solving paradigm deals with a series of problems that need to be solved, in the artefact generation paradigm the task is to generate a series of valuable artefacts. This study is more related to the latter, and the artefacts of our interest are functional workflows. A creative software blending system should be able to build new workflows composed of software components from different fields, leading to novel ways of software composition for computational purposes that were not expected in advance. Such blending of software would best be implemented in an existing infrastructure for interdisciplinary scientific research with already implemented components for specific and well-defined tasks.

As shown in Section 2.2, much effort in the fields of data mining and NLP has already been devoted to the development of infrastructures that support easier and quicker experimentation. One of the biggest challenges in the implementation and use of these infrastructures has been the integration of different components into functional workflows. Combining different tools and technologies in a common infrastructure is a difficult task because of software incompatibility and inappropriately defined ontologies.

Figure 3: Two outputs of the Vismantic2 widget for the example of blending the concepts of hamster and zebra: on the left, the result of exchanging the hamster's texture with the zebra's; on the right, an example of placing the zebra in the hamster's common visual context.

4.1 Related software blending research

To design an appropriate creative software blending system one should consider three fields of study. First, one has to reflect upon the concept of creativity and how to build software that exhibits creative behavior (see the related research in Section 2.1). Next, one has to be aware of the strengths and limitations of the existing infrastructures that could be used as a platform for the implementation of our system (see the related infrastructures in Section 2.2). Finally, one has to become aware of existing implemented approaches to software blending, surveyed below.

While the FloWr framework [4] is conceptually very similar to the two infrastructures described in Section 2.2, it is currently the only one with the specifically defined aim of automatically optimizing, altering and ultimately generating novel workflows presented as flowcharts. This automatic workflow generation via the combination of code modules means that FloWr has the potential to innovate at the process level, and its manifest long-term goal is a software system that can write program code for itself [4]. Although the platform currently does not support fully functioning software blending, some preliminary experiments to automatically alter, optimize and generate flowcharts have been conducted.

One of the FloWr experiments dealt with the automatic construction, from scratch, of a system for producing poetic couplets. In order to reduce the number of possible combinations of different workflow components, only a subset of all the available components was manually selected for blending in the experiment. The possible options for the input parameters were manually reduced, and the number of components in a generated workflow was limited to between 3 and 5. Despite these limitations, there were still over 261 million variable definition combinations. For this reason a brute-force approach of testing all combinations was intractable, so a depth-first search over possible workflows was implemented in such a way that just one node combination and one parameter setting were randomly selected from the set of allowed combinations. The compatibility of sequential components and some other restrictions were taken into account, which reduced the number of possible workflow candidates. The algorithm was run 200 times, resulting in 200 workflows.
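A rough sketch of this kind of constrained random sampling of workflows is given below. The component pool, parameters and compatibility test are invented for illustration; the actual FloWr experiment [4] is considerably more elaborate.

```python
import random

def sample_workflow(components, min_len=3, max_len=5, compatible=lambda a, b: True):
    """Randomly draw one node combination and one parameter setting per run,
    in the spirit of the FloWr experiment described above (details differ)."""
    length = random.randint(min_len, max_len)
    chain, last = [], None
    for _ in range(length):
        candidates = [c for c in components if last is None or compatible(last, c)]
        if not candidates:
            return None                      # dead end: no compatible follow-up
        name, params = random.choice(candidates)
        chain.append((name, {p: random.choice(vals) for p, vals in params.items()}))
        last = (name, params)
    return chain

# Toy component pool: (name, {parameter: allowed values}).
pool = [("WordLister", {"min_len": [3, 4, 5]}),
        ("RhymeFinder", {"scheme": ["AA", "AB"]}),
        ("CoupletWriter", {"syllables": [8, 10]})]
workflows = [w for w in (sample_workflow(pool) for _ in range(200)) if w]
print(len(workflows), workflows[0])
```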
A manual evaluation showed that 18.5% of the workflows worked successfully and produced poetic couplets. The conducted experiment required a lot of human intervention in order to be successful, and the evaluation of the produced artefacts was done by humans. Because of this, the creativity of the proposed software blending approach can be questioned, since, at least in our opinion, software should have the capacity to evaluate its own performance in order to be called creative.

While FloWr belongs to CC research, several attempts have been made to develop support systems in the field of knowledge discovery as well. These systems are to some extent related to our research, since they either support the user's workflow composition by recommending new components that could be attached to an existing workflow, or generate entire workflows according to user requirements.

Žakova et al. [26] proposed a semi-automatic system for workflow generation based on a background knowledge ontology in which all workflow components are described together with their inputs, outputs and pre-/post-conditions. The system uses a planning algorithm and returns just one optimal workflow with the smallest number of processing steps. Given that alternative workflows are not generated, this is not in accordance with the desired system for creative software blending. Complexity limitations are another problem, common to all systems that use planning approaches for workflow generation.

The IDEA system by Bernstein et al. [1] is based on an ontology of data mining components that guides the workflow composition and contains heuristics for ranking different alternatives. The system does not enable fully automatic workflow generation but was implemented as a support system for a user who decides on the weights that determine the trade-off between different performance criteria (e.g., speed, accuracy, comprehensibility). IDEA is limited to proposing fairly simple workflows.

Figure 4: The conceptual blending network [11].

Kietz et al. [14] proposed a KDD support system that uses a data mining ontology. The ontology contains information about the objects manipulated, the meta data, the operators (i.e., components containing algorithms for specific tasks) and a description of the goal, which is a formalization of the user's desired output. The system takes a goal description as input and returns a workflow together with all the evaluation and reporting needed to let the user assess whether it fulfills the user-defined success criterion. The system, implemented in the RapidMiner platform, is not fully autonomous, as it was designed as a support system for the user.

4.2 Design principles

As the above survey shows, no adequate solution for autonomous creative software blending currently exists. To build such a system, first, an ontology with well-defined rules and relations needs to be created in order to enable combining software components in a meaningful way, as sketched below. Next, a system for creative blending of software components would be created, enabling the automated combination of components into functional workflows.

Computational creativity, which is still in the early phases of its development [25], provides some methodological apparatus and inspiration for designing the guiding principles of a creative software blending system.
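Before turning to that apparatus, the ontology requirement stated above can be made concrete with a small sketch: a hypothetical type hierarchy over component inputs and outputs and the kind of compatibility test that the planning-based systems surveyed above rely on. The type names are invented for illustration only.

```python
# Hypothetical mini-ontology: each type names its more general parent type.
IS_A = {"csv_table": "table", "relational_db": "data", "table": "data",
        "token_list": "text", "text": "data"}

def subsumes(general, specific):
    """True if `specific` is the same as or a descendant of `general`."""
    while specific is not None:
        if specific == general:
            return True
        specific = IS_A.get(specific)
    return False

def can_follow(producer_out, consumer_in):
    """A component may follow another if its declared input subsumes the output."""
    return subsumes(consumer_in, producer_out)

print(can_follow("csv_table", "table"))   # True: a csv_table is a table
print(can_follow("text", "table"))        # False: text cannot feed a table consumer
```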
One of the most productive fields of research in computational creativity is conceptual blending (CB) theory [12], which has inspired many algorithms, methodologies and discussions in the field (e.g., [24, 21, 17]).

CB is a basic mental operation that leads to new meaning, global insight, and conceptual compressions useful for memory and for the manipulation of otherwise diffuse ranges of meaning [9]. A key element is the mental space, a partial and temporary structure of knowledge built for the purpose of local understanding and action [10]. To describe the CB process, the theory [12] makes use of a network of four mental spaces (see Figure 4). In blending, structure from two input mental spaces (Input I1, Input I2) is projected to a new space, the blend. A partial mapping is performed between elements of the input spaces that are perceived as similar or analogous in some respect. The third mental space, called the generic space, encapsulates the conceptual structure shared by the input spaces, generalizing and possibly enriching them. This space provides guidance to the next step, in which elements from each of the input spaces are selectively projected into the blend, i.e., the new blended mental space. Emergent structure arises in the blend that is not copied there directly from any input.

The conceptual blending model is not directly transferable from human cognition to the blending of software. However, the methodology, together with the optimality principles [11] that optimize the blending process and have already been addressed in computational models [20], should be considered when implementing the software blending algorithm and the workflow ontology. For example, in software blending the two inputs would not represent concepts but rather two workflows from two different scientific domains. The "generic space" could then be adapted to software blending so that the blending system can find all the compatible widgets from the two different input workflow domains. Finally, the blend would be a newly produced workflow containing new emergent structures not copied from the original workflows. The optimality principles, such as the relevance principle (which dictates that all elements in the blend should be relevant) and the integration principle (which states that the final blend should be perceived as an enclosed unit), should be kept in mind when designing the ontology.

Another important aspect to be considered in the implementation of the system is its creative part. In order for the system to be recognized as creative, its produced artefacts should be novel and of good quality [23], and human interference in the production and evaluation of these artefacts should be minimal. Three criteria have been proposed for attributing creative autonomy to a system [25]:

Autonomous evaluation: The system should be able to evaluate new creations autonomously and possess its own "opinion" on which creations are better than others.

Autonomous change: The system should be able to change its evaluation function without explicit directions.

Non-randomness (aleatoricism): Random behavior is not creative, so evaluation and change should not be completely random, although some randomness can be involved.
In order to satisfy these criteria, and since most of the aforementioned platforms contain a large set of manually built workflows that could be used as a training set, we propose a combination of an evolutionary algorithm and classification model induction. The evolutionary algorithm would operate directly on representations of workflows and generate new workflow candidates according to the constraints defined by the ontology rules. These constraints would enforce a minimum quality for the produced workflows (corresponding to the criterion of quality [23], which is, as explained earlier, one of the guiding principles in the construction of creative artefacts).

The initial population of the evolutionary algorithm would consist of manually built workflows that would be "blended" into new workflow candidates with the help of mutation and crossover. The fitness function used for evaluating the generated workflow candidates would contain the following elements: (1) a binary classification model, trained on features extracted from successful and unsuccessful workflows, serving as an additional workflow quality check; and (2) a similarity function for determining the similarity between a generated workflow candidate and the existing workflows, used for evaluating the novelty of the candidate.

In this way the system would be able to generate new, possibly creative, workflows and even propose changes to the existing rules for workflow generation, which would make it capable of transformational creativity according to [3].
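A minimal sketch of the proposed evolutionary loop is given below, assuming workflows are represented simply as lists of component names and that the fitness function already combines the quality model and the novelty score described above. All names and parameter values are illustrative placeholders, not an implementation of the planned system.

```python
import random

def evolve_workflows(seed_workflows, fitness, generations=50, pop_size=30):
    """Evolutionary loop over workflow representations (lists of component names).
    Ontology constraints are reduced here to a crude length check."""
    population = list(seed_workflows)
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(population, 2)
            cut = random.randint(1, min(len(a), len(b)) - 1)
            child = a[:cut] + b[cut:]                                   # crossover
            if random.random() < 0.2:
                child[random.randrange(len(child))] = random.choice(a + b)  # mutation
            if 3 <= len(child) <= 7:                                    # constraint check
                offspring.append(child)
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population[0]

# Toy run: fitness rewards component diversity as a stand-in for quality plus novelty.
seeds = [["Load", "Discretize", "Wordify", "TFIDF", "Tree"],
         ["Load", "Text2Graph", "Blend", "Textify"]]
print(evolve_workflows(seeds, fitness=lambda wf: len(set(wf))))
```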
5 Conclusions

In this study we elaborate the initial design principles of a system for automatic workflow generation that would be capable of autonomous composition of novel workflows from existing software components. We have presented two workflows with human-designed blending, implemented in the ClowdFlows and ConCreTeFlows platforms for online workflow composition. The first workflow clearly illustrated the potential for the composition of computational creativity solutions. The second use case presented several computational creativity software components that were combined in a collaborative effort to implement an interesting conceptual blending solution, resulting in conceptual, visual and textual blends. The benefits of a unifying workflow for blending are twofold: the user can obtain blends of various kinds through the same user interface, and the components can affect one another to produce a more coherent and orchestrated set of multimodal blending results. The presented prototype solution is fully operational and serves as a proof of concept that such an approach to multimodal conceptual blending is possible.

On the other hand, the sketched evolutionary algorithm approach to blending workflows and workflow components shows that the theory of conceptual blending can be transferred to the problem of creative software blending. We also argued that the system would be capable of self-evaluation by using the empirical criteria of novelty and quality in the fitness function.

In our future work we will first design an ontology capable of supporting the planned widget recommender system. We also plan to integrate a larger number of widgets and workflows into the presented platforms. Moreover, we will undertake the challenging task of the implementation. We realize that the creation of software that can innovate at the process level is a very demanding task, and we can expect many challenges during this phase. Nevertheless, we believe that the effort will be fruitful and bring us closer to the long-term goal of creating software that could write novel and valuable code directly.

Acknowledgments

We acknowledge the support of the Slovenian Research Agency through the research programme Knowledge Technologies (grant number P2-0103) and the project ClowdFlows Data and Text Analytics Marketplace on the Web (CF-Web), which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 754549. We would like to thank Pedro Martins and Amilcar Cardoso for numerous discussions on the topic of conceptual blending.

References

[1] Bernstein, A., Provost, F., Hill, S.: Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Transactions on Knowledge and Data Engineering 17(4), 503–518 (2005)
[2] Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Thiel, K., Wiswedel, B.: KNIME – the Konstanz Information Miner: Version 2.0 and beyond. ACM SIGKDD Explorations Newsletter 11(1), 26–31 (2009)
[3] Boden, M.A.: Creativity in a nutshell. Think 5(15), 83–96 (2007)
[4] Charnley, J., Colton, S., Llano, M.T.: The FloWr framework: Automated flowchart construction, optimisation and alteration for creative systems. In: Proc. of the Fifth International Conference on Computational Creativity, pp. 315–323 (2014)
[5] Colton, S., Ramezani, R., Llano, M.: The HR3 discovery system: Design decisions and implementation details. In: Proc. of the AISB Symposium on Computational Scientific Discovery (2014)
[6] Colton, S., Wiggins, G.: Computational creativity: The final frontier? In: Proc. of the 20th European Conference on Artificial Intelligence, pp. 21–26 (2012)
[7] Cook, M., Colton, S.: Multi-faceted evolution of simple arcade games. In: IEEE Conference on Computational Intelligence and Games (CIG), pp. 289–296 (2011)
[8] Demšar, J., Zupan, B., Leban, G., Curk, T.: Orange: From experimental machine learning to interactive data mining. In: European Conference on Principles of Data Mining and Knowledge Discovery, pp. 537–539. Springer (2004)
[9] Fauconnier, G., Turner, M.: Conceptual blending, form and meaning. Recherches en communication 19(19), 57–86 (2003)
[10] Fauconnier, G.: Mental Spaces: Aspects of Meaning Construction in Natural Language. Cambridge University Press (1994)
[11] Fauconnier, G., Turner, M.: Conceptual integration networks. Cognitive Science 22(2), 133–187 (1998)
[12] Fauconnier, G., Turner, M.: The Way We Think: Conceptual Blending and the Mind's Hidden Complexities. Basic Books (2002)
[13] Hugill, A., Yang, H.: The creative turn: New challenges for computing. International Journal of Creative Computing 1(1), 4–19 (2013)
[14] Kietz, J., Serban, F., Bernstein, A., Fischer, S.: Towards cooperative planning of data mining workflows. In: Proc. of the Third Generation Data Mining Workshop at ECML/PKDD-2009, pp. 1–12 (2009)
[15] Kramer, S., Lavrač, N., Flach, P.A.: Propositionalization approaches to relational data mining. In: Džeroski, S., Lavrač, N. (eds.) Relational Data Mining, pp. 262–292. Springer (2001)
[16] Kranjc, J., Podpečan, V., Lavrač, N.: ClowdFlows: A cloud based scientific workflow platform. In: Proc. of ECML/PKDD (2), pp. 816–819. Springer (2012)
[17] Martins, P., Pollak, S., Urbancic, T., Cardoso, A.: Optimality principles in computational approaches to conceptual blending: Do we need them (at) all? In: Proc. of the Seventh International Conference on Computational Creativity, UPMC, Paris, France, June 27–July 1, 2016, pp. 346–353 (2016)
[18] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: YALE: Rapid prototyping for complex data mining tasks. In: Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–940. ACM (2006)
[19] Monteith, K., Martinez, T., Ventura, D.: Automatic generation of music for inducing emotive response. In: Proc. of the International Conference on Computational Creativity, pp. 140–149 (2010)
[20] Pereira, F.C., Cardoso, A.: Optimality principles for conceptual blending: A first computational approach. AISB Journal 1(4) (2003)
[21] Pereira, F.C.: Creativity and AI: A Conceptual Blending Approach. Ph.D. thesis, Dept. Engenharia Informática da FCTUC, Universidade de Coimbra, Portugal (2005)
[22] Perovšek, M., Vavpetič, A., Cestnik, B., Lavrač, N.: A wordification approach to relational data mining. In: Proc. of the International Conference on Discovery Science, pp. 141–154. Springer (2013)
[23] Ritchie, G.: Some empirical criteria for attributing creativity to a computer program. Minds and Machines 17(1), 67–99 (2007)
[24] Schorlemmer, M., Smaill, A., Kühnberger, K.U., Kutz, O., Colton, S., Cambouropoulos, E., Pease, A.: COINVENT: Towards a computational concept invention theory. In: Proc. of the 5th International Conference on Computational Creativity, pp. 288–296 (2014)
[25] Toivonen, H., Gross, O.: Data mining and machine learning in computational creativity. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(6), 265–275 (2015)
[26] Žakova, M., Kremen, P., Železný, F., Lavrač, N.: Planning to learn with a knowledge discovery ontology. In: Proc. of the Planning to Learn Workshop (PlanLearn 2008), vol. 951 (2008)
[27] Žnidaršič, M., Hervás, R., Alves, A.O., Oliveira, H.G., Xiao, P., Linkola, S., Toivonen, H., Kranjc, J., Lavrač, N.: Computational creativity infrastructure for online software composition: A conceptual blending use case. In: Proc. of the 7th International Conference on Computational Creativity (2016)
[28] Xiao, P., Linkola, S.: Vismantic: Meaning-making with images. In: Proc. of the Sixth International Conference on Computational Creativity (ICCC 2015), pp. 158–165 (2015)

Graph Theoretical View on Text Understanding

Jure Zupan
National Institute of Chemistry, Ljubljana
E-mail: jure.zupan@ki.si

Keywords: graph theory, cyclic-connected graph, topological distance, network text analysis, information content

Received: October 6, 2017

The STAVEK-02 system described in this contribution focuses on yielding supplemental information, beyond the parsing and tagging of words, for text understanding through the clustering of nouns and/or verbs according to their meanings and common features. The system consists of two word-processing blocks. The first block is a vocabulary of 149,000 Slovenian word-roots and 3,100 endings; it assigns grammatical features to words by grammatical rules alone, without any link to pre-tagged lexical corpora.
The second block is a network of meanings of Slovenian words, in principle a graph connecting 45,000 noun and 15,000 verb lexemes, all of them hierarchically clustered into larger and larger groups that exhibit specific features and/or common properties of the words they encompass. Such formations are usually called synsets in similar lexical systems. Due to the complete connectivity between the synsets (groups) in the graph, it is possible to find all possible property/feature paths between any pair of words (nouns and/or verbs) in the network. Because the clustering of words according to their meanings is made during the parsing of one, a pair, or several consecutive sentences, the features and properties that appear on the closest path between the particular words within the sentence are quite informative for the interpretation of the text. Clustering words according to their meanings during the parsing of text is a novel concept of text interpretation. On the basis of a simple example of parsing a sentence and clustering the nouns within it, the concept of using the network of meanings in the program STAVEK-02 is described and discussed.

Povzetek: The described system STAVEK-02 is oriented towards a broader extraction of information from Slovenian texts than word analysis and word tagging alone. Its basis is two program components. The first consists of a database of 149,000 word roots and 3,100 endings; the second of 45,000 nouns and 15,000 verbs, which are grouped by various common characteristics into a connected cyclic graph. The first component performs the grammatical tagging of the words in the text, while the second computes topological distances between individual words, or between groups of words (synsets) with similar properties and characteristics hierarchically connected in the graph, and draws a scheme of how the groups of nouns or verbs link together. It turns out that the topologically computed distance between words represents the semantic difference/similarity between them well. Together, the two word collections contain and process the major part of the most frequent Slovenian words (about 149,000 words). The paper explains some pitfalls of Slovenian in handling the multiple senses of text, describes the structure of the cyclic graph of words (nouns and verbs) and the way the topological distance between words is computed, and emphasizes the bidirectionality of paths and walks in this graph of words. A short example of the analysis of a sentence is added, ending with a matrix of topological distances between the words of the sentence and a similarity tree. Finally, some possibilities for the development of the STAVEK-02 system and of the hierarchical network for determining the meanings of Slovenian words are mentioned.

1 Introduction

The parsing or tagging of the words in a sentence provides the user with all the relevant grammatical features of each word, which is in itself a very hard task to implement, whether by computer or by hand. The fact that most modern parsing programs rely on large corpora of previously parsed data does not mean that the efforts and programs that solve the tagging of sentences by hand are either unnecessary or outmoded. Even if one forgets that the testing of parsing algorithms based on previously parsed corpora relies in the first place on hand-made parsing, ab-initio parsing, i.e., parsing exclusively by grammatical rules, will always be necessary. It should not be forgotten that statistical solutions mostly ignore the occurrences of rare specific cases.
Such problems can be handled more easily by considering and combining both methods (corpus-driven and rule-based tagging) consecutively and/or iteratively. Consider, for example, the problem of words having two or more clearly different meanings, of which at least two have grammatically correct but, for any kind of machine parsing or rule-based tagging, completely indistinguishable forms. Unfortunately, in Slavic languages, with a much higher degree of inflection than English, the problems of word senses begin already at the parsing level. In the case of a grammatically correct sentence with two completely different interpretations of word senses it is possible that no parsing can correctly identify even the word classes of the constituent words, let alone the senses. A possible solution to such problems is to list all possible meanings or senses of each word and leave this information for further consideration, when the context of the following sentences allows singling out the actual meaning.

For example, neither the sentence To je dobro za vas nor the title of the well-known Slovenian story Martin Krpan can be tagged correctly by the computer. In the first case the word vas can be interpreted either as for you or, alternatively, as the village; hence, the sentence can mean either This is good for you or This is good for the village. In the second example, the title of the well-known Slovenian story Martin Krpan introduces the name of the main character. However, the title has, unfortunately, a second grammatically correct reading of the word Martin, not as a noun (the name Martin) but as an adjective meaning belonging to the female Marta, which would imply that the man named Krpan is Marta's husband or at least involved with Marta. Of course, a machine interpretation based on pre-tagged corpora will always yield the grammatically 'correct', i.e., most frequently used, variant, but at the same time it will always omit the less probable, yet grammatically correct, possibilities, which can nevertheless appear in spoken or written communication and should therefore at least be considered. Such cases are handled better by rule-based tagging than by statistical approaches.

In order to bring attention to such possibilities and to provide a tool that helps developers of man-machine dialog handle such cases, the program STAVEK-02 was developed; it offers the option of showing all grammatical possibilities and additionally provides the user with clusters of the various word meanings for each sentence (or group of sentences). It is described in this paper.

2 Related work

The system most closely related to the PMSB (Pomenska mreža slovenskih besed [1], Engl. Network of Meanings of Slovenian Words) used by the program STAVEK-02 is the well-known WordNet lexical collection [2,3] developed at Princeton University, with its graphic visualization VisuWords [4] based on the Thinkmap data visualization technology. In order to handle the difficulties arising from cross-language differences in the meanings of lexical words, the Universal WordNet (UWN) project was launched [5,6]. Following the UWN suggestions and guidelines, specific versions for close to 200 different languages are now under development. Similar to other Slavic languages (see Polish [7] or Bulgarian [8], for example), the Slovenian version, named sloWNet [9], is progressing as well. At the moment the version described in the present paper is not included in sloWNet.
Several features of the PMSB are similar to WordNet, but some are not. The organization of noun synsets into hypo-, hypero-, mero- and holonym groups (the word A is a meronym of B if A is a part of B; the nose is a part of the head, while the head is a holonym of the nose) is very similar, while the organization of verbs in PMSB, which closely follows the six-branch division (to exist, to have, to move, to do/to, to think/to create, and to sense/to) suggested by Vidovič Muha [10], is quite different. The way the distances between word senses are calculated in PMSB is practically the same as the similarity evaluation between two synsets in WordNet: it calculates the length of the shortest path between two nodes in the graph. It is worth noting that this graph-theoretical path distance is not related to the distances between objects (words) represented by the multi-dimensional distributed representations of word vectors as obtained by the word2vec software [11] developed by Tomas Mikolov at Google. The number of words and meanings (synsets) in PMSB, 60,000 and 110,000, respectively, is already large enough to cover a large variety of texts.

A considerable difference from WordNet lies in the design of our network. Although the PMSB can act as a stand-alone program in the role of a thesaurus of the Slovenian language, it is actually designed as a subroutine supporting the system STAVEK-02, whose goal is to enhance and/or improve the man-machine dialog by pinpointing and/or explaining the meanings of specific words. This goal can be clearly seen in the selection of the hyper- and hyponym groups of the PMSB network, which is described in more detail in the following section.

3 Hierarchical Network of Meanings of Slovenian Words (PMSB)

The solution to the discussed information-enhancement problem seems to be the organization of words into a network linked according to common features or some other commonly present or absent property (or properties). Therefore, the links (branches) between nodes in the graph must contain meaningful information about the relation between the nodes they connect. For example, if one node is labeled tool and the other object (man-made), the link between them must express that the first node (synset), labeled tools, is a part of the second node, labeled all man-made objects, and not vice versa. At the same time, these two nodes should occupy positions in the network much closer to each other than either has to the synset labeled insect, for example. Either individual words or clusters of words can simultaneously be members of several groups (synsets with a larger number of meanings), which makes the network contain cyclic paths (circular paths between clusters) in its structure (Figure 1).

The PMSB network consists of 45,000 noun and 15,000 verb dictionary lexemes (words), forming 85,000 and 25,000 different entries of noun and verb meanings, respectively. For example, if 'konj' (Engl. horse) is one of the 45,000 lexemes, the four senses of the word 'horse' in the Slovenian language (horse as an animal, horse as a clumsy man, horse as a chess piece, and horse as a gymnastic apparatus) are four of the 85,000 noun meanings or senses.
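A hypothetical in-memory shape for such a network is sketched below; the actual PMSB encoding is not published in this form, so the node labels and structure shown here are illustrative only.

```python
# Hypothetical in-memory shape for a PMSB-like network (not the actual encoding):
# every node keeps explicit 'up' and 'down' neighbour lists, and a lexeme such as
# 'konj' is linked to one terminal node per sense.
network = {
    "Ntop":              {"up": [],       "down": ["Noun"]},
    "Noun":              {"up": ["Ntop"], "down": ["nature (animal kingdom)", "product (material)"]},
    "nature (animal kingdom)": {"up": ["Noun"], "down": ["konj/animal"]},
    "product (material)":      {"up": ["Noun"], "down": ["konj/chess-piece"]},
    "konj/animal":       {"up": ["nature (animal kingdom)"], "down": []},  # terminal sense
    "konj/chess-piece":  {"up": ["product (material)"],      "down": []},  # terminal sense
}
lexemes = {"konj": ["konj/animal", "konj/chess-piece"]}   # two of the four senses

def senses(lexeme):
    """All terminal sense nodes registered for a lexeme."""
    return lexemes.get(lexeme, [])

print(senses("konj"))
```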
Using this kind of reasoning, a graph of word meanings and properties containing close to 4,500 clusters of words (nodes) was generated [1]. The collections closest to our database are Levin's collection of verb classes [12] and Dornseiff's Wortschatz [13]. There are also various Internet versions, such as WordNet [2,3] and, for the Slovenian language, sloWNet [9]. As far as the size, i.e., the number of words, is concerned, only Dornseiff's collection [13] has about the same number of verbs (14,000) as our collection.

Table 1: The first two levels of verbs (upper part) and nouns (lower part), grouped according to their common features. The number of words in each group is given in parentheses. Because an individual word can have several meanings or senses, it is listed in as many groups (synsets) as there are meanings; therefore, the sum of the words given in parentheses is larger than the number of meanings in the network.

VERBS (24,626)
- Verbs of existing (3,405): to exist in a specific way (542), verbs to sustain living (1,427), to end existence (299), emission verbs (949), weather verbs (187)
- Verbs of having (1,339): to possess (154), to obtain/take (333), to use possession (288), to negotiate possession (461), to spend possession (102)
- Verbs of moving (3,129): to move (general) (804), to move (specific way) (692), to move (body/parts) (629), to arrive/leave (676), to change movement (206), to do while moving (121)
- Verbs of doing (9,663): to put (2,416), to do (general) (669), to assemble/disassemble (1,340), to change (2,164), to use force/influence (1,322), to do complex tasks (1,751)
- Verbs of thinking/creating (1,583): to create (intellectually) (550), to think (general) (145), to think (specific) (407), to express thoughts with symbols (480)
- Verbs of communication (5,507): to exchange information (2,770), verbs of perception (322), to have/respond to feelings (883), verbs of social contact (1,531)

NOUNS (86,799)
- nature (31,988):
  - nature (non-living) (3,130), divided into: nature (general) (10), nature (phenomenon) (521), nature (physical parameter) (151), nature (space) (82), matter (general) (1,359), matter (Earth) (933), matter (outer-space) (84)
  - nature (living) (28,847), divided into: nature (general/broader) (4,218), nature (plant kingdom) (3,111), nature (animal kingdom) (3,431), nature (human) (18,087)
- product (19,222):
  - product (origin) (552), divided into: product (origin (human)) (40), product (origin (nature)) (53), product (origin (plant)) (258), product (origin (animal)) (201)
  - product (human) (18,670), divided into: product (human (material)) (13,190), product (human (intellectual)) (5,352), product (human (commodity)) (29), creation (general) (5), creation (limitation) (94)
- concept (35,589):
  - activity (11,645), divided into: activity (general) (101), activity (to do something) (3,507), activity (society) (3,045), activity (emotion) (76), activity (sense) (15), activity (existence) (1,068), activity (movement) (1,240), activity (communication) (1,912), activity (possession) (582), activity (mind) (97)
  - property (5,943), divided into: property (action) (323), property (animal) (45), property (broader meaning) (357), property (company) (17), property (device) (90), property (form) (62), property (general) (37), property (human) (2,774), property (mind) (128), property (matter) (267), property (nation) (35), property (number) (13), property (object) (482), property (phenomenon) (42), property (plant) (34), property (procedure) (390), property (religion) (15), property (ruling) (52), property (society) (111), property (sound) (39), property (space) (309), property (status) (159), property (word/speech) (123), group of properties (38)
  - and 8 other groups: event (1,208), form (3,169), group (1,958), phenomenon (526), procedure (992), result (5,342), space (1,532), state (2,910)
The part of our network containing verbs is based on six main groups [10]; it is already well described in the literature [14,15] and is accessible on the web [16]. The complete structure of the verb hierarchy for the English language (16,000 verbs and 1,000 groups) is given in [17]. The basic division of nouns has three groups: product, nature, and concept, as can be seen in the lower part of Table 1. The clusters of verbs and nouns at all levels of the hierarchy are of very different sizes (Table 1).

In contrast to English, the Slovenian lexical forms of verbs can be well distinguished from those of nouns; however, due to the high flexibility of Slovenian declension and conjugation (approximately 20 forms for each noun, verb, adjective, pronoun, and numeral), there are numerous cases where two or even three word types mix. For example, the sentence To je lepo padalo has two meanings: a) This is a nice parachute and b) It was falling nicely. In the first case the word padalo is a noun (parachute), while in the second case it is a verb (to fall). To keep all words together in one network (graph), both word types are linked in the network at the highest node.

It is worth mentioning that the same word in different languages has different synsets of meanings. This is the reason why such a hierarchy cannot be 'blue-printed' from one language to another. The effect of 'lost in translation' is unavoidable: each translated word could be connected to completely different clusters of words. For example, the English word plant in its botanical meaning can be linked with the Slovenian counterpart rastlina or the German Pflanze, but has no connection to its second sense of a production facility, like the Slovenian tovarna or the German Fabrik.

4 Semantic distance measure

Mathematically, the network is a connected cyclic bi-directional graph. Vertices or nodes represent single words, meanings and/or clusters of words with similar properties/features (synsets). The connected graph enables a continuous walk, described as a sequence of connected nodes (a path), between any two nodes. The graph is cyclic if it contains closed paths (cycles), i.e., paths that start and end on the same node, with all nodes on such a path different (with the exception of the closing node). A hierarchical graph has one special node, called the top node Ntop or root, distinguished from the other nodes by defining the orientation of the graph and the walk directions within it. All valid paths between nodes must have one of two directions: either towards Ntop (up) or away from it (down). Therefore, each node must have two lists of connections, one for its up-connected and one for its down-connected neighbors. Just as Ntop is the last node of all up-paths, at the end of any down-path there is always a node called a terminal, having no down connections. The terminal nodes are individual words, or senses if the word has only one sense (meaning).
The fact that a walk is not allowed to change direction ensures that from any node one can always reach either a terminal node or Ntop. Thus no walk with a constant direction can be captured in a cycle and end in an infinite loop. When new words are added or nodes are relocated, the described hierarchy prevents updates from generating infinite loops and self-referencing nodes.

All the explained features of our graph offer the advantage of calculating a topological distance between nodes. The topological distance Dij between two nodes Ni and Nj has all four properties of a standard metric distance:

1) Dij > 0 for all i ≠ j;
2) Dij = 0 only for i = j;
3) Dij = Dji (the distance is symmetric); and
4) Dij ≤ Dik + Dkj (the triangle inequality) for any node k.

To evaluate the topological distance Dij between two arbitrary nodes Ni and Nj in the graph, one would in principle need a complete connectivity matrix over all nodes. For a graph containing approximately 10^5 nodes this means storing and handling a matrix of about 0.5×10^10 distances. Fortunately, instead of keeping this large connectivity and/or distance matrix, only two connectivity tables are needed: one keeping all up connections and the other all down connections from each node to its neighboring nodes. Using these two connectivity tables it is straightforward to determine the topological distance between any two nodes Ni and Nj, i.e., words i and j. The procedure is as follows:

1. Find the complete set {Pi(Ni, Ntop)} of ni paths from the node Ni to the node Ntop.
2. Find the complete set {Pj(Nj, Ntop)} of nj paths from the node Nj to the node Ntop.
3. Compare the pairs of paths from both sets {Pi(Ni, Ntop), Pj(Nj, Ntop)}, k = 1...ni(nj – 1)/2, and for each pair k determine the common node Ck.
4. Determine the length lk of the path from node Ni to node Nj passing through node Ck for each pair k.
5. Keep the shortest path.

To summarize, the distance Dij between two nodes Ni and Nj is the length lk of the shortest path from node Ni to node Nj through a common node Ck from which both nodes Ni and Nj have access to Ntop:

Dij = min{ lk } over {Pk(Ni, Ck, Nj)}, k = 1...ni(nj – 1)/2,    /1/

where ni is the number of different paths from the node Ni to the top node Ntop, and P(A, C, B) is the path from node A to node B passing through node C.
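The five-step procedure and equation /1/ can be transcribed almost directly into code over an 'up' connectivity table. The following sketch uses a toy table loosely shaped after Figure 1; the labels are abbreviated and invented for illustration, and the real PMSB tables are of course far larger.

```python
def up_paths(node, up, top="Ntop"):
    """All directed walks from `node` to the top node, following only 'up' links."""
    if node == top:
        return [[top]]
    return [[node] + rest for parent in up[node] for rest in up_paths(parent, up, top)]

def distance(ni, nj, up, top="Ntop"):
    """Equation /1/: length (in edges) of the shortest path from ni to nj through
    a common node that both can reach on their way up to the top node."""
    best = None
    for p in up_paths(ni, up, top):
        for q in up_paths(nj, up, top):
            for c in set(p) & set(q):
                length = p.index(c) + q.index(c)
                best = length if best is None else min(best, length)
    return best

# Toy 'up' table roughly in the shape of Figure 1 (purely illustrative labels).
UP = {"kobilica/bridge": ["instrument (musical)"], "struna": ["instrument (musical)"],
      "instrument (musical)": ["product (material)"], "kobilica/locust": ["insect"],
      "insect": ["nature (animal kingdom)"], "product (material)": ["Noun"],
      "nature (animal kingdom)": ["Noun"], "Noun": ["Ntop"], "Ntop": []}
print(distance("kobilica/bridge", "struna", UP))           # 2: same musical-instrument synset
print(distance("kobilica/bridge", "kobilica/locust", UP))  # 6: the paths meet only at 'Noun'
```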
5 The Case Study

The described system STAVEK-02 can serve as a model of how to use the PMSB hierarchy of word meanings and synsets for enhancing the information in free text. The system can handle individual sentences input from the keyboard or text files of any size. The system handles sentences one by one; hence, the information is reported at the end of each sentence.

Figure 1: A simplified part of the discussed network of words showing the essential features of a cyclic bi-directional graph. Each label represents a node (synset). A cycle is a path that starts and ends on the same node. From the word kobilica, which has four meanings in the Slovenian language, six cycles can be drawn to calculate the six distances between all four meanings. Because the graph is bi-directional, only paths going up towards Ntop (against the arrows) or down to the terminal nodes (words, along the arrows) are allowed. The cycles are detected via the common nodes Ck on the paths.

First, all the grammatical information for all words in the sentence is reported (Part A in Figure 2). This part of the tagged text is similar in content, but quite different in form, to the output provided by the public Slovenian parser [18] available on the Slovenian ZRC portal. All tasks performed by the parser are executed ab initio, i.e., by grammatical rules without considering any corpus or web connection. For a highly inflected language like Slovenian, the parser is very important, because for tagging it must convert the words into the appropriate lexemes, which is the mandatory form for further search in the PMSB network. Our tagging system uses 149,000 Slovenian word-roots for all types of words and combines them with about 3,100 different endings in order to obtain the correct lemma for any Slovenian word in the SSKJ [19].

Second, for each noun and each verb appearing in the input sentence(s) the system searches through the PMSB network and lists all the meanings/senses associated with that noun or verb, together with the complete list of synsets from the lemma to the top of the network. The list of senses for all words, together with the full paths through the PMSB, is output after each sentence (Part B in Figure 2).

Third, the program draws a clustering scheme of up to 500 nouns or verbs together with all their senses. The cluster tree of senses is built into the system as an option and can be produced a) after each sentence, b) after a certain number of sentences, provided the number of words does not exceed 500, or c) at the end of parsing a text file, after the user selects up to 500 nouns or verbs from the list of the most frequent word types of the scanned text.

Finally, at the end of each session (either for one sentence or for a text file) the program yields a) statistics of the input text with respect to the word frequencies of all word types and separators, b) the distribution of word lengths (in characters) for each word type, and c) the frequencies of the 2,000 most frequently used nouns, adjectives, verbs, and adverbs.
In order to show the entire procedure in more detail, the output given by the system STAVEK-02 for three short consecutive sentences is worked out and discussed below. The three sentences, in English translation, are: The string on the violin broke. What was the cause? The bridge was damaged. (Slov. Na violini je počila struna. Kaj je bil vzrok? Kobilica je bila poškodovana.) (Figure 2, Parts A and B.)

Figure 2: Output of the program STAVEK-02 after the input of three sentences representing a short dialog: Na violini je počila struna. Kaj je bil vzrok? Kobilica je bila poškodovana. (Engl.: The string on the violin broke. What was the cause? The bridge was damaged.) The word types are nouns (sam.), verbs (gl.), adverbs (prislov) and adjectives (pridevnik); the letters m, f, os, e, and d stand for masculine, feminine, person, singular, and dual, respectively; the numbers mark the cases. Part B shows the ten chains of nodes (synsets) of words and meanings from the PMSB network as used for the distance matrix D and the dendrogram calculations (see Figure 3). Ntop is the top node of the PMSB hierarchy of meanings. In the actual output of the program STAVEK-02, the synsets assigned to the words of a sentence are printed immediately after one of the three main punctuation marks (full stop, question mark, or exclamation mark) is encountered.

Part A
===================================
1 Na; na; predlog, (4,5);
2 violini; violina; sam., f(d/4)(e/5);
3 je; biti; pom.gl., (3os./f/e) pretekli čas
4 počila; počiti; gl., nedov. nepov. (3os./f/e) pretekli čas
5 struna; struna; sam., f(e/1);
6 . ločilo, pika;
-----
1 Kaj; kaj; prislov, (vzročnostni);
2 je; biti; pom.gl., (3os./m/e) pretekli čas
3 bil; biti; gl., nedov. nepov. (3os./m/e) pretekli čas
4 vzrok; vzrok; sam., m(e/1)(e/4);
5 ? ločilo, vprašaj;
-----
1 Kobilica; kobilica; sam., f(e/1);
2 je; biti; pom.gl., (3os./f/e) pretekli čas
3 bila; biti; gl., nedov. nepov. (3os./f/e) pretekli čas
4 poškodovana; poškodovan; pridevnik, (m/d/1)(m/d/4)(f/e/1)
5 . ločilo, pika;
===================================

Part B
===================================
/001/01: violina (violin); strings; instrument (musical (specific)); instrument (musical); product (sound emitting); product (communication); product (material); product/creation; Noun; Ntop.
/002/01: struna (string); part of (musical instrument); instrument (musical); product (sound emitting); product (communication); product (material); product/creation; Noun; Ntop.
/002/02: struna (string); product (sound emitting); product (communication); product (material); product/creation; Noun; Ntop.
/003/01: vzrok (cause); factor; measure (specific); creation (measure/unit); creation (intellectual); product/creation; Noun; Ntop.
/004/01: kobilica (violin's bridge); part of (musical instrument); instrument (musical); product (sound emitting); product (communication); product (material); product/creation; Noun; Ntop.
/004/02: kobilica (keel); part of (vessel); part of (specific device); product (machine/device); product (general part); product (material); product/creation; Noun; Ntop.
/004/03: kobilica (locust); insect; insect (pterygota); insect (arthropoda); insect (general); arthropoda; non-vertebra; nature (animal taxonomy); nature (animal kingdom); nature; Noun; Ntop.
/004/04: kobilica (locust); insect; insect (pterygota); arthropoda; polimeria; animal (common name); nature (animal kingdom); nature; Noun; Ntop.
/004/05: kobilica (mare); horse (animal (general)); horse (animal); animal (domestic); animal (property); nature (animal kingdom); nature; Noun; Ntop.
/004/06: kobilica (mare); horse (animal (general)); horse (animal); odd-toed ungulate; mammal; vertebra; chordata; nature (animal taxonomy); nature (animal kingdom); nature; Noun; Ntop.
===================================

This particular example, using the word kobilica in two separate sentences, was chosen deliberately to show how the graph-theoretical distances (Figure 2 and Figure 3) obtained from the PMSB network can correctly determine the sense of a word. Similar to the English word bridge, which has several senses, the Slovenian word kobilica is coded by six synsets in PMSB. It has four main senses (locust, keel, mare, and the bridge on the violin), of which both animal senses have two synset paths each, showing the relevant taxonomies of both species (Figure 2, Part B).
There is not much to say about the tagging shown as Part A in Figure 2; however, the tagging of the second word, violini, as singular locative (e/5) is a good example of how the statistical approach ignores the possibility that the word has the same form in the dual accusative (d/4) (for example: Pozabil sem na violini, Engl. I forgot about the two violins). STAVEK-02 tags both possibilities, (d/4) and (e/5).

Additionally, rule-based tagging is considerably faster than statistical tagging based on pre-tagged corpora. The public Slovenian parser [18] can tag on average 8 sentences per second, while the parser of the system STAVEK-02 manages to tag 400 sentences per second, even though it additionally searches for all noun and verb meanings through the database of close to 110,000 synsets; this makes the rule-based parser almost two orders of magnitude faster than the public one.

Part B shows all the synset paths for the nouns in the sentences. In the actual output the synset paths for verbs are also given. In the print option, the paths are listed after each sentence. Each chain is a sequence of labels of the nodes (synsets) encountered during the walk between the word and Ntop. The search algorithm finds all possible walks from any encountered noun or verb to Ntop. The reader can verify this part of the search engine in real time on-line at the link given in [20]. Mostly, the labels are organized in a self-explanatory manner, using a structure of keywords in which each keyword is itself a cluster label with a link to the particular cluster in the network. For example, the node labeled property (human) contains words each of which marks a property of a human (intelligence, beauty, greed, innocence, etc.). On the other hand, the words in the cluster with the same two keywords but ordered differently, i.e., human (properties), describe a human being with a particular property; genius and liar, for example, are in the synsets human (property (intelligence)) and human (property (bad)), respectively. Additionally, both words, human and property, are labels of other clusters. The cluster property, for example, contains 5,964 nouns with 14 sub-clusters named property (keyword_i), i = 1, ..., 14. Each of these clusters, property (animal), property (human), property (number), ..., property (object), again contains cluster descriptors with keywords. Take, for example, the sub-cluster property (object): property (object (color)), property (object (form)), property (object (price)). In the end, each keyword represents a cluster with a smaller set of words.

Figure 3 shows the topological distance matrix D of the 45 distances between the ten meanings. All the distances reflect the relations between the similarities of the meanings of the words concerned quite reasonably.
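For readers who want to reproduce a dendrogram of the kind shown in Figure 3 from such a distance matrix, the following sketch applies standard agglomerative clustering with the Ward strategy named in Figure 3, using SciPy (assumed to be available); this is not the clustering code built into STAVEK-02.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

labels = ["violina /001/01", "struna /002/01", "struna /002/02", "vzrok /003/01",
          "kobilica /004/01", "kobilica /004/02", "kobilica /004/03",
          "kobilica /004/04", "kobilica /004/05", "kobilica /004/06"]
# Symmetric distance matrix copied from Figure 3.
D = np.array([
    [ 0,  6,  6, 13,  6, 12, 19, 19, 18, 21],
    [ 6,  0,  4, 12,  3, 11, 18, 18, 17, 20],
    [ 6,  4,  0, 10,  5,  9, 16, 16, 15, 18],
    [13, 12, 10,  0, 12, 12, 17, 17, 16, 19],
    [ 6,  3,  5, 12,  0, 10, 17, 17, 16, 19],
    [12, 11,  9, 12, 10,  0, 17, 17, 16, 19],
    [19, 18, 16, 17, 17, 17,  0,  3, 15, 16],
    [19, 18, 16, 17, 17, 17,  3,  0, 15, 18],
    [18, 17, 15, 16, 16, 16, 15, 15,  0,  3],
    [21, 20, 18, 19, 19, 19, 16, 18,  3,  0],
], dtype=float)

# Note: Ward linkage formally assumes Euclidean distances; it is used here only
# to mirror the grouping strategy named in Figure 3.
Z = linkage(squareform(D), method="ward")
tree = dendrogram(Z, labels=labels, no_plot=True)
print(tree["ivl"])   # leaf order: the material-product and animal sense groups emerge
```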
The two main groups in the dendrogram are the upper one, representing material products (violin, string, bridge of the violin), and the lower one, representing the locust and the mare, each with two descriptions. Between the two groups lies the word vzrok (cause), representing the concept of non-material products. In the group of material objects, the string /002/01 (part of the violin) and kobilica /004/01 (part of the violin) are joined at the lowest level. This pair then joins the second meaning of the string /002/02 as a sound-emitting device, and the three join the fourth sense, violin, combining all four into the reasonable synset musical instrument. As said above, the last four meanings represent the animal synsets (living beings): horse (domestic animal), horse (taxonomy), locust (insect), and locust (taxonomy). For this group of four meanings there are no counterpart meanings in the rest of the three considered sentences; hence, one can safely assume that these four meanings of the word kobilica do not apply in this context. It is interesting to see that the remaining two items, kobilica /004/02 (keel, as a part of a vessel) and vzrok (cause), fit well between the two larger groups. The senses keel and violin are linked together relatively high in the dendrogram because both are material objects; however, the level of the link between the concept cause and the material object keel shows that there is still a lot of room for improvement in the procedure for distance evaluation.

These results help us to argue that, as much as the meanings of a single word are important, the distances between words are important as well. This in turn requires two things: first, each word should be represented in a unique and uniform way based on various kinds of properties, and second, the words should be organized in a system that allows the definition of a metric.

6 Conclusion

The discussed example and the hierarchical network of words PMSB present only a very simple and small part of the general solution that could be accomplished by the use of an exhaustive and therefore much more complex network of word meanings. Neither the presented network nor the presented model for extracting broader information from text is a final product; many improvements can still be implemented. Although the present network links together slightly more than 60,000 words (nouns and verbs) forming about 110,000 meanings (synsets) of various sizes, it is not the number of words that is the limiting factor, but rather factors like the absolute number of synsets (clusters of words with different features), the number of links to which each synset is connected, and, last but not least, the ability of the algorithms for distance calculation to reflect the actual distinctions between the meanings of words. These are the issues that should be of first concern. One should add not only more clusters representing a larger variety and number of properties, features, and/or meanings, but also clusters of words pointing to rare, dangerous, or by any other criterion extreme features of the words they represent, for example synsets containing words like non-poisonous plants, extremely hard or non-combustible materials, etc.

The constant updating and enhancement of networks of meaning requires much more man-power and/or machine-supported feature selection effort for the addition of new groups than has been spent on the present variations of WordNets for a variety of languages. However, for each specific language the native speakers are responsible for the growth and complexity of their specific meaning networks, and no automatic procedure could completely replace their manual work and decisions. The presented PMSB network of meanings was put together by hand, which required approximately eight man-years to reach its present size. Some critics fear that such knowledge bases have an arbitrary structure, because the meanings of words are subjective and no objective criteria exist for how to link or cluster words according to their meanings. The described example has shown the potential of such a network to help in understanding the context of communication.
As a matter of fact it is true, that such a hierarchy of meanings will always be subjective, but so is human mind. 7 Acknowledgement The author wishes to thank National Institute of Chemistry for providing him with the facilities to work at the Institute as a research emeritus. 8 References [1] Zupan, Jure; Koncept mrežnega pomenskega slovarja slovenskih besed, Jezik in slovstvo, 54, (3- 4), 2009, pp. 139-151. [2] Miller, George A, WordNet: A Lexical Database for English. Communications of the ACM. 1995, Vol. 38 (11), 39-41. [3] Fellbaum, Christiane; WordNet: An Electronic Lexical Database, Editor, 1998, Cambridge, MA: MIT Press. [4] Visuword™, On-line graphical dictionary and thesaurus, https://visuwords [5] Towards a Universal Multilingual WorldNet - D5: Databases and Information Systems, Max-Planck- Institut für Informatik; Mpi-inf.mpg.de; 2011-08- 14. [6] Vossen, Piek, EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Editor, 1998, Kluwer, Dordrecht, The Netherlands. [7] Maziarz M., Szpakowicz S., Piasecki M., Semantic Relations among Adjectives in Polish WordNet 2.0: A New Relation Set, Discussion and Evaluation, Cognitive Studies / Études Cognitives, t. 12, s. 149– 179, 2012. [8] Koeva, S., G. Totkov and A. Genov. Towards Bulgarian WordNet. Romanian Journal of Information Science and Technology, Vol. 7, No. 1- 2, 45-61, 2004. [9] Fišer, Darja, Novak, Jernej. Visualizing sloWNet. Proceedings of the conference on Electronic lexicography in the 21st century: New applications for new users (eLEX2011). Bled, Slovenia, 9-12 November 2011. [10] Vidovič Muha, Ada, Slovensko leksikalno pomenoslovje. Ljubljana: Znanstveni inštitut Filozofske fakultete, 2000: [11] Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S.; Dean, Jeff; Distributed representa-tions of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013. [12] Levin, Beth; English Verb Classes and Alternations, The University of Chicago Press, Chicago, 1993. [13] F. Dorenseiff, der deutsche Wortschatz nach Sach- gruppen, 8. Edition, Ed. U. Quasthoff, W. de Gruyter, Berlin, 2004. [14] Zupan, Jure; Problemi in nekaj rešitev računalniških obdelav slovenskih besedil, Slav. revija, 47 (3), 1999, 277-296. [15] Zupan, Jure; Hierarhična mreža slovenskih glagolov, v Obdobja 30, Interdisciplinarity in Slovene Studies, Filozofska Fakulteta, Ljubljana 2011, pp. 551-557. [16] Zupan, Jure; Lajovic, Andrej; PMSG – Network of Slovenian verbs, web address: http://pmsg.zrc- sazu.si. [17] Zupan, Jure; Pomenska mreža slovenskih glagolov, Založba ZRC SAZU, 2013, pp. 31-51, [18] Oblikoslovni označevalnik za slovenski jezik, Amebis, d.o.o. Kamnik, Inštitut Jožef Stefan, Univerza v Ljubljani, ZRC SAZU, Trojina, Zavod za uporabno slovenistiko, 2008-2013, konzorcij projekta Sporazumevanje v slovenskem jeziku: link to the network: http//www. oznacevalnik.slovenscina.eu [19] Slovar Slovenskega knjižnega jezika (SSKJ), Bajec, Anton, et al., Eds., Državna založba Slovenije, DZS, Ljubljana, 1995. [20] J. Zupan, A. Lajovic; PMSB, Pomenska mreža slovesnkih besed, link to the network of meanings of Slovenian words: http://mreza.andrej.ad-vega.si. Graph Theoretical View on Text Understanding Informatica 42 (2018) 85–93 93 Distance matrix between ten meanings of four words. The distances are the numbers of nodes (synsets) between two meanings in the network PSMB evaluated according to the procedure and equation /1/. 
                                               1    2    3    4    5    6    7    8    9   10
 1  violina (violin) /001/01                   0    6    6   13    6   12   19   19   18   21
 2  struna (string, violin's part) /002/01          0    4   12    3   11   18   18   17   20
 3  struna (string, sound emitter) /002/02               0   10    5    9   16   16   15   18
 4  vzrok (cause) /003/01                                     0   12   12   17   17   16   19
 5  kobilica (violin's part) /004/01                               0   10   17   17   16   19
 6  kobilica (keel) /004/02                                             0   17   17   16   19
 7  kobilica (locust) /004/03                                                0    3   15   16
 8  kobilica (locust-taxonomy) /004/04                                            0   15   18
 9  kobilica (mare) /004/05                                                            0    3
10  kobilica (horse-taxonomy) /004/06                                                       0

[Ward-method dendrogram, scaled as D(link)·100/D(max): struna /002/01 and kobilica /004/01 join first and are then joined by struna /002/02 and violina /001/01; vzrok /003/01 and kobilica /004/02 attach to this material-object group at a higher level; kobilica /004/03 pairs with /004/04 and kobilica /004/05 with /004/06, and this animal group merges with the rest only at the top of the tree.]

Figure 3. The distance matrix D between ten different senses of four words (violin, string, cause and kobilica). The word string has two meanings: a) part of the violin and b) sound-emitting device. The word kobilica has four meanings and six synset paths from the meanings to the top of the network (see Figure 1). The distances between individual meanings are calculated using the procedure and equation /1/. The dendrograms based on the distance matrix D can be output optionally after any number of tagged sentences, provided there are no more than 500 nouns or verbs.

A Segmentation-Recognition Approach with a Fuzzy-Artificial Immune System for Unconstrained Handwritten Connected Digits
Hocine Merabti
LabSTIC Laboratory, 8 May 1945 University, BP-401, Guelma, 24000, Algeria
E-mail: merabti.dr@gmail.com
Brahim Farou and Hamid Seridi
LabSTIC Laboratory, Computer Science Department, 8 May 1945 University, BP-401, Guelma, 24000, Algeria
E-mail: farou@ymail.com, seridihamid@yahoo.fr
Keywords: pattern recognition, optical character recognition, handwritten digit recognition, handwritten numeral string segmentation, artificial immune system (AIS), fuzzy logic
Received: October 24, 2017
In this paper, we propose an off-line system for the segmentation and recognition of unconstrained handwritten connected digits. The proposed system provides new segmentation paths by finding two types of structural features. The background and foreground feature points are found from the input string image. The possible cutting paths are generated from these feature points. Each candidate component is evaluated individually based on its feature points and its height. The output of the segmentation module is evaluated using the fuzzy-artificial immune system (Fuzzy-AIS). The latter performs a decision function on the resulting segments, and then the hypothesis that has the best score is regarded as the global decision. The experimental results on the well-known handwritten digit database NIST SD19 show the effectiveness of the proposed system compared with other methods in both segmentation and recognition.
Povzetek: Razvit je sistem za segmentiranje in prepoznavanje ročno pisanih števk.
1 Introduction
Handwritten numeral string recognition has become a very open research area since its introduction into a wide range of application areas such as indexing and automatic processing of documents, automatic processing of bank checks, and automatic location of addresses and postal codes [1].
The aim of these applications is to reduce the ma- nual effort involved in these tasks. Handwriting recognition can be divided, according to the nature of the input, into two categories: on-line and off-line [2]. In the on-line case, the handwriting is produced by a pen or a mouse on an electronic surface and acquired as a time-dependent signal. In the off-line case, the handwriting is scanned on paper. Due to the variation in writing styles and the presence of overlapping and touching characters, the off-line recognition presents a good deal of challenging problems. For building such off-line recognition system, the first step is the acquisition of the numeral string image followed by pre-processing operations on this image. Afterward, each numeral string is segmented into individual isolated digits. Finally, these digit images are sent to the classifier which assigns the corresponding class [3, 4, 5]. The seg- mentation of a string into isolated digits becomes one of the important challenges of handwritten recognition systems. Indeed, a very good recognition system can be practically useless when text identification and segmentation are per- formed poorly [6]. The segmentation problems are mainly related to several factors. First, the slope of the images or the noise intro- duced by the scanner. The variability in writing style and the inking defects caused by scripters. The variability and complexity of the character string shapes illustrated in the overlapping or the joining of two consecutive digits. Se- cond, we do not know the number of characters in the string, and consequently, the optimal boundary between them is unknown [6]. To overcome these problems, many proposed solutions combine the segmentation and recognition processes. Per- forming a correct segmentation of an image involves kno- wing what it contains. On the other hand, if the recognition of the content of an image is correct, it means that the sy- stem has all the necessary information for the segmentation process. The segmentation process can be divided into two clas- ses: segmentation-then recognition and recognition-based [7] (Fig. 1). In the first class, the segmentation module tries to separate the connected characters by building a segmen- tation path. The latter contains a unique sequence hypot- hesis, and each subsequence should contain a single cha- racter to be submitted for recognition [8, 9]. In the second class, the process provides a set of segmentation hypothe- ses and defines the segmented digits by performing recog- 96 Informatica 42 (2018) 95–106 H. Merabti et al. nition of each provided segmentation hypothesis [10, 11]. This kind of approach gives good results because it pro- vides several hypotheses that increase the classifier choice to find the correct recognition [6]. The segmentation can also be either explicit or implicit, as seen in Fig. 1. In the explicit methods, the segmentation is carried out prior to the recognition to provide candidate digits for the classi- fier [12, 13]. However, in the implicit methods, the seg- mentation is embedded in the recognition process, and it is performed simultaneously with recognition [14, 15]. Se- veral works have proposed segmentation algorithms based on these two methods in the last few years. The literature has also shown that implicit segmentation offers very in- teresting perspectives, but explicit segmentation achieves better results [6, 16]. 
Explicit Segmentation Explicit Segmentation Handwritten Digit Segmentation Segmentation-then Recognition Recognition-based Implicit Segmentation Figure 1: Segmentation and recognition of digits string methods (adapted from [6]). Usually, segmentation can be conducted by the exami- nation of the following three cases: connected digits, over- lapped digits or distinct digits (as shown in Fig. 2). From these problems and in most instances, the connected and the overlapped digits are the most frequent situations obser- ved in handwriting. Also, many algorithms have been pro- posed to deal with these situations [6, 17]. Some of them are based on features extracted from background pixels in the image [18], and others on features extracted from fo- reground pixels in the image [12]. Recently, several al- gorithms have used a combination of both these features [19, 20]. (a) (b) (c) Figure 2: Main difficult examples; (a) connection (the 3 and the 5), (b) overlapping (the first and the second 5), (c) disjunction (in the 5). Practically, to build a robust system for segmentation and recognition of connected handwritten digits, it is necessary to find: in the first, new methods to select or reduce the number of segmentation points which optimize the number of resulting segmentation hypotheses. In the second, new methods to eliminate the unnecessary segmentation paths which decrease the rejection rate. Finally, using the accu- rate classifiers on this type of data for keeping or increasing the recognition performance. In this paper, we propose a new segmentation- recognition approach for handwritten numerical strings. Our work is focused on segmentation and recognition of connected digits, which present the main problematic in the segmentation through: selecting new features points for segmentation, evaluating the segmentation hypotheses to get more precise candidate segments, and using a good classifier for recognition. In segmentation process, we pro- vide segmentation paths for separating the touched digits. This process is based on combining features from the back- ground and foreground of the image. These features are used as segmentation points in the image. The fuzzy- artificial immune system (Fuzzy-AIS) is used for selecting the best segmentation hypotheses and properly classifying the separated digits. It also eliminates the arbitrary assign- ments in the decision phase when the dataset is overlapped, or the characteristics of objects are almost similar. The paper is organized as follows. Section 2 presents a description of the proposed method. Section 3 is devoted to the experimental results. Finally, Section 4 concludes the paper. 2 Description of the proposed method Our system consists of several stages: pre-processing, seg- mentation, feature extraction, and classification. The pre- processing module aims to remove the noise in the con- nected digits images and to simplify their further proces- sing. The segmentation module allows providing the best set of candidate cutting paths for the input image and seg- menting them into isolated individual digit images. The fe- ature extraction module extracts some statistical and struc- tural features from each digit image and represents them in a feature vector. Finally, the resulted features vectors are sent to the Fuzzy-AIS, and the corresponding class labels are assigned. An overview of the proposed system is shown in Fig. 3. 
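For readers who prefer code to diagrams, the flow of Fig. 3 can be summarised as the short sketch below. All callables (preprocess, segment_string, extract_features, fuzzy_ais_classify) are hypothetical placeholders for the modules described in Sections 2.1-2.4; the sketch only fixes how their outputs are chained, not how they are implemented.

```python
def recognize_numeral_string(image,
                             preprocess,
                             segment_string,
                             extract_features,
                             fuzzy_ais_classify):
    """Sketch of the overall flow of Fig. 3 (all callables are placeholders)."""
    clean = preprocess(image)            # smoothing, binarization, dilation, erosion
    hypotheses = segment_string(clean)   # one list of candidate digit images per cutting hypothesis
    scored = []
    for candidate_digits in hypotheses:
        vectors = [extract_features(d) for d in candidate_digits]
        labels, decision_score = fuzzy_ais_classify(vectors)  # per-digit labels + averaged decision function
        scored.append((decision_score, labels))
    # the hypothesis with the best global score gives the final string
    best_score, best_labels = max(scored, key=lambda t: t[0])
    return "".join(str(label) for label in best_labels), best_score
```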
2.1 Pre-processing The pre-processing module is applied to the strings image to eliminate or reduce the noise and to simplify the further processing. This module includes smoothing, binarization, dilation, and erosion. – The smoothing is used to reduce the noise in the image. – The binarization converts the image to a black and white. – The dilation and erosion aim to close the disjoint ed- ges and to smooth the global edges of the image. A Segmentation-Recognition Approach with a Fuzzy-Artificial. . . Informatica 42 (2018) 95–106 97 Fe at u re s E xt ra ct io n Pre-processed Image Connected Components Extraction Possible Segmentation Paths Construction Segmentation Evaluation Structural and Statistical Features Extraction Classification and Verification Final Decision G lo b al D e ci si o n R e co gn it io n Se gm e n ta ti o n Figure 3: Flow diagram of the system. Figure 4 shows a sample of the used database before and after the pre-processing stage. Input image Output image Pre-processing Figure 4: A sample before and after the pre-processing mo- dule. 2.2 Segmentation The aim of the segmentation module is to segment the input string image into isolated digit images by provi- ding the best set of candidate cutting paths. This mo- dule consists of three main steps: connected-components- extraction, touching connected-components-identification, and cutting paths constructing and evaluation of connected components (CCs) as seen in Fig. 5. In the first step, the input image is separated into CCs. The second one allows detecting if a CC contains touched components (TC) or not by checking the following equa- Connected- Components- Extraction Touching Connected- Components- Identification Cutting Paths Construction and Evaluation Pre-processed Image Candidate Components Figure 5: Block diagram of the segmentation module. tion: TC = { 1, if WCC > α∗H100 0, Otherwise (1) where, WCC is the width of a CC, H is the height of the numeral string image, and α is a predefined parameter set in our case to 75. In the case where the CC does not contain TC, then this CC is very likely to be a piece of a broken digit or a sin- gle digit. Figure 6 shows the extraction and identification process. CC3 W𝑐𝑐2 W𝑐𝑐3 CC1 CC2 H W𝑐𝑐1 Figure 6: Extraction and identification of connected com- ponents; CC2 with (WCC2) is higher than α% of the height (H) of the numeral string image, and need further segmen- tation. CC1 and CC3 with (WCC1) and (WCC3) respecti- vely, do not require any further segmentation. The final step allows providing the optimum position for cutting a CC and extracting the correct candidate compo- nents. This step involves analyzing the foreground and background features of the CC to generate the segmenta- tion points, followed by the generation of the possible cut- 98 Informatica 42 (2018) 95–106 H. Merabti et al. ting paths. The evaluation process is used to optimize the resulting segmentation paths and get more accurate results. In the following, we explain in detail, how to construct and evaluate the cutting paths for a CC. 2.2.1 Generating segmentation points a. Profile features The method of finding the profile features for a CC is as follows: – Find the vertical upper and lower projection profiles of the CC, as seen in Fig. 7(b) and (c). – Extract the upper and lower skeletons of these profiles, which are less and higher than the middle height (H) of CC (see Fig. 7(d) and (e)). – Extract the end points (PFs) that have just one black neighbor pixel from the skeletons. 
The first and the last end points in each skeleton will not be considered (see Fig. 7(f)). (a) (b) (c) (e) (d) (f) PFs Figure 7: Profile features extraction; (a) Original image, (b) Upper projection profile, (c) Lower projection profile, (d) Upper skeleton profile, (e) Lower skeleton profile, (f) End points of the skeletons (PFs) (denoted by a red circle). b. Skeleton and edge features The following steps show how to find the skeleton and edge features: – Extract the skeleton of the CC. – Extract the intersection points (SFs) which have more than two black neighbor pixels from the skeleton (Fig. 8(b)). – Extract the outer edge (upper/lower) from the CC, and add it to the skeleton image (Fig. 8(c)). – Calculate the distance between the intersection point and the upper edge image, and select the points (EFs) that have the minimum value (Fig. 8(c)). – Calculate the distance between the intersection point and the lower edge image, and select the points (EFs) that have the minimum value (Fig. 8(d)). These feature points (SFs and EFs) show the proper lo- cation of segmentation regions. (a) (b) (c) (d) EFs EFs SFs Figure 8: Skeleton (SFs) and Edge (EFs) features ex- traction; (a) Original image, (b) Skeleton of the CC with intersection points, (c) Upper edge of the CC superimpo- sed on the skeleton image, (d) Edge points (denoted by a red circle). 2.2.2 Generating segmentation paths All the feature points of the touching digits are found in the previous step. Now, the segmentation path can be genera- ted from these points using two ways: from top to bottom, and from bottom to top. These feature points are connected together to construct the possible segmentation paths (Fig. 9). The two points P1 and P2 are connected according to the following equation: | xP1 − xP2 |≤ µ ∗ (WCC/2) (2) where, xP1 and xP2 are the horizontal coordinates of P1 and P2 respectively, µ is a constant parameter set empiri- cally to 0.6, and WCC is the horizontal width of the con- nected component. A Segmentation-Recognition Approach with a Fuzzy-Artificial. . . Informatica 42 (2018) 95–106 99 X WCC W a y o f co n str u c tio n P2 P1 xp1 xp2 Figure 9: Construction of the segmentation path from fea- ture points (from top to bottom). The proposed method scans all possible relationships be- tween PF, EF, and SF and generates the related segmenta- tion paths according to the equation (2). Therefore, three hypotheses can be considered for the optimal segmentation path: • Hypothesis 1: If the distance between the projection of the PF and the EF verifies the equation (2), then con- structing a vertical segmentation path between these points (Fig. 10(a)). • Hypothesis 2: If there is a skeleton path rather than one SF that linking both upper and lower EFs, this skeleton path is used as part of the vertical segmentation path (Fig. 10(b)). • Hypothesis 3: If the CC does not contain SFs, the verti- cal segmentation path is constructed between PFs and the closest points PFs (Fig. 10(c)). During the segmentation process, the segmentation paths may produce outliers: over-segmented parts (out-of-class) or under-segmented parts (non-digit patterns). The re- sulting segments that contain at least one outlier digit must be rejected using the evaluation of segmentation process. 2.2.3 Evaluation of the segmentation After finding all possible segmentation paths, each one di- vides a CC into two new candidate connected components. 
At this stage, each candidate path is evaluated individually by using two constraints to evaluate our segmentation met- hod and to get more precise results. The first constraint is related to the features points, while the second one is rela- ted to the height: • Constraint one: if a candidate component is inside two possible segmentation paths with the same start and end points, then this candidate component is rejected (Fig. 11(a) and (b)). •Constraint two: If the higher of a candidate component is lower than 20% of the height (H) of the image, then this candidate component is rejected (Fig. 11(c) and (d)). Each segmentation hypothesis divides a CC into two or more new CCs. Now, all the new segments are normalized into a matrix of size 78 × 64 for preserving their aspect ratio. From each normalized segment, we extract a set of characteristics and represent them as a feature vector. The latter is introduced into the Fuzzy-AIS for the classifica- tion. 2.3 Features extraction In this work, we extracted 39 statistical and structural fe- atures from the character. These features are based on Hu moments, zoning features, transitions histograms, and end and crossing points. • Hu moments: seven invariant moments of Hu are com- puted from normalized and centralized moments up to or- der three of the segment [21]. They are invariant to transla- tion, scaling, and rotation. • Zoning features: this technique allows dividing the seg- ment into several zones (a grid ofN×M ), where the featu- res are extracted from each zone [22]. We take the skeleton of each normalized segment, and we divide it into 3×2 zo- nes. For each zone, we extract the density zoning and the gravity center. The density zoning represents the ratio of the number of black pixels on the total size of a zone [23]. The two coordinates of gravity center are used [24]. •Transitions histograms: this technique counts the num- ber of transitions from foreground to Background in spe- cified direction (horizontal, vertical and both diagonals 45◦/135◦). We extract the mean, the variance, and the max from each histogram. • End and crossing points: the end point is a point that has just one black neighbor pixel. A crossing point con- nects three or more branches. After extracting features and representing them in fea- ture vectors, the resulting features vectors are sent to the Fuzzy-AIS for assigning the corresponding class labels. 2.4 Fuzzy-AIS for recognition and verification An artificial immune system (AIS) is an adaptive system inspired by the principles and functioning of the natural immune system [25]. They are classes of algorithms that have properties and abilities very useful for pattern recog- nition, especially the classification problem [26, 27]. In our case, we coupled one of the best-known classification algorithms based on artificial immune systems, called the Artificial Immune Recognition System (AIRS) [28], with the Fuzzy-KNN approach. The principle of AIRS algorithm is as follows: for a gi- ven training set of samples from a data class of interest (an- tigens); the AIRS returns a set of memory antibodies which are used to recognize this class. It is also characterized by: – Self-regulation: the ability of adaptation and learning, 100 Informatica 42 (2018) 95–106 H. Merabti et al. (a) (c) (b) Figure 10: Hypotheses of the segmentation path; (a) Hypothesis 1, (b) Hypothesis 2, (c) Hypothesis 3. 
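The numeric tests quoted so far can be collected into one small sketch. It restates the touching test of equation (1) with α = 75, the point-linking test of equation (2) with µ = 0.6, and the evaluation constraints of Section 2.2.3; the geometric construction of the actual cutting paths is left out, so the helper functions below are illustrative rather than the authors' implementation.

```python
ALPHA = 75    # threshold parameter of equation (1)
MU = 0.6      # threshold parameter of equation (2)

def is_touching_component(w_cc, h_string):
    """Equation (1): a connected component is flagged as touching (TC = 1)
    when its width exceeds alpha% of the numeral-string image height."""
    return w_cc > ALPHA * h_string / 100.0

def can_link(p1, p2, w_cc):
    """Equation (2): two feature points, given as (x, y) tuples, may be joined
    into a segmentation path when their horizontal distance is small enough."""
    return abs(p1[0] - p2[0]) <= MU * (w_cc / 2.0)

def keep_candidate(candidate_height, h_string, path_endpoints, seen_endpoint_pairs):
    """Evaluation constraints of Section 2.2.3 (simplified):
    - reject a candidate enclosed by two segmentation paths sharing the same
      start and end points (approximated here by a repeated endpoint pair),
    - reject a candidate whose height is below 20% of the string-image height."""
    if path_endpoints in seen_endpoint_pairs:
        return False
    if candidate_height < 0.20 * h_string:
        return False
    seen_endpoint_pairs.add(path_endpoints)
    return True
```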
(c) (d) (a) (b) Figure 11: Effect of the evaluation method; (a) and (c) Cases of segmentation before evaluation, (b) and (d) Cases of segmentation after evaluation. – Competitive performance: their results can be classi- fied among the best works in the classification field, – Generalization via data reduction: it allows reducing the database on a few training samples, – Parameter stability: their parameter tuning on diffe- rent data. For more detail about this algorithm, the reader is refer- red to [28, 29]. The similarity measure is one of the most significant de- sign choices in the development of an artificial immune sy- stem algorithm, and more precisely in their decision phase. The decision in most artificial immune systems algorithms is provided with the K-Nearest Neighbor approach. The latter has not the ability to correctly assign an object to a particular class when it belongs to other classes with the same value of similarity measure. The decision will be random in the case when the data- set is overlapped, or the characteristics of the objects are almost similar. To overcome these limitations, the fuzzy concept is introduced in the decision phase, and it lies in the Fuzzy-KNN approach. It ensures that the arbitrary as- signments are not made [30]. The Fuzzy-KNN approach finds the k Nearest Neighbors of the candidate component. Each candidate component D belongs to a class i with a membership value mvi(D). The latter depends on the class of its k Neighbors, and it is given by: mvi(D) = k∑ j=1 mvij ( 1 d(D,xj) 2 (m−1) ) k∑ j=1 ( 1 d(D,xj) 2 (m−1) ) (3) A Segmentation-Recognition Approach with a Fuzzy-Artificial. . . Informatica 42 (2018) 95–106 101 MA 1 0.52 MA 2 0.79 MA 3 0.1 Features vectors Segmentation Path 1 Segmentation Path 2 Segmentation Path 3 Max 0.79 Final Decision 3 Normalization CC DF11 DF12 DF13 DF21 DF22 DF31 DF32 2 Figure 12: Segmentation followed by the Fuzzy-AIS as a recognition and verification strategy. where, mvij is the membership in the ith class of the jth vector of the training set, d(D,xj) is the distance between D and its jth nearest neighbor xj . The parameter m deter- mines how heavily the distance is weighted when calcula- ting the class membership. In this stage, for each candidate component D, the clas- sifier gives a membership to every class and assigns to it the class which has the highest membership valuemv. For this reason, the Fuzzy-AIS classifier allows performing a set of decision functions (DF ) on the segments of CC according to the following equation: DF = { B ∗mv, if mv < 0.5 mv, Otherwise (4) where, B is a predefined parameter set empirically to 0.75. Afterward, the classifier calculates the average (MA) of DFs provided by each hypothesis. Finally, the maximum of these averages is regarded as the final decision function of the classifier, as seen in Fig. 12. 3 Experimental results To evaluate the proposed method, we perform our experi- ments on the standard database NIST SD19, which con- tains unconstrained handwritten numeral strings with va- rious lengths [31]. Our experiments were performed on two stages. In the first, the digit classifier was trained with isolated digit samples. Secondly, the digit classifier was applied to numeral string recognition. 3.1 Isolated handwritten digit recognition In this stage, we divided the used database into two sets: a set of 2000 isolated digits used for the Fuzzy-AIS lear- ning, and a set of 1500 isolated digits used for testing. 
The first stage of the Fuzzy-AIS learning consists in performing several tests to initialize the parameters: clonal rate, hyper clonal rate, hypermutation rate, mutation rate and affinity threshold scalar. These parameters are necessary for calculating the number of clones, the ARB resources, and the mutation function. The parameter selection of our classifier is shown in Table 1.

Table 1: Parameter selection for the Fuzzy-AIS.
  Fuzzy-AIS parameter         Value
  Clonal rate                 10
  Hyper clonal rate           4
  Mutation rate               0.1
  Hypermutation rate          15
  Affinity threshold scalar   0.01

After the training process, we obtained a recognition rate of 98.70% on the testing set. The main target of this work is to evaluate the performance of foreground and background features with the Fuzzy-AIS. Indeed, we are not trying to train the classifier with non-digits, to optimize its accuracy, or to compare the result with other works. In the next stage, we will discuss these issues and compare the performance of our system with other works.

3.2 Handwritten numeral string recognition

Our experiments were performed in two phases. In the first, we examined the performance of our segmentation module without using classification information. In the second, the segmentation is integrated with the recognition process to construct a segmentation-recognition system.

• In the first phase, we performed some experiments on 3000 string images of the NIST SD19 database to evaluate our segmentation module. All images contain touching pairs of digits, but the module does not know the length of the string. Figure 13 shows some of the results of our segmentation module and Table 2 illustrates its performance.

Table 2: Performance of handwritten digit-pair segmentation with our method on 3000 images of the NIST SD19 database.
  Case of segmentation path       Visualization (%)
  Correct segmentation path       95.86
  Errors                          1.77
  Rejection                       2.37
  Exactly one segmentation path   87.3

As shown in Table 2, after the segmentation module we made a visual analysis and verified that in 95.86% of cases the best segmentation path is among the paths generated by the module. In this case, the module does not know the length of the input digit string, and some images produce more than one cutting path (see Fig. 13(a) and (b)). In 1.77% of cases, the correct segmentation path is not among the produced paths, so we consider these cases as errors (see Fig. 13(e) and (f)). In 2.37% of cases, no segmentation path is produced for the image; we consider these cases as rejected images (see Fig. 13(g) and (h)). The error and rejection cases are related to overlapping connected digits. Among the 95.86% of correct segmentation paths, 87.3% have only one segmentation path (see Fig. 13(c) and (d)). A comparison of this result with several segmentation algorithms proposed in the literature in the last few years is shown in Table 3.

Table 3: Performance comparison of several works on touching pairs of digits. Rb: recognition-based, StR: segmentation-then recognition.
  Approach       2-digit strings  Result (%)
  StR [32]       2000             88.70
  Rb [33]        1000             93.77
  Rb [9]         3287             94.8
  Rb [34]        2069             95.84
  Our approach   3000             95.86

Table 3 summarizes a set of segmentation algorithms, gives the number of samples used for testing, and shows their accuracy on touching pairs of digits. As shown in Table 3, our approach gives good segmentation results on pairs of digits compared with other works.
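The string-level experiments of the second phase rely on the fuzzy decision stage of Section 2.4; before turning to them, the sketch below restates equations (3) and (4) and the hypothesis averaging of Fig. 12 in code. The neighbour list, the distance values, and the class memberships are placeholders; B = 0.75 follows the value quoted in the text, k and m are left to the caller, and the small epsilon is only a numerical guard that is not part of equation (3).

```python
def fuzzy_knn_membership(neighbors, num_classes, m=2.0):
    """Equation (3): class memberships of a candidate D from its k nearest
    neighbors. `neighbors` is a list of (distance, membership_vector) pairs,
    where membership_vector[i] is that neighbor's membership in class i."""
    weights = [1.0 / (dist ** (2.0 / (m - 1.0)) + 1e-12) for dist, _ in neighbors]
    total = sum(weights)
    return [
        sum(w * mv[i] for w, (_, mv) in zip(weights, neighbors)) / total
        for i in range(num_classes)
    ]

def decision_function(memberships, b=0.75):
    """Equation (4): keep the winning membership mv, but scale it by B
    when it is below 0.5."""
    mv = max(memberships)
    return b * mv if mv < 0.5 else mv

def global_decision(hypotheses_dfs):
    """Average the decision functions of each segmentation hypothesis (MA)
    and keep the hypothesis with the highest average, as in Fig. 12."""
    averages = [sum(dfs) / len(dfs) for dfs in hypotheses_dfs]
    best = max(range(len(averages)), key=lambda i: averages[i])
    return best, averages[best]
```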
• In the second phase of our experiments, the recogni- tion module is introduced. It is based on the Fuzzy-AIS approach. We used 2000 images as training samples from the NIST SD19 Database, 200 images per class. For the testing stage, we randomly selected 1500 images. For each string length from 2, 3, 4, 5, 6, and 10, we took 250 images. To determine the performances of the proposed approach, we tested the influence and effectiveness of both: the eva- luation method in the segmentation module, and the fuzzy concept in the recognition module. The performance re- sults of our segmentation-recognition system are shown in Table 4. Table 4 summarizes the recognition rates of our system on numeral strings recognition of lengths 2, 3, 4, 5, 6, and 10 digits. The results in Table 4 show that the use of the evaluation method in the segmentation module improves the performance of the proposed system (Fig. 11). This improvement is visible in both classifiers (AIS and Fuzzy- AIS). The system segment and recognize 96.55% of string samples with the use of AIS classifier, and 96.79% with the use of Fuzzy-AIS classifier. From these results, we notice that the introduction of the evaluation method increased the recognition rate by 11.6% in the case of AIS and 11.4% in the case of Fuzzy-AIS. However, the changeover from the AIS to the Fuzzy-AIS gave a slight improvement by 0.24% in the recognition rate. This is due to the efficiency of seg- mentation method. To discuss and compare the effective- A Segmentation-Recognition Approach with a Fuzzy-Artificial. . . Informatica 42 (2018) 95–106 103 (c) (d) (f) (e) (a) (b) (h) (g) Figure 13: Some results of the segmentation module; (a) and (b) Case of correct segmentation, (c) and (d) Case of exactly segmentation, (e) and (f) Case of error, (g) and (h) Case of rejection. ness of the proposed approach, we compare our results with others recent approaches on the same database (Table 5). The results in Table 5 indicate that our system is promi- sing and compare favorably with the other works. 4 Conclusion In this paper, we proposed a new system to recog- nize unconstrained handwritten digit strings. We used a segmentation-recognition strategy for handwritten con- nected digits based on structural features and the Fuzzy- artificial immune system. First, we combined the back- ground and foreground analysis for extracting the feature points. For the background features, we applied a thinning procedure to the vertical projection profile of the image. For the foreground features, we applied a thinning proce- dure on the connected component and their edge. These feature points are linked to generate the possible segmen- tation paths in connecting digits. The resulted candidate segmentation paths are evaluated for removing the useless among them and keeping the best. The evaluation process is based on two main constraints. The first one is related to the features points of the candidate segmentation paths and the second one is related to its height. Finally, we in- troduced the Fuzzy-AIS classifier for ranking all possible segmentation paths and considering the best of them as the global decision. The introduction of both the evaluation process in the segmentation module and the fuzzy concept in the decision phase allowed increasing the recognition rate. Our experiments on the NIST SD19 database show that our system gets good results in both segmentation and re- cognition and compare favorably with other works in the same database. References [1] Gayathri, P. 
and Ayyappan, S. (2014) ‘Off-line hand- written character recognition using Hidden Markov Model’, in Proceeding of the International Confe- rence on Advances in Computing, Communications and Informatics (ICACCI), IEEE, pp.518–523. [2] Lacerda, E. B. and Mello, C. A.(2013) ‘Segmen- tation of connected handwritten digits using Self- Organizing Maps’, Expert Systems with Applications, Vol. 40, no. 15, pp.5867–5877. [3] Saba, T., Rehman, A. and Elarbi-Boudihir, M. (2014) ‘Methods and strategies on off-line cursive touched characters segmentation: a directional review’, Artifi- cial Intelligence Review, Vol. 42, pp.1047–1066. 104 Informatica 42 (2018) 95–106 H. Merabti et al. String Length Recognition Rate (%) Without Evaluation Method With Evaluation Method AIS Fuzzy-AIS AIS Fuzzy-AIS 2 84.33 85.66 97.33 98.00 3 88.44 88.88 97.11 97.33 4 84.83 85.33 96.33 96.66 5 84.00 84.26 96.00 96.13 6 80.00 80.11 95.22 95.33 10 88.12 88.12 97.33 97.33 Average rates 84.95 85.39 96.55 96.79 Table 4: Experimental results of our segmentation-recognition approach. String Recognition Rate (%) Length Approaches [16] [35] [13] [34] Our approach 2 96.88 94.8 98.94 98.57 98.00 3 95.38 91.6 97.23 96.28 97.33 4 93.38 91.3 96.16 96.12 96.66 5 92.40 88.3 95.86 94.73 96.13 6 93.12 89.1 96.10 95.02 95.33 10 90.24 86.9 94.25 90.46 97.33 Average 93.57 90.33 96.42 95.63 96.79 rates Table 5: A comparison with others works. [4] El Kessab, B., Daoui, C., Bouikhalene, B. and Sa- louan, R. (2014) ‘A Comparative Study between the K-Nearest Neighbours and the Multi-Layer Percep- tron for Cursive Handwritten Arabic Numerals Re- cognition’, International Journal of Computer Appli- cations (0975–8887), Vol. 107, No. 21. [5] El Kessab, B., Daoui, C., Bouikhalene, B. and Sa- louan, R. (2015) ‘A comparative study between the support vectors machines and the k-nearest neighbors in the handwritten latin numerals Recognition’, Inter- national Journal of Signal Processing, Image Proces- sing and Pattern Recognition, Vol. 8, No. 2, pp.325– 336. [6] Ribas, F. C., Oliveira, L. S., Britto Jr, A. S. and Sa- bourin, R. (2013) ‘Handwritten digit segmentation: a comparative study’, International Journal on Docu- ment Analysis and Recognition (IJDAR), Vol. 16, no. 2, pp.127–137. [7] Casey, R. G. and Lecolinet, E. (1996) ‘A survey of methods and strategies in character segmentation’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, no. 7, pp.690–706. [8] Shi, Z. and Govindaraju, V. (1997) ‘Segmentation and recognition of connected handwritten numeral strings’, Pattern Recognition, Vol. 30, no. 9, pp.1501– 1504. [9] Yu, D. and Yan, H. (2001) ‘Separation of touching handwritten multi-numeral strings based on morpho- logical structural features’, Pattern Recognition, Vol. 34, no. 3, pp.587–599. [10] Gattal, A. and Chibani, Y. (2015) ‘SVM-Based Segmentation-Verification of Handwritten Con- nected Digits Using the Oriented Sliding Win- dow’,International Journal of Computational Intelligence and Applications, Vol. 14, no. 1, pp.1550005. [11] Fujisawa, H., Nakano, Y. and Kurino, K. (1992) ‘Seg- mentation methods for character recognition: from segmentation to document structure analysis’, Pro- ceedings of the IEEE, Vol. 80, no. 7, pp.1079–1092. [12] Pal, U., Belaid, A. and Choisy, Ch. (2003) ‘Touching numeral segmentation using water reservoir concept’, Pattern Recognition Letters, Vol. 24, no. 1, pp.261– 272. A Segmentation-Recognition Approach with a Fuzzy-Artificial. . . Informatica 42 (2018) 95–106 105 [13] Sadri, J., Suen, C.Y. 
and Bui, T.D. (2007) ‘A genetic framework using contextual knowledge for segmenta- tion and recognition of handwritten numeral strings’, Pattern Recognition, Vol. 40, no. 3, pp.898–919. [14] Procter, S., Illingworth, J., and Elms, A.J. (1998) ‘The recognition of handwritten digit strings of unknown length using hidden markov models’, In Proceedings of the 14th International Conference on Pattern Re- cognition, pp.1515–1517. [15] Choi, S. M. and Oh, I. S. (1999) ‘A segmentation- free recognition of two touching numerals using neu- ral networks’, In Proceedings of the 5th International Conference on Document Analysis and Recognition (ICDAR’99), IEEE, Bangalore, India, pp.253–256. [16] Oliveira, L. S., Sabourin, R., Bortolozzi, F. and Suen, C. Y. (2002) ‘automatic recognition of handwritten numerical strings: a recognition and verification stra- tegy’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, no. 11, pp.1438–1454. [17] Kulkarni, R. V. and Vasambekar, P. N. (2010) ‘An overview of segmentation techniques for handwritten connected digits’, In Proceedings of the International Conference on Signal and Image Processing (ICSIP), IEEE, pp.479–482. [18] Ayat, N.E., Cheriet, M. and Suen, C.Y. (2000) ‘Un sy- stme neuro-flou pour la reconnaissance de montants numriques de chques arabes’, In Colloque internati- onal francophone sur l’crit et le document, pp.171– 180. [19] Oliveira, L. S., Lethelier, E., Bortolozzi, F. and Sa- bourin, R. (2000) ‘A new approach to segment hand- written digits’, In Proceedings of the 7th Internatio- nal Workshop on Frontiers in Handwriting Recogni- tion, Amsterdam, The Netherlands, pp.577–582. [20] Sadri, J., Suen, C. Y. and Bui, T. D. (2004) ‘Automa- tic segmentation of unconstrained handwritten nume- ral strings’, In Proceedings of the 9th International Workshop on Frontiers in Handwriting Recognition ( IWFHR-9), IEEE, Tokyo, Japan, pp.317–322. [21] Cash, G. L. and Hatamian, M. (1987) ‘Optical charac- ter recognition by the method of moments’, Computer Vision, Graphics and Image Processing, Vol. 39, no. 3, pp.291–310. [22] Hirabara, L. Y., Aires, S. B., Freitas, C. O., Britto Jr, A. S. and Sabourin, R. (2011) ‘Dynamic zoning se- lection for handwritten character recognition’, In Pro- gress in Pattern Recognition, Image Analysis, Com- puter Vision, and Applications,Pucn, Chile, pp.507– 514. [23] Parker, J.R. (1993) Practical computer vision using C, John Wiley and Sons, Inc., New York. [24] Gorgevik, D. and Cakmakov, D. (2004) ‘An efficient three-stage classifier for handwritten digit recogni- tion. In Pattern Recognition’, In Proceedings of the 17th International Conference on Pattern Recogni- tion (ICPR’04), IEEE, pp.507–510. [25] Timmis, J., Andrews, P.S., Owens, N. and Clark, E. (2008) ‘An interdisciplinary perspective on artificial immune systems’, Evolutionary Intelligence, Vol. 1, no. 1, pp.5–26. [26] De Castro, L. N. and Timmis, J. (2002) ‘Artificial Im- mune Systems: A Novel Paradigm to Pattern Recog- nition’, Artificial Neural Networks in Pattern Recog- nition, Vol. 1 , pp.67–84. [27] Yang, Y. (2011) ‘Application of artificial immune Sy- stem in handwritten Russian Uppercase character re- cognition’, In Proceedings of the International Con- ference on Computer Science and Service System (CSSS), IEEE, pp.238–241. [28] Watkins, A., Timmis, J. and Boggess, L. (2004) ‘Artificial Immune Recognition System (AIRS): An Immune-Inspired Supervised Learning Algorithm’, Genetic Programming and Evolvable Machines, Vol. 5 , no. 3, pp.291–317. 
[29] Watkins, A. and Boggess, L. (2002) ‘A New Classifier Based on Resource Limited Artificial Immune Sys- tems’, In Proceedings of the 2002 Congress on Evo- lutionary Computation CEC’O2, Part of the World Congress on Computational Intelligence., IEEE, Ho- nolulu, HI, USA, pp.1546–1551. [30] Keller, J.M., Gray, M.R. and Givens, J.A. (1985) ‘A Fuzzy K-Nearest Neighbor Algorithm’, IEEE Tran- sactions on Systems, Man and Cybernetics, Vol. SMC-15, no. 4, pp.580–585. [31] Grother, P. J. (1995) ‘NIST Special Database 19; Handprinted Forms and Characters Database’, Nati- onal Institute of Standards and Technology (NIST). [32] Suwa, M. and Naoi, S. (2004) ‘Segmentation of Handwritten Numerals by Graph Representation’, In Proceedings of the 9th International Workshop on Frontiers in Handwriting Recognition, IEEE, Tokyo, Japan, pp.334–339. [33] Ciresan, D. (2008) ‘Avoiding segmentation in multi- digit numeral string recognition by combining sin- gle and two-digit classifiers trained without negative examples’, In Proceedings of the 10th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC’08), Timisoara, Ro- mania, pp.225–230. [34] Cavalin, P. R. (2006) ‘An implicit segmentation-based method for recognition of handwritten strings of cha- racters’, In Proceedings of the 2006 ACM Symposium 106 Informatica 42 (2018) 95–106 H. Merabti et al. on Applied Computing, ACM, Dijon, France, pp.836– 840. [35] Britto Jr, A. D. S., Sabourin, R., Bortolozzi, F. and Suen, C.Y. (2003) ‘The recognition of handwritten numeral strings using a two-stage HMM-based met- hod’, International Journal on Document Analysis and Recognition, Vol. 5, No. 2-3, pp.102–117. Informatica 42 (2018) 107–116 107 Load Balancing for Virtual Worlds by Splitting and Merging Spatial Regions Umar Farooq University of Science and Technology Bannu, Pakistan E-mail: umar@ustb.edu.pk John Glauert University of East Anglia Norwich, UK E-mail: j.glauert@uea.ac.uk Kashif Zia Sohar University, Oman E-mail: kzia@soharuni.edu.om Keywords: virtual world, OpenSimulator, spatial partitioning, under utilisation, over utilisation Received: July 7, 2017 The aggregation algorithm, an integral part of our dynamic infrastructure (using an expansion and a con- traction model) for managing scalable virtual worlds, was proposed in our previous work, to overcome the limitations of the current methods using static and hierarchical approaches. The basic aim was to get two contiguous spaces made of smaller regions while distributing the load as balanced as possible among two servers. This algorithm performs well for the perfect square shaped spaces but fails when it is applied to spaces of other shapes. The current merging algorithms also assign non-contiguous spaces to servers during the contraction phase. This is due to the unavailability of an explicit continuity check in both aggregation and merging algorithms. In this paper, we provide state-of-the-art in scaling virtual worlds and outline their limitations. It provides both theoretical arguments and simulation results that contiguous spaces have potential benefits. This work, then, extends both the aggregation and merging algorithms and incorporates an explicit continuity check to cope with the issues introduced by allowing non-contiguous spaces. It is demonstrated with the help of results from our prototype that the extended methods strictly achieves the theoretical goals of the proposed methods. 
Povzetek: Podan je pregled skalirnih metod v navideznih svetovih (VWs) in nov algoritem za razširjanje in krčenje podprostorov. 1 Introduction Virtual Worlds (VWs) are the most advanced Virtual Envi- ronments (VEs) that allow users to immerse into 3D shared spaces. They provide real or imaginary content and users in them are represented by digital characters called ava- tars [14]. VWs are interactive and collaborative environ- ments that have distinguishing features such as coherence and persistence. They are general purpose and social in nature [21, 24]. They have attracted huge attention of indi- viduals, businesses, and organisations of various domains such as entertainment, design, government, and research and development communities. They are becoming a major tool for collaborative activities [16, 1]. Second Life (SL) [24, 27] is state-of-the-art in commercial VW development frameworks and it imitates the physical world. It is extensi- vely used for content development by various communities such as business and entertainment industries. The research and development community has, however, shown more in- terest in OpenSimulator (OSm) [13, 20] - an open source alternative to SL. Scalability is the major issue to dealt with in VEs. Tradi- tionally, it is achieved by splitting the whole virtual space and assigning it to a set of dedicated servers for simula- ting it [19]. Game environments are easily scalable as they exploit the concept of sharding that allows the duplication of content [21]. However, the space in VWs, is distribu- ted using spatial partitioning. VWs do not allow duplica- tion of content as they have to maintain a unified coherent space [19, 21]. VWs are very complex as they integrate in them the challenges of many hard simulation problems such as large scale, real time computation and communica- tion using a simulation centric architecture developed for standard simulation environments [21]. Therefore, they are much restricted and are able to host only a limited number of players per Simulator (Sim) [15]. Static and dynamic methods are currently in practice to assign a virtual space to a given set of servers. While a system is up and running, a statically assigned space never changes and manual reconfigurations are required to incor- porate changes in current allocation. On the other hand, dynamic spatial partitioning allows re-assignments while 108 Informatica 42 (2018) 107–116 U. Farooq et al. the system is running. This process, however, is too ex- pensive as it involves transferring both content and play- ers. Dynamic techniques are usually categorised into flat and hierarchical mechanisms. Flat mechanisms use either a local, global, or an adaptive strategy for load distribu- tion [4]. Hierarchical approaches adopt a parent child hier- archy for managing resources. In our previous work, we developed a hybrid infrastructure comprises an expansion and a contraction model to cope with the issues in both static and dynamic mechanisms presented in section 2.1.1 and 2.1.2 [6, 8]. It proposed an aggregation and assign- ment algorithm [7, 9] for the expansion phase and merging algorithms for the contraction phase [6]. The major goal of both types of algorithms was to provide contiguous spaces for a Sim to host. In this paper, we present the critical analysis of some of the well-known static and dynamic methods currently in use for scalable VEs including our proposed framework. It provides justification for using the continuity in spaces as- signed to different Sims. 
It determines the limitations in both aggregation and merging algorithms and, then, extend them to overcome these limitations. Simple illustrations are used to show that the extended models successfully as- sign contiguous spaces and avoid non contiguous spaces. The rest of the paper is structured as follows. Section 2 provides the Literature review, background and motivation for this work. The justification for using the continuity con- straint in expansion and contraction phases is provided in section 3. The basic and extended versions of the expansion and contraction algorithms and their illustrations with ex- amples from our prototype are presented in section 4 and 5. Finally, section 6 concludes the paper and provides future directions. 2 Background and motivation 2.1 The Literature The mechanisms for scaling VEs found in the Literature can be categorised as static, dynamic, and hybrid in nature. These mechanisms are critically analysed in this section. 2.1.1 Static mechanisms The underlying infrastructures for SL and OSm called SL Grid (SLG) [24], and OSm Grid (OSmG) [20] extend the Butterfly Grid (BG) [17]. They use static assignment for an improved performance and avoid the expensive trans- ferring activities. SL architecture is much restricted and it allows a server (usually, a Simulator (Sim)) to host only up-to a maximum of four regions. OSm uses the extended architecture of SL proposed by the Linden Lab [24] and is, therefore, more open than SL. It allows a Sim to manage an arbitrary number of regions but the environment remains static. SL and OSm both lack dynamic adjustments and, therefore, introduce resource provisioning issues. Resour- ces in this arrangement are greatly misused. Resources in some cases, when no players are visiting the content assign to them, might remain under-utilised - this case is termed as over-provisioning. On the other hand, system capacity is restricted as no additional resources are available when more players are interested to join a space - this is termed as under-provisioning [29]. 2.1.2 Dynamic mechanisms To cope with the issues in static assignment methods, a number of dynamic strategies are developed that are bro- adly categorised as flat and hierarchical in nature. Load ba- lancing in mechanisms using flat orientation uses either a local, global or an adaptive strategy. Local strategies (such as the one adopted in [25]) are not scalable as each server is capable of sharing its load only with the neighbouring servers. They fail to scale when neighbouring servers are also overloaded. Global strategies (such as those used in [23, 28]) use complex procedures to re-distribute the wor- kload evenly on all the servers and thus degrade interactive user experience. They are not suitable for those systems that involve frequent re-adjustments. Adaptive strategies adopt the simplicity of the local but the scalability of the global strategies. They scale better than local strategy as a server extends sharing its load with the servers next to the neighbouring servers, in case the neighbouring servers are also overloaded. Further, they are less complex than global strategies [22]. VEs prefer using hierarchical approaches which are, ge- nerally more flexible and scalable than flat mechanisms, as flat mechanisms put extra burden of user migration on the system [26]. 
Hierarchical methods (such as those presented in [18, 3, 2, 5]), however, suffer from complexity, latencies, and poor performance as they places no restrictions of the size of content assigned to a server and the levels in a re- source management tree [6]. 2.1.3 Hybrid mechanism: state-of-the-art in scalable VWs In our previous work, we presented a dynamic scalable in- frastructure and introduced the concept of a hybrid grid in- frastructure for its implementation. When the load is nor- mal, this hybrid mechanism behaves like a static grid infra- structure in which each Sim is hosting its assigned space. As the load increases, it dynamically adds additional re- sources at lower levels to cope with increasing load. The basic aim was to overcome the limitations of existing sta- tic and dynamic mechanisms. The proposed mechanism achieves this using an expansion and a contraction model. The expansion phase includes the split, aggregation, and assignment methods. The contraction phase provides two variation for merging process. In this work, each server in start, handles almost a square shaped space and a regular square pattern is used to split the overloaded space. The number of players a server can po- tentially host is represented by SimCapacity. However, it Load Balancing for Virtual Worlds by. . . Informatica 42 (2018) 107–116 109 initiates a split operation based on a parameter called Split- Capacity. MergeCapacity parameter is used by a server to initiate a merge operation [8]. The Expansion Phase (Splitting) The Split Process: When a Sim gets overloaded, it divides its assigned space into an equal sized sub-regions (normally either 4, 9, or 16 onwards) that achieves regions whose den- sity is less than the SplitCapacity and thus eases the load but against a boundary condition. A region representing an un-partitioned but varied size of space is divided during a split operation if it is not the ultimate space that cannot be further partitioned. The Aggregation Process: uses an aggregation algorithm [9] to determine two aggregates of the smaller spaces com- prising an assigned space, provided as input by the Split Process. It aims to minimise resource utilisation, and com- munication and implementation cost. It tries to obtain ag- gregates with fair load by combining adjacent regions and avoiding the diagonal ones. It combines only those regions (even those in a diagonal) sharing physical boundaries with the regions already in an aggregate. The main objective is to obtain two contiguous areas for assignment to minimise the number of connections/disconnections between servers when players move between regions. The levels in the ma- nagement tree are minimised by placing all servers hand- ling regions obtained in a split as siblings. The aggregation algorithm takes input in the form of a tiled grid. Keeping its goals in mind, it takes any two consecutive corner regions to start aggregation with. It uses four aggregation strategies, namely, Row by Row (RR), Column by Column (CC), Row and Column in Turn (RCnT), and Row and Column in Turn with Di- agonal (RCnTwD) which guarantee examining the entire set of unique and valuable combinations. The Assignment Process: assigns one of the aggregates determined in aggregation step to an additional server. The current implementation transfers the aggregate with less re- gions and smaller number of players. 
Each server that is hosting an aggregate maintains the identity of smaller regi- ons which are, then, re-assigned at later stages based on an increase in load until each of them is handled by an indi- vidual server. The split process is repeated at this stage on smaller regions unless the boundary conditions are met. The Contraction Phase The contraction phase implements the merging process and it ensures that the resources are utilised as per the require- ments. Merging is triggered by a server when it notices a decrease in the number of players it manages. In current implementation of our work, a Sim is either a parent or a child. However, only a child Sim initiates a merge process. Contraction allows two merging strategies called, Parent Merge (PntMrg) strategy and Child Merge (ChMrg) stra- tegy. In PntMrg strategy, a child Sim initiates a merge ope- ration only if it can return its full load to the parent Sim. However, it is believed that the system potentially holds the resources for more time and it is not efficient in terms of resources. This issue is resolved in the ChMrg strategy. In ChMrg strategy, a child Sim relocates its full load to one of its siblings, if it is unable to integrate the load with the parent Sim. When a Sim capacity goes beyond MergeCa- pacity, then it checks for an appropriate Sim (the parent or a sibling based on the strategy being used) and the mer- ging is initiated if and only if, the cumulative load of both the Sims is less than or equal to the MergeCapacity. In case of a successful merge, the Sim who initiated the merge re- leases itself. The Implementation, Worth and Limitations of Hybrid Grid Infrastructure Non existence of a specialised framework for developing highly scalable VWs motivated us to develop the hybrid grid infrastructure. The main goals were to assign coherent contiguous spaces using a resource management tree with minimum additional levels for an improved communication and implementation cost while distributing the load as ba- lanced as possible. We used OSm framework for the imple- mentation of this work. Since, the basic architecture does not support dynamic capabilities, we extended the OSm ar- chitecture to support dynamic scalability [10]. We inves- tigated the basic capabilities for various activities involved in the expansion and contraction phases and extended some of the costly activities [11]. We, then, developed a working prototype of this infrastructure using OSm framework by utilising its basic and extended methods [6, 12]. It moves the players in a transferring region into a transit region du- ring the re-allocation process. Our hybrid infrastructure achieved improvements against both static and dynamic mechanisms in multiple dimensions described using a set of parameters including scale, resource utilisation, complexity, communication and implementation costs, and interactive user experience. When compared with static assignment method that assigns multiple regions to a Sim, the proposed method scale beyond the capacity of static assignment. However, it scales exactly up-to the same capacity as the static method in which each Sim hosts a single region. In both cases, resource utilisation is improved by starting a Sim with more regions and assigning additional resources purely on current workload. The proposed mechanism, therefore, solves over-provision and under-provision of resources. 
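As we read the contraction rules above, a child Sim whose load has dropped below MergeCapacity looks for a partner (only the parent under PntMrg; the parent and then a sibling under ChMrg) and merges only when the combined load stays within MergeCapacity. The fragment below restates exactly that reading; the Sim class and its fields are hypothetical bookkeeping, not part of the OSm-based prototype.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sim:
    name: str
    load: int                      # current number of players
    merge_capacity: int
    parent: Optional["Sim"] = None
    siblings: List["Sim"] = field(default_factory=list)

def try_merge(child: Sim, strategy: str = "ChMrg") -> Optional[Sim]:
    """Return the Sim that absorbs the child's load, or None if no merge fires.
    PntMrg: merge only with the parent; ChMrg: fall back to a sibling."""
    if child.load >= child.merge_capacity:
        return None                                   # merge is not triggered
    candidates = ([child.parent] if strategy == "PntMrg"
                  else [child.parent] + child.siblings)
    for target in candidates:
        if target is None:
            continue
        if child.load + target.load <= child.merge_capacity:
            target.load += child.load                 # relocate the full load
            child.load = 0                            # the initiating Sim releases itself
            return target
    return None
```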
By adopting a localised, decentralised approach and reducing the levels in the resource management tree, it greatly reduces complexity and communication issues. Since the players never go offline, it improves their interactive user experience. Various concepts of the OSm framework and the extended methods developed for the activities involved in the re-allocation process greatly reduced the implementation and transferring costs.

During the implementation of our work, we discovered that the aggregation algorithm determines two contiguous spaces when it is applied to a square-shaped space. However, it fails to obtain contiguous spaces when applied to spaces of other shapes. Similarly, the merging algorithms also permit a merge of non-contiguous spaces, a clear violation of the basic goals set earlier for our scalable infrastructure.

2.2 Motivation, goals and contribution

The hybrid grid infrastructure achieved improvements in the multiple aspects discussed above; however, the limitations of its current implementation greatly restrict its functionality. The desire to retain the benefits of the proposed infrastructure motivated us to extend its aggregation and merging algorithms. The main goal of this work is to enable these algorithms to produce contiguous spaces for any shape of space comprising various regions. It also aims to justify the use of continuity in assigning spaces to servers. This work reports the justification for contiguous spaces and the extended algorithms for aggregation and merging, followed by their illustration with results from our implementation.

Figure 1: Odd and their equivalent valid aggregates. (a) Aggregates based on diagonals for a 4-region world; (a′) Valid aggregates for Figure 1(a); (b) Aggregates based on a single diagonal for a 9-region world; (b′) Valid combinations for Figure 1(b); (c) Aggregates based on both diagonals for a 9-region world; (c′) Valid combinations for Figure 1(c).

3 Evaluating the continuity model

This section provides the justification for and the benefits of incorporating the continuity model into the basic aggregation and merging algorithms. It shows how odd, isolated cases introduce an extra burden in terms of communication, implementation, and user migrations. Three parameters are used for this evaluation: the total number of regional boundaries exposed to external regions; the total number of isolated regions managed by a single Sim; and the number of user crossings between different Sims. Three example odd cases (from a wide range of possible combinations), presented in Figure 1(a)-(c), are used for evaluation and comparison with the equivalent valid aggregates presented in Figure 1(a′)-(c′). The regions in one aggregate in Figure 1 are marked black, and those in the second aggregate white.

3.1 Theoretical evaluation

Current VWs treat each region as a completely isolated system and, therefore, introduce complex boundary crossings between regions regardless of the fact that they might be on a single Sim. The concept of mega-regions was introduced in OSm to obtain bigger spaces and reduce intra-Sim communication. It also helps in reducing the number of crossings between regions. However, the current mega-regions only integrate neighbouring, contiguous regions. It is difficult to take advantage of this useful feature of the OSm framework when isolated regions are allowed.
The inclusion of the continuity model in the aggregation algorithm thus allows us to benefit from mega-regions during implementation.

Two parameters are used to provide the theoretical justification: the number of isolated spaces managed by a Sim, and the number of boundaries in an aggregate exposed to the regions of the other aggregate. Table 1 provides the results for these parameters, where it can be seen that non-contiguous spaces normally produce a large number of isolated spaces. The inclusion of the continuity model, however, reduces them to exactly two contiguous spaces. The excluded cases greatly increase the implementation complexity, since several isolated areas must be managed, compared with the valid combinations. Similarly, communication and interaction in the valid combinations are significantly reduced compared with the odd cases. It can also be noted that when a system has more isolated regions, it generally has a larger number of regional boundaries exposed to players of the external regions. This implies that the players have more places, and hence more chances, to cross a Sim boundary to another Sim served by a different server. It also potentially increases communication among regions on the same Sim. The next section justifies this claim, using a simple simulation environment, in terms of players crossing the boundaries between different Sims. Overall, about a 50% decrease in the number of exposed boundaries is achieved by selecting valid combinations with the extended algorithm, as shown in Table 1.

Table 1: Comparison of isolated spaces and their exposed boundaries for both odd and valid aggregates.
Case 1, Odd combination (Figure 1(a)): 4 isolated spaces, 4 exposed boundaries
Case 1, Valid combination (Figure 1(a′)): 2 isolated spaces, 2 exposed boundaries
Case 2, Odd combination (Figure 1(b)): 5 isolated spaces, 8 exposed boundaries
Case 2, Valid combination (Figure 1(b′)): 2 isolated spaces, 4 exposed boundaries
Case 3, Odd combination (Figure 1(c)): 9 isolated spaces, 12 exposed boundaries
Case 3, Valid combination (Figure 1(c′)): 2 isolated spaces, 4 exposed boundaries

3.2 Simulation-based evaluation

A common measure when scaling a parallel and distributed system such as a VW is how much the distribution process increases the number of crossings between the servers in a given system.

3.2.1 Simulation environment

The console window is partitioned into regions based on the aggregates, and different colours are used to represent the valid and odd combinations, as shown in Figure 1. In each case, an odd combination and its corresponding valid combination are simulated for the same duration with one, five, and ten randomly distributed objects (representing players). At each step, each object selects a random move in one of the four directions, which moves a character in that direction from its current position. When it reaches the end of a row or a column, it jumps to the other end of the corresponding row or column. The objects continue following this simple mobility model until the simulation is stopped. A crossing is recorded for a player when it moves to a differently coloured region from its current region.
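The crossing counts reported in Table 2 come from this simple mobility model. Below is a minimal sketch of such a simulation; it is not the authors' console simulator: the one-cell-per-region grid, the colour labels and the function name are assumptions made here, and movement is simplified to one region per step.

```python
import random

def simulate_crossings(colour_grid, num_players, steps, seed=0):
    """Count how often players enter a differently coloured region.

    colour_grid[r][c] gives the aggregate colour of the region in row r,
    column c. Each player makes a random move in one of four directions
    per step and wraps around to the other end of the row or column.
    """
    rng = random.Random(seed)
    rows, cols = len(colour_grid), len(colour_grid[0])
    players = [(rng.randrange(rows), rng.randrange(cols)) for _ in range(num_players)]
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    crossings = 0
    for _ in range(steps):
        for i, (r, c) in enumerate(players):
            dr, dc = rng.choice(moves)
            nr, nc = (r + dr) % rows, (c + dc) % cols      # jump to the other end
            if colour_grid[nr][nc] != colour_grid[r][c]:
                crossings += 1                             # crossed into the other aggregate
            players[i] = (nr, nc)
    return crossings

# A 9-region world loosely modelled on the odd case of Figure 1(c),
# with 'B' and 'W' marking the two aggregates.
odd = [['B', 'W', 'B'],
       ['W', 'B', 'W'],
       ['B', 'W', 'B']]
print(simulate_crossings(odd, num_players=10, steps=100))
```

Running the same function on a grid where each aggregate forms one contiguous block shows the same qualitative gap as Table 2, although the exact counts naturally depend on the random seed and the simplified movement granularity.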
Table 2: Comparison of player crossings for both odd and valid aggregates.
Case 1, Odd combination (Figure 1(a)): 5 crossings (1 player), 24 (5 players), 46 (10 players)
Case 1, Valid combination (Figure 1(a′)): 2 crossings (1 player), 11 (5 players), 18 (10 players)
Case 2, Odd combination (Figure 1(b)): 9 crossings (1 player), 51 (5 players), 86 (10 players)
Case 2, Valid combination (Figure 1(b′)): 5 crossings (1 player), 21 (5 players), 46 (10 players)
Case 3, Odd combination (Figure 1(c)): 18 crossings (1 player), 78 (5 players), 138 (10 players)
Case 3, Valid combination (Figure 1(c′)): 7 crossings (1 player), 24 (5 players), 51 (10 players)

3.2.2 Evaluation results

Table 2 summarises the simulation results for both odd and valid combinations. It can be seen that in the first case the crossings for the odd combination are almost twice the number of crossings for the valid combination. Case 2 has a similar outcome; however, the crossings for the odd combination are slightly less than twice the number of crossings for the valid combination. This is due to the player distribution and the ratio between isolated spaces and exposed boundaries in the two combinations. Case 3 shows that when there are more isolated regions and exposed boundaries, there are more crossings: the crossings for the odd case are almost three times the crossings for the valid combination. Overall, the simulation results reveal that odd cases greatly increase the crossings between Sims, in addition to the implementation complexity and communication overhead. In the next sections, we provide detailed illustrations of the basic and extended algorithms.

4 The extended aggregation algorithm

4.1 Limitations of the basic algorithm

The basic aim of the aggregation algorithm was to aggregate smaller regions into larger contiguous spaces for assignment. It initially takes a regional grid of n×n dimensions as input, normally based on the split strategies of our scalable infrastructure. It repeatedly assigns different parts of the pre-processed space to additional Sims, and it therefore has to cope with spaces of varied shapes. In theory, the current aggregation algorithm should always yield valid combinations, but in practice it allows odd combinations for the non-square-shaped grids obtained after the first split and assignment applied to a square grid. During implementation, it failed to discard odd cases in the subsequent iterations. In other words, starting with a square grid, the first iteration determines valid contiguous spaces, but in later iterations, when it is applied to non-square-shaped worlds, it allows odd cases. Figure 2 illustrates these cases with the help of a simple square grid of nine regions (labelled A to I). The first iteration of the aggregation algorithm divides this grid into two aggregates (colours are used to differentiate the aggregates), with A and B in the first and the rest of the regions in the second aggregate (see Figure 2(a)). However, when it is applied to the second aggregate (a 7-region world) in the second iteration, it selects an aggregate comprising regions C and D, which are isolated from each other (an obvious odd case), as shown in Figure 2(b). This is because the basic algorithm only uses the SplitCapacity constraint and does not explicitly check the continuity constraint for a space comprising smaller sub-regions. An extension to the current algorithm is presented in the next section to overcome these issues.

4.2 The extended algorithm

In each step of the aggregation process, an additional step is added to make sure that both prospective aggregates constitute valid contiguous spaces. This additional step explicitly uses a flood fill algorithm to check continuity in the aggregated spaces.
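A minimal sketch of this continuity check is given below. An aggregate is treated as a set of grid cells, and a flood fill started from one cell must reach all of them; the cell-set representation and the function name are illustrative assumptions made here, not the code of the actual prototype.

```python
from collections import deque
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]   # (row, column) of a region in the tiled grid

def is_contiguous(aggregate: Iterable[Cell]) -> bool:
    """Return True if the aggregate forms a single contiguous space.

    The fill spreads over the four edge-sharing neighbours only, so regions
    touching just at a corner are treated as separate pieces.
    """
    cells: Set[Cell] = set(aggregate)
    if not cells:
        return True
    start = next(iter(cells))
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if n in cells and n not in seen:
                seen.add(n)
                queue.append(n)
    return len(seen) == len(cells)

# With the usual left-to-right, top-to-bottom labelling of Figure 2,
# regions C and D occupy cells (0, 2) and (1, 0): not a valid aggregate.
print(is_contiguous({(0, 2), (1, 0)}))   # False
print(is_contiguous({(0, 0), (0, 1)}))   # True (e.g. regions A and B)
```

A candidate pair of aggregates can then simply be rejected whenever either part fails this test, which is the behaviour of the extended algorithm described next.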
We use a flood fill that spreads in four directions, since a fill that spreads in eight directions would consider the diagonals, which are the major source of odd combinations. Flood fill algorithms are commonly used in the bucket-fill tools of paint programs, and they are employed in board games such as Go and Minesweeper [30].

Figure 2: Illustrating the limitations of the basic aggregation algorithm: (a) A valid outcome for a square grid of 9 regions; (b) An invalid outcome for a 7-region world.

In each step, when the possible aggregates have been determined by the aggregation algorithm, they are checked against the continuity constraint and rejected if any of them does not constitute a valid contiguous space. The extended algorithm is thus able to determine and exclude odd cases for any size and shape of a given space.

Figure 3: Expanding a 9-region world with the basic aggregation algorithm.

4.3 Illustration and comparison of the basic and extended algorithms

In this section, we illustrate the limitations of the basic aggregation algorithm and the value of the extended algorithm in eliminating them, using simple player distributions. These example illustrations use a SplitCapacity of 40 players and apply the aggregation strategies starting from the Bottom Left (BL) and Bottom Right (BR) corners of a 9-region world in grid form. This article illustrates only the limitations of the current algorithm; it does not demonstrate the aggregation strategies themselves, which are presented in detail in [7, 9]. Figure 3 illustrates odd cases allowed by the basic algorithm, whose equivalent valid combinations, obtained using the extended algorithm, are presented in Figure 4. The partial steps shown in these figures (showing expansion up to 4 child Sims) highlight the important points during the split and assignment processes. A Sim includes the number of players in each named region that it hosts, and the regions hosted by other Sims are crossed out with respect to this Sim.

Figure 4: Expanding a 9-region world with the extended aggregation algorithm.

Figure 3, step 1, shows that the parent Sim is initially hosting the whole space comprising nine regions. In step 2, the space is divided into two valid contiguous groups of regions, and the algorithm assigns the aggregate comprising regions A and B to child Sim C1. However, the remaining steps assign odd combinations: in step 3, the parent Sim transfers regions C and D to C2; similarly, in step 4, the parent Sim assigns an aggregate of regions E, F and G to child C3. Further, in step 5, child C3 assigns a valid combination to child C4 but itself retains a non-contiguous space. It is important to note that only the RR strategy from the first corner obtained the aggregates assigned in Figure 3.

Figure 4 illustrates the extended aggregation algorithm for exactly the same player distribution as used in Figure 3. It is obvious that the extended algorithm strictly allows only valid contiguous spaces. In step 2, the algorithm divides the space into two bigger spaces and assigns the aggregate comprising regions A and B to child C1, but only after verifying that the other aggregate is contiguous as well.
It can be noted in step 3, that the extended algorithm de- termines the non-contiguous aggregate comprises of C and D to be an odd case and it is skipped by the algorithm to be an acceptable aggregate. The RR strategy was unable to determine further aggregates for a 7-region world at this stage and the algorithm, therefore, applied the CC strategy, which determined a valid aggregate comprises of D and G, in step-3 and assigned it to C2. Step-4, rejected the assign- ment of aggregate comprises of region C and E but instead assigned a contiguous space made of E and H to child C3 determined using the CC strategy. Region C is then assig- ned during step 5 to child C4. Figure 5: Contracting a 9-region world with the ba- sic PntMrg strategy. 5 The extended merging algorithms 5.1 Limitations of the basic algorithms The current merging process provides two algorithms (im- plementing a PntMrg and a ChMrg strategy) that differ by merging preferences either with a parent or a sibling Sim. Both a child and the parent have the capability to determine if a merge operation to be initiated when they notice a de- crease in their current capacities but the merging process is always initiated by a child Sim in current implementation. Both the strategies use a MergeCapacity constraint to ini- tiate a merge. Despite the status of a Sim being parent or a child, it first determines, if the cumulative load of both the Sims to combine their load, is less than or equal to the MergeCapacity. On satisfying this condition, the child Sim assigns its complete load (both content and players) to the participating Sim and releases itself. Both strategies have a flaw (similar to the one for split discussed earlier) that they allow odd combinations while integrating the load which violate the basic goals of our work. The MergeCapacity value of 20 players is conside- red in this work, a much smaller value to avoid immediate splits. Figure 6: Contracting a 9-region world with the exten- ded PntMrg strategy. 5.2 The extended algorithms To avoid assigning non-contiguous spaces, this work also explicitly incorporate an additional step which determines 114 Informatica 42 (2018) 107–116 U. Farooq et al. that either a combined space of two Sims are constituting a contiguous space or not using a flood fill algorithm in addi- tion to the MergeCapacity constraint. A merge is only al- lowed, if it passes through the continuity check, otherwise, the merge is rejected. This model might use more than re- quired number of Sims for a little longer but it achieves the benefits of assigning contiguous spaces to different Sims. 5.3 Illustration and comparison of basic and extended algorithms 5.3.1 The Parent Merge (PntMrg) strategy Figure 5 illustrates the basic PntMrg procedure. No merge is permitted with the parent Sim in initial two steps, because the cumulative load in each case is more than the MergeCapacity. However, it is clear in step 2 that child Sims C2 and C4 satisfies the merge condition but it is not allowed in PntMrg strategy. ChildC2 integrates its load with the parent Sim during step 3. However, it can be seen that this merge results-in a space comprises of two isola- ted spaces. In step 4, no merge is allowed though a merge is possible among child Sims C3 and C4. The aggregated space after the integration of space maintained by C4 with the parent Sim in step 5 also gives two isolated sets of regi- ons. 
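Before comparing the two strategies further, the extended merge test of Section 5.2 can be summarised in a short sketch: a merge is allowed only when the cumulative load stays within MergeCapacity and the union of the two spaces passes the same four-way continuity check used during aggregation. The Sim representation, the function names and the inlined flood fill are assumptions made here for illustration, not the actual implementation.

```python
from collections import deque
from typing import Set, Tuple

Cell = Tuple[int, int]

def _contiguous(cells: Set[Cell]) -> bool:
    # Four-way flood fill, the same check as in the aggregation sketch above.
    if not cells:
        return True
    start = next(iter(cells))
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if n in cells and n not in seen:
                seen.add(n)
                queue.append(n)
    return len(seen) == len(cells)

def can_merge(cells_a: Set[Cell], load_a: int,
              cells_b: Set[Cell], load_b: int,
              merge_capacity: int) -> bool:
    """Extended merge-eligibility test: capacity constraint plus continuity.

    cells_a/load_a belong to the initiating child Sim, cells_b/load_b to the
    candidate partner (the parent in PntMrg, a sibling in ChMrg).
    """
    if load_a + load_b > merge_capacity:
        return False                          # capacity rule of the basic strategies
    return _contiguous(cells_a | cells_b)     # continuity rule added by this work

# A side-by-side pair may merge; a diagonal-only pair may not
# (using the MergeCapacity of 20 players adopted in the illustrations).
print(can_merge({(0, 0)}, 6, {(0, 1)}, 9, merge_capacity=20))   # True
print(can_merge({(0, 0)}, 6, {(1, 1)}, 9, merge_capacity=20))   # False
```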
The PntMrg strategy potentially holds more resources than required for a longer time compared with the ChMrg strategy, which tries to overcome this issue.

Figure 6 illustrates the extended PntMrg strategy, highlighting the avoidance of the odd cases allowed by the basic algorithm shown earlier in Figure 5. No merge was permitted during the initial three steps. The capacity constraint did not allow the merging of child Sims C1 and C4 with the parent Sim. The merge between child Sim C2 and the parent Sim at step 3 was rejected due to the continuity constraint. Child C4 returned its space to the parent Sim at step 4. Child Sims C3, C2 and C1 integrated their load with the parent Sim at steps 5, 6 and 8, respectively. Figure 6 shows that the extended algorithm keeps resources for more time than the basic algorithm illustrated in Figure 5.

5.3.2 The Child Merge (ChMrg) strategy

Figure 7: Contracting a 9-region world with the basic ChMrg strategy.

Figure 7 illustrates the basic ChMrg procedure. No merge was allowed in step 1, due to the MergeCapacity constraint. However, C4 was released after merging its load with C2 at step 2 (a case which was rejected by the PntMrg strategy), although this constitutes an obvious odd case. Steps 3 and 5 obtained valid combinations of space after merging (the first between the parent and C3, and the second between C1 and C2); nevertheless, this demonstrates that the basic ChMrg merging strategy allows odd combinations.

Figure 8: Contracting a 9-region world with the extended ChMrg strategy.

Figure 8 illustrates the extended ChMrg algorithm, which considers both the capacity and the continuity constraints before initiating a merge. It always yields contiguous spaces and, therefore, rejected a merge between child Sims C2 and C4. It improves over the PntMrg strategy in the sense that it merges more quickly by considering the child Sims, as in step 4, where two integrations happened, one between the parent and C4 and the other between C1 and C2. However, it potentially transfers the content and players multiple times, which might degrade the overall system performance.

5.3.3 Discussion

The merging strategies (both PntMrg and ChMrg) demonstrated in this work have their worth and their limitations. Both of them ultimately return the whole world back to the parent Sim. Normally, a merge operation is initiated when the player load is not high. The PntMrg strategy is simple but takes more time and holds resources for longer than the ChMrg strategy. The ChMrg strategy copes with the issues of the PntMrg strategy and releases resources much more quickly. However, the ChMrg strategy potentially transfers regions (both content and players) between Sims multiple times, which results in a poor experience for the users. We have demonstrated both strategies, and they can be adopted according to requirements. The basic strategies, however, were unable to avoid odd cases, whereas odd combinations are rejected by both extended strategies. To manage bigger worlds and the unpredictable nature of users, we suggest using the ChMrg strategy, as PntMrg might be blocked for longer. Both strategies have the potential to cope with resource under-utilisation issues. Further details on this are beyond the scope of this article; interested readers may consult our detailed work in [6].

6 Conclusion

In this article, we presented the extended aggregation and merging processes to cope with the limitations of the basic versions of these mechanisms.
It provided an overview of our scalable infrastructure comprises of splitting, merging and load distribution algorithms in comparison with other me- chanisms found in the Literature. It examined current and extended operations for both the aggregation and merging, and provided a justification for the continuity model in ad- dition to SplitCapacity and MergeCapacity in their corre- sponding operations. The extended operations have poten- tial of getting aggregation and merging robustly and they are illustrated with some simple examples from the results obtained from our prototype for scalable virtual worlds. References [1] Pekka Alahuhta, Emma Nordbck, Anu Sivunen, and Teemu Surakka. Fostering team creativity in virtual worlds. Journal For Virtual Worlds Research, 7(3), 2014. [2] Rajesh Krishna Balan, Maria Ebling, Paul Castro, and Archan Misra. Matrix: Adaptive Middleware for Dis- tributed Multiplayer Games. volume 3790/2005 of Lecture Notes in Computer Science, pages 390–400. Springer Berlin/Heidelberg, 2005. [3] A. M. Burlamaqui, M. A. M.S. Oliveira, A. M. G. Goncalves, G. Lemos, and J. C. De Oliveira. A Sca- lable Hierarchical Architecture for Large Scale Multi User Virtual Environments. In IEEE International Conference on Virtual Environment, Human Compu- ter Interfaces and Measurement Systems, pages 114– 119, 2006. [4] Luther Chan, James Yong, Jiaqiang Bai, Ben Leong, and Raymond Tan. Hydra: A Massively-Multiplayer Peer-to-Peer Architecture for the Game Developer. In Proceedings of the 6th ACM SIGCOMM workshop on Network and system support for games, NetGames ’07, pages 37–42, New York, NY, USA, 2007. ACM. [5] F. Chang, C.M. Bowman, and W. Feng. XPU: A Distributed Architecture for Metaverses. Technical report, Department of Computer Science, Portland State University, 2010. Technical Report 10-04. [6] Umar Farooq. The Design of a Contemporary Infra- structure for Scalable and Consistent Virtual Worlds. PhD thesis, School of Computing Sciences - Univer- sity of East Anglia, 2012. [7] Umar Farooq and John Glauert. ARA: An Aggregate Region Assignment Algorithm for Resource Minimi- sation and Load Distribution in Virtual Worlds. In NDT ’09: Proceedings of the first IEEE Internatio- nal Conference on Networked Digital Technologies, pages 404–410, 2009. [8] Umar Farooq and John Glauert. Joint Hierarchi- cal Nodes based User Management (JoHNUM) Infra- structure for the Development of Scalable and Consis- tent Virtual Worlds. In DS-RT ’09: Proceedings of the 13th IEEE/ACM Symposium on Distributed Simula- tion and Real-Time Applications, pages 105–112, Wa- shington, DC, USA, 2009. IEEE Computer Society. [9] Umar Farooq and John Glauert. A Dynamic Load Distribution Algorithm for Virtual Worlds. Journal of Digital Information Management, 8(3):181–189, June 2010. [10] Umar Farooq and John Glauert. Scalable Virtual Wor- lds: An Extension to the OpenSim Architecture. In ICCNIT ’11: Proceedings of the IEEE International Conference on Computer Networks and Information Technology, pages 29–34, 2011. [11] Umar Farooq and John Glauert. Faster dynamic spa- tial partitioning in opensimulator. Virtual Reality, 21(4):193–202, Nov 2017. [12] Umar Farooq and John Glauert. Integrating dynamic scalability into the opensimulator framework. Simu- lation Modelling Practice and Theory, 72(2017):118– 130, 2017. 116 Informatica 42 (2018) 107–116 U. Farooq et al. [13] Paul A. Fishwick. An introduction to opensimulator and virtual environment agent-based m&s appli- cations. 
In Winter Simulation Conference, WSC ’09, pages 177–183. Winter Simulation Conference, 2009. [14] R. M. Fujimoto, K.S. Perumalla, and G.F. Riley. Net- work Simulation. Synthesis lectures on communica- tion networks. Morgan & Claypool Publishers, 2007. [15] N. Gupta, A. Demers, J. Gehrke, P. Unterbrunner, and W. White. Scalability for Virtual Worlds. In Pro- ceedings of the 2009 IEEE International Conference on Data Engineering (ICDE ’09), pages 1311–1314, 2009. [16] Marko Hakonen and Petra Bosch Sijtsema. Virtual worlds enabling distributed collaboration. Journal For Virtual Worlds Research, 7(3), 2014. [17] IDC. Butterfly.net: Powering next generation gaming with on-demand computing. Technical report, IBM: An IDC e-Business Case Study, 2004. [18] Beob Kyun Kim and Kang Soo You. A Hierarchi- cal Map Partition Method in MMORPG based on Vir- tual Map. In Frontiers of High Performance Compu- ting and Networking - ISPA 2006 Workshops, volume 4331/2006 of Lecture Notes in Computer Science, pa- ges 813–822. Springer Berlin/Heidelberg, 2006. [19] Dan Lake, Mic Bowman, and Huaiyu Liu. Distribu- ted Scene Graph to Enable Thousands of Interacting Users in a Virtual Environment. In Proceedings of the 9th Annual Workshop on Network and Systems Support for Games, NetGames ’10, pages 19:1–19:6, Piscataway, NJ, USA, 2010. IEEE Press. [20] Charles J. Lesko and Yolanda A. Hollingsworth. Ar- chitecting scalable academic virtual world grids: A case utilizing opensimulator. Journal For Virtual Worlds Research, 6(1), 2013. [21] H. Liu and M. Bowman. Scale Virtual Worlds through Dynamic Load Balancing. In DS-RT ’10: Pro- ceedings of the 2010 14th IEEE/ACM International Symposium on Distributed Simulation and Real-Time Applications, pages 43–52, Washington, DC, USA, 2010. IEEE Computer Society. [22] Qingqi Long, Jie Lin, and Zhixun Sun. Agent sche- duling model for adaptive dynamic load balancing in agent-based distributed simulations. Simulation Modelling Practice and Theory, 19(4):1021 – 1034, 2011. ¡ce:title¿Sustainable Energy and Environmen- tal Protection SEEP2009¡/ce:title¿. [23] John C. S. Lui and M. F. Chan. An Efficient Partiti- oning Algorithm for Distributed Virtual Environment Systems. IEEE Transaction on Parallel and Distribu- tion Systems, 13(3):193–211, 2002. [24] Thomas M. Malaby. Making Virtual Worlds: Linden Lab and Second Life. Cornell University Press, It- haca, United States, first edition, June 2009. [25] Beatrice Ng, Antonio Si, Rynson W. H. Lau, and Fre- derick Li. A Multi-server Architecture for Distri- buted Virtual Walkthrough. In ACM Symposium on Virtual Reality Software and Technology, pages 163– 170. ACM New York, NY, USA, 2002. [26] K. Prasetya and Z. D. Wu. Performance Analysis of Game World Partitioning Methods for Multiplayer Mobile Gaming. In Proceedings of the 7th ACM SIG- COMM Workshop on Network and System Support for Games, NetGames ’08, pages 72–77, New York, NY, USA, 2008. ACM. [27] Michael Rymaszewski, Wagner James Au, Mark Wal- lace, Catherine Winters, Cory Ondrejka, Benjamin Batstone-Cunningham, and Philip Rosedale. Second Life: The Official Guide. Wiley Publishing, Hoboken, New Jersey, December 2006. [28] Shervin Shirmohammadi, Ihab Kazem, Dewan Tanvir Ahmed, Madeh El-Badaoui, and Jauvane C. De Oli- veira. A Visibility-Driven Approach for Zone Mana- gement in Simulations. Simulation, 84(5):215–229, 2008. [29] Matteo Varvello, Fabio Picconi, Christophe Diot, and Ernst Biersack. Is There Life in Second Life? 
In Proceedings of the 2008 ACM CoNEXT Conference, CoNEXT ’08, pages 1:1–1:12, New York, NY, USA, 2008. ACM. [30] Wiki. Flood fill algorithm. http://en. wikipedia.org/wiki/Flood_fill, 2016. Accessed: December, 2016. Informatica 42 (2018) 117–125 117 Microscopic Evaluation of Extended Car-following Model in Multi-lane Roads Hajar Lazar, Khadija Rhoulami and Moulay Driss Rahmani LRIT-CNRST (URAC No. 29) Faculty of Sciences, Mohammed V University in Rabat, Rabat 10000, Morocco E-mail: hajar.lazar@gmail.com Keywords: car following models, velocity separation difference model, lane changes model Received: February 4, 2017 This paper describes a micro-simulation model which combined car following with lane change model. For that, we proposed a new car-following model which is an extended of velocity-separation difference model (VSDM) by introducing a new optimal velocity function, named a modified velocity-separation difference model (MVSDM) which react better in braking case. The problems of collision in urgent braking case existing in the previous models were solved. Furthermore, the simulation results show that (MVSDM) can exactly describe the driver’s behavior under braking case, where no collision occurs. Povzetek: Članek opisuje mikrosimulacijski model avtonomne vožnje, ki kombinira sledenje avta z za- menjavo voznega pasu. 1 Introduction The accelerated growth of the urban population and the ex- tension of cities, the intensification of economic exchanges have made road traffic and its management one of the major challenges of sustainable development. Recently, there has been a strong focus on improving the efficiency and safety of transportation and this has led to the development of the Intelligent Transportation Systems (ITS) (1). Among the most notable urban transport problems: – Traffic congestion occurs when, at a specific point in time and in a specific section, there is an imbalance between transport demand and supply . – Environmental impacts includes the pollution and noise problems generated by circulation. – Accidents and safety problems due to growing traffic in urban areas with a growing number of accidents and fatalities. In this context, traffic flow modeling and simulation has be- come a famous area of research in recent years, and consti- tute efficient tools to evaluate different tasks such as traffic prediction, traffic control and forecasting, the repercussion of the construction of new infrastructure onto the global be- havior of the traffic flow. For studying the traffic problem, traffic flow are classified into two different types of ap- proaches, namely, macroscopic and microscopic ones (2). Macroscopic models describe traffic flow as a continuous fluid, which describe entities and their activities and inte- ractions at a relatively low level of detail and established relationships between speed, flow and density. In contrast, microscopic model attempts to model the motion of indi- vidual vehicles and their interaction at a high level of de- tail and describe the reaction of every driver (accelerating, braking, lane changing, etc) depending on the surrounding traffic. Microscopic models are better adapted to the des- cription of more punctual elements of the network, while macroscopic models are adapted to the representation of networks of large sizes. On the other hand, mesoscopic models characterized by the high level of aggregation, low level of detail, and typically based on a gas-kinetic ana- logy in which driver behavior is explicitly considered (3). 
Figure 1 presented the different simulation approaches of traffic flow. In this context, we are mainly interested with the microscopic approach which road traffic is modeled by individual motion of each vehicle. In this model, the speed of a vehicle is directly according to the distance that sepa- rates it from the leading vehicle, modulo a delay time. This delay time is generally assimilated to the reaction time of the driver in order to take into account the variations in be- havior of his leading vehicle. This is a car-following pro- cess also known as longitudinal driving behavior. The mo- deling of traffic in the broader sense proposes to describe more finely the flow of vehicles on a road. For that, it is necessary to understand two behavioral sub-models which are responsible for vehicle movement inside the network: Car Following (CF) and Lane Changing (LC) models. Car- following process were developed to model the manner in which individual vehicles follow one another in the same lane where the driver adjusts his or her acceleration accor- ding to the conditions in front and following each other on a single lane without any overtaking (2). The purpose of this paper is to propose a extended car-following model taking into account the effects of lane changing behavior. The work presented in this paper is devoted to overcome the shortcomings such as the unrealistic deceleration and the collisions in braking cases of many existing car-following models. However, we implemented the proposed approach using the open source simulator for traffic flow (4), in order 118 Informatica 42 (2018) 117–125 H. Lazar et al. Figure 1: Traffic flow approaches to improve the efficiency of a proposed approach compared with the existing ones. The paper is organized as follows. The state-of-the art of car-following and lane changing models will be introduced in Section 2. The proposed approach will be presented in Section 3. In section 4, the simulation results are carried out. At last, the conclusion is given in Section 5. 2 Related work 2.1 Car-following models The most widely known class of microscopic traffic flow models is so-called the family of car-following or follow- the-leader models. Car-following theories describe the way in which each vehicle follow another in the same lane. The most car-following models have a significant impact on the ability of traffic micro-simulations to replicate real-world traffic behavior (5). Various models were formulated to re- present how a driver reacts to the changes in the relative positions of the vehicle ahead. Figure 2 describes the vehi- cular traffic sketch. We denote as i the car whose behavior is currently under investigation, at instant t, such vehicle is at a position xi(t), and travels with a speed vi(t), that means its instantaneous acceleration can be expressed as ai(t). Index i−1 identify the front vehicles with respect to i , which are located at xi−1(t) and travel at speed vi−1(t) at time t. The front bumper to back bumper distance bet- ween i and i− 1 is identified as S(t) = 4xi = xi−1 − xi. Since the 1990s, car following models have not only been of great importance in an autonomous cruise control system, but also as important evaluation tools for intelli- gent transportation system strategies (6). 
The car-following models have been designed for single-lane roads, based essentially on the following ordinary differential equation Figure 2: Car following process notation (ODE): ai(t) = vi−1(t)− vi(t) T (1) This model is based on the idea that the acceleration ai(t) of the vehicle i at time t depends on the relative speed of the vehicle i and its leader i−1 by means of a certain re- laxation time T . However the previous equation describes a phenomenon is not stable enough in the case of road traf- fic. Hence the appearance of several variants of this model includes: – Safe-distance models or collision avoidance models try to describe simply the dynamics of the only vehi- cle in relation with his predecessor, so as to respect a certain safe distance. – Stimulus-response models based on the assumption that the driver of the following vehicle perceives and reacts appropriately to the spacing and the speed dif- ference between the following and the lead vehicles (7). – Optimal velocity models are another approach gene- rally based on the difference between the driver’s de- sired velocity and the current velocity of the vehicle as a stimulus for the driver’s actions. In this paper, we focused on optimal velocity models and we give here a state of art of the famous ones. For more detailed information with respect to microscopic models, particularly, car-following models can be found in the over- view of (5) (8) (9)(10)(11) (12)(13)(14)(15). The optimal velocity models attempt to modify the acceleration mecha- nism, such that a vehicle’s desired speed is selected on the basis of its space headway, instead of only considering the speed of the leading vehicle (16). The first model defined the optimal velocity function using an equilibrium relation for the desired speed as a function of its space headway is (17). The acceleration of Newell model is determined by the following equation: ai(t) = Vopt(xi−1(t)− xi(t)) (2) Microscopic Evaluation of Extended Car- following Model in. . . Informatica 42 (2018) 117–125 119 Bando et al. later improved this model, by introducing the notion of desired velocity, chosen as a function of re- lative spacing or headway (18). They distinguished two major types of theories for car-following regulations. The first type called follow-the leader theory which was used by (17), based on the idea that each vehicle must maintain the legal safe distance of the preceding vehicle, which depends on the relative velocity of these two successive vehicles. The other type for regulation is that each vehicle has the le- gal velocity, which depends on the following distance from the preceding vehicle. Based on the latter assumption, the authors (18) investigated the equation of traffic dynamics and found a realistic model of traffic flow, resulting in the following equation that describes a vehicle’s acceleration behavior: ai(t) = k ∗ [Vopt((S(t))− vi(t)] (3) In which Vopt(S(t)) is the optimal velocity function which depends on the headway S(t) to the car in the front. The stimulus here was a function of the relative spacing and the sensitivity k was a constant. The optimal velocity function, generally, must satisfy the following properties: it is a monotonically increasing function and it has an upper bound (maximal velocity). The optimal velocity adopted here calibrated by using actual measurement data proposed by (19) as follows : Vopt(S(t)) = V1 + V2 tanh[C1(S(t)− l)− C2] (4) With V1, V2, C1, C2 parameters calibrated and l is the length of the car. 
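Before turning to the later refinements, a small sketch may help to make the optimal velocity mechanism concrete: it evaluates the calibrated OV function of Eq. (4) and uses it in the acceleration law of Eq. (3) with a simple explicit Euler update. The time step, the sensitivity value and the integration scheme are assumptions made here for illustration, not the simulator used later in the paper.

```python
import math

def v_opt(spacing, v1=6.75, v2=7.91, c1=0.13, c2=1.57, length=6.0):
    """Optimal velocity function of Eq. (4): V1 + V2 * tanh(C1 * (S - l) - C2).

    The default V1, V2, C1, C2 values are the calibrated ones quoted in the
    simulation section; the length is the car length used there.
    """
    return v1 + v2 * math.tanh(c1 * (spacing - length) - c2)

def ovm_step(x_follower, v_follower, x_leader, k=0.6, dt=0.1):
    """One explicit Euler step of the acceleration law of Eq. (3)."""
    spacing = x_leader - x_follower                # S(t): position of the car ahead minus own
    accel = k * (v_opt(spacing) - v_follower)      # a_i(t) = k * [Vopt(S(t)) - v_i(t)]
    return x_follower + v_follower * dt, max(v_follower + accel * dt, 0.0)

# The desired speed grows with the spacing, and the follower relaxes towards it.
print(round(v_opt(10.0), 2), round(v_opt(40.0), 2))
print(ovm_step(x_follower=0.0, v_follower=5.0, x_leader=30.0))
```

Only the longitudinal update is sketched here; reaction delays and the lane-changing layer discussed later are deliberately left out.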
Unfortunately, the OV model produces problems of high acceleration and unrealistic deceleration, and it is not always free of collisions. For this reason, Helbing and Tilch proposed an extended model that considers the headway and the velocity of the following car as well as the relative velocity between the preceding and the following vehicle when the following vehicle is faster than the preceding one (19). To solve the OVM problems, they added a new term which represents the impact of a negative velocity difference, on the condition that the velocity of the front vehicle is lower than that of the follower. The GFM formula is:

ai(t) = k [Vopt(S(t)) − vi(t)] + λ H(−Ṡ(t)) Ṡ(t)  (5)

where H(·) is the Heaviside function, λ is another sensitivity coefficient, and Ṡ(t) = vi−1(t) − vi(t) is the velocity difference between the current vehicle and the vehicle ahead. The main drawback of the GFM is that it does not take the effect of a positive velocity difference on the traffic dynamics into account and only considers the case where the velocity of the following vehicle is larger than that of the leading vehicle (15). On the basis of the GFM, and taking the positive factor Ṡ(t) into account, the authors of (20) obtained in 2001 a more systematic model called the Full Velocity Difference Model (FVDM), whose dynamics equation is:

ai(t) = k [Vopt(S(t)) − vi(t)] + λ Ṡ(t)  (6)

In 2005, the authors of ref. (21) introduced a weighting factor which makes the OV model more reactive to braking. They extended the OVM by incorporating a new optimal velocity function obtained by combining the optimal velocity function of Eq. (4) with the weighting factor. The modified optimal velocity function is expressed as:

Vopt^new(S, Ṡ) = Vopt(S(t)) · W(S(t), Ṡ(t))  (7)

where the weighting factor is as follows:

W(S(t), Ṡ(t)) = 1/2 + (1/2) tanh(B(Ṡ(t)/S(t) + C))  (8)

in which B and C are calibrated parameters. The dynamic equation of the system is obtained as:

a(t) = κ (Vopt^new(S(t), Ṡ(t)) − vi(t))  (9)

In 2006, (6) conducted a detailed analysis of the FVDM and found that the second term on the right side of Eq. (6) makes no allowance for the effect of the inter-car spacing independently of the relative velocity. For that reason, they proposed the velocity-difference-separation model (VDSM), which takes the separation between cars into account, and the dynamics equation becomes:

ai(t) = κ (Vopt(S(t)) − vi(t)) + λ H(Ṡ(t)) Ṡ(t) (1 + tanh(C1(S(t) − l) − C2))³ + λ Θ(−Ṡ(t)) Ṡ(t) (1 − tanh(C1(S(t) − l) − C2))³  (10)

2.2 Lane changing models

The transfer of a vehicle from one lane to an adjacent lane is defined as a lane change. Lane change, as one of the basic driver behaviours, can never be avoided in the real traffic environment. Lane changing models are therefore an important component of microscopic traffic simulation. Modeling the behaviour of a vehicle within its present lane is relatively straightforward, as the only considerations of any importance are the speed and location of the preceding vehicle. The understanding of lane changing behaviour is therefore important in several application fields such as capacity analysis and safety studies. These lane changing models are categorized into four groups:

– Rule-based models are the most popular ones in microscopic traffic simulators and include those reported in (22), (23). For this type of model, the subject vehicle's lane changing reasons are evaluated first.
If these reasons warrant a lane change, a target lane from the adjacent lane(s) is selected. The gap acceptance model used to determine whether the available gaps should be accepted. – Discrete-choice-based models based on logit or pro- bit models. The lane changing process is usually mo- deled as either MLC or DLC. Mandatory lane chan- ges (MLC) are considered those which occur because of a blocked lane, traffic regulations or in order to fol- low one’s route to destination. Discretionary chan- ges (DLC) are made in order for the subject vehicle to achieve better lane conditions (24). Discrete-choice- based lane changing models follow three steps: 1) checking lane change necessity, 2) choice of target lane, and 3) gap acceptance. – Artificial intelligence models are fundamentally dif- ferent from the rule-based and discrete choice-based models. A major advantage of them is that they can better incorporate human experience and reasoning into the development of lane changing models. – Incentive-based models have been recently proposed to modeling lane changing behavior. From their per- spective, the attractiveness of a lane based on its uti- lity to the driver, and a safety criterion captures the risk associated with the lane change (25). A variety of factors included in these models such as the desire to follow a route, gain speed, and keep right (26), in addition to politeness factors that can describe the dif- ferent driver behaviors (25). In this paper, we describe briefly one the important incentive-based lane changes models. We chose MOBIL (25) as it is the only lane changing model which takes into account the effect of lane change decisions on the immedi- ate neighbors. This model based on the simplistic control rules and it was more appropriate to analyze the affects of usual lane change behaviors of drivers on the overall traffic (24). The lane changing algorithm MOBIL (Minimizing Overall Braking Induced by Lane Changes) is among the most important components of a microscopic traffic simu- lator based on a microscopic longitudinal movement mo- del. A lane change model depends on the two following vehicles on the present and the target lane, respectively as shown in Fig. 3. A specific MOBIL lane change based on the accelerations on the old and the prospective new lanes. To formulate the lane changing criteria shown in Fig. 3 we use the following notation: the vehicle i refers to the lane change of the successive vehicles on the target and present lane referred by n (new one in the target lane) and o (old follower in the current lane). The tildes ãi, ão and ãn denotes the new acceleration of vehicle i on the target lane, the acceleration of the old and new followers after the Figure 3: Vehicles involved in lane changing process lane change of vehicle i, respectively. All the accelerati- ons involved are calculated according to the car-following model (27). A lane change model based on a safety and incentive criterion. The safety criterion is satisfied, if the car-following braking deceleration ãi imposed on the old vehicle o of the target lane after a possible change does not exceed a certain limit bsafe this means: ãi > −bsafe (11) The second criterion determines the acceleration advan- tage that would be gained from the event. This criterion based on the accelerations of the longitudinal model before and after the lane change and focused on improving the traffic situation of an individual driver by letting him drive faster or avoid a slow leader (24). 
For symmetric overtaking rules, they neglect differences between the lanes and propose the following incentive condition for the lane-changing decision of the driver of vehicle i:

ãi − ai + p[(ãn − an) + (ão − ao)] > Δath  (12)

Equation (12) states that the acceleration advantage to be gained by the lane change must be greater than a threshold acceleration Δath, which is used to damp out changes with only a marginal advantage. The politeness factor p determines to what degree these vehicles influence the lane-changing decision. The factor p controls the degree of cooperation while considering a lane change, from a purely egoistic behaviour (p = 0) to an altruistic one (p > 1) (25). The politeness factor can be thought of as accounting for driver aggressiveness. It is this balancing of accelerations that gives rise to the name MOBIL, as Minimizing Overall Braking Induced by Lane changes (27).

3 Proposed approach

In comparison with the existing works above, our proposal in this paper provides an extended car-following model with an interaction of lane change behaviour, which is essential for simulating and representing traffic flow in a realistic manner. The proposed approach is detailed in the following section.

3.1 Flowchart of the proposed approach

For an ideal flow of a dynamic traffic simulation study, we propose the basic algorithm presented in Fig. 4, which is based on three major steps:

– Preparation of the traffic flow simulation: in this step, we define the road environment and specify the initial parameters and variables, including the initialization of position, velocity, and so on.

– Implementation of the model and validation of its different scenarios: in this stage, we adopt our MVSD model to compute the acceleration for each car and then compute the new speed and position on both lanes for the next time step. At the same time, we apply the lane changing rules: we determine which car changes lane and where to, add these cars at the correct position on the new lane, and remove the changed cars from their old lane.

– Analysis of results: for the next time step, we update the network and information state to get a new velocity and position state; then we jump to step 2 and begin another cycle.

3.2 Modified velocity separation difference model

In this paper, we propose a modified car-following model and introduce lane changing rules, as in other studies. In ref. (21), the authors modified an OV model by introducing a new OV function, without using lane change behaviour, to obtain a model that is more reactive in braking situations, called the modified optimal velocity model (MOVM). The motivation for our paper comes from the key idea behind the new optimal velocity function proposed by (21), which we incorporate into the VSDM together with lane changing behaviour. The new OV function combines the reference OV function of Eq. (2) with a weighting factor of the form of Eq. (8) that depends on the inverse of the time to collision (TTC). The TTC concept was introduced by the US researcher (28), and it has been used in different studies as a time-based surrogate safety measure for evaluating collision risk (29)(30)(31). In car-following situations, the TTC indicator is only defined when the speed of the following vehicle is higher than the speed of the lead vehicle (31).
Rear-end collision risk is defined as the time to the collision of two vehicles if they continue at their present speeds and in the same lane (see Fig. 5). The time to collision of a vehicle-driver combination n at instant t with respect to a leading vehicle n−1 can be calculated as:

TTC = S(t)/Ṡ(t), ∀ Ṡ(t) > 0  (13)

The new optimal velocity function Vopt^new(S, Ṡ) is expressed as the combination of the optimal velocity function proposed by (18), which is based only on the headway stimulus, with a weighting factor based on the inverse of the time to collision, in order to make the model more reactive in the braking case:

Vopt^new(S, Ṡ) = Vopt(S) · W(S, Ṡ)  (14)

where the weighting factor is:

W(S, Ṡ) = A(1 + tanh(B(Ṡ/S + C)))  (15)

The weighting factor must satisfy some properties:

– When the relative speed is positive, Ṡ(t) > 0, the weighting must leave the reference OV function unchanged.

– For a negative, decreasing relative speed, Ṡ(t) < 0, it has to be decreasing and has to go towards zero as Ṡ(t) tends to −∞.

There are several functions which behave similarly, varying only in the headway stimulus. Here, the new OV function modulates the reactivity of the car-following model according to the actual headway and the relative speed between the follower and the car ahead. In our contribution, we revised and extended the velocity-separation difference model by incorporating the new OV function, obtaining a new model called the Modified Velocity Separation Difference Model (MVSDM). The MVSD model is expressed by the following equation of motion:

ai(t) = κ (Vopt^new(S(t), Ṡ(t)) − vi(t)) + λ H(Ṡ(t)) Ṡ(t) (1 + tanh(C1(S(t) − l) − C2))³ + λ Θ(−Ṡ(t)) Ṡ(t) (1 − tanh(C1(S(t) − l) − C2))³  (16)

To describe real driving behaviour on multilane roads, we need both the car-following process and the lane changing process. The lane changing behaviour has a significant effect on traffic flow, and the understanding of lane changing behaviour is therefore important in several application fields such as capacity analysis and safety studies. We are particularly interested in the lane changing algorithm MOBIL (Minimizing Overall Braking Induced by Lane Changes), which is among the most important components of a microscopic traffic simulator based on a microscopic longitudinal movement model (25), and it is adopted here.

Figure 4: Flowchart of the proposed approach.

Figure 5: Time to collision for rear-end collision sketch.

4 Simulation results

In this study, we carry out simulations to investigate whether the MVSDM can overcome the shortcomings of the previous models, and we compare the MVSDM with the MOVM proposed by (21). For each model, we establish the simulation results for two different scenarios. In the following, we test the proposed approach (accelerating and braking behaviour) using the open-source microscopic simulator proposed by (4) in order to validate our approach in these scenarios. We used two vehicle classes: cars and trucks. For all simulations, the parameter values used for the optimal velocity function of Eq. (4), adapted from (19), are V1 = 6.75 m/s, V2 = 7.91 m/s, C1 = 0.13 m⁻¹, and C2 = 1.57 m⁻¹. The parameter values calibrated for the weighting factor (21) are A = 0.5, C = 0.5, and B = 5 s. The sensitivity parameter values are a = 0.6 m/s² and λ = 0.45 m/s². The parameter values for cars are the desired velocity V0 = 120 km/h, the safe time headway T = 1.2 s, the minimum gap S0 = 2 m, and the vehicle length l = 6 m.
The parameters values for trucks are the desired velocity V0 = 80km/h, the safe time headway T = 1.7s, the minimum gap S0 = 2m, and the vehicle length l = 10m. The parameters values for lane chan- ging are the politeness factor p = 0, the changing thres- hold 4ath = 0.2m/s2, the maximum safe deceleration bsave = 12m/s 2, and the bias for the slow lane 4abias. For more information about the simulation results, we built a video to visualize clearly the validity of our proposed mo- del MVSDM and the existing model MOVM and VSDM in the following link https://www.youtube.com/ watch?v=LJ5ddRGVbgA&feature=youtu.be. When starting the simulation, we extract the necessary data in excel format in order to represent them in graph form, and this is done for each car following model and for each scenario. Figure 6 shows the resulting data (speed, acceleration, position, type of car, length, etc.) 4.1 Ramp scenario: behavior in stop and go traffic Stop and go scenario demonstrates the traffic breakdown provoking on the main road of the on-ramp. Usually, the traffic jam occur when the leading car decelerate for cer- tain reasons. For that, it’s important to study the vehicle behavior when simulating in such case. Simulation results depicted in Fig. 7d show that the proposed model avoids the collision when the leading car decelerate hardly. Howe- ver, simulating traffic flow with MOV model occurs crashes between different cars as we can see in Fig. 7b. At t = 0, all cars start up according to the MOVM, VSDM, and MVSDM, respectively. From Fig. 8, it can be seen that the speed maximum of MVSDM is under of MOVM and VSDM. We can see that MFVDM velocity be- gins to decrease before MOVM and VSDM velocity rea- ches its maximum. The simulation results demonstrate that MOVM and VSDM provokes crashes. In contrast, our pro- posal MVSDM avoid it and the traffic jams disappear. To simulate the car motion and to describe the traffic flow, we examine certain properties of traffic from each car. Microscopic Evaluation of Extended Car- following Model in. . . Informatica 42 (2018) 117–125 123 Figure 6: Example of resulting data according to MVSDM Figure 7: Simulation of ramp according to (a) OVM and (b) MOVM(c) VSDM and (d) MVSDM Figure 8: Time evolution of velocity variation according to MOVM, VSDM, and MVSDM Figure 9: Position variation according to MOVM, VSDM, and MVSDM Figure 9 gives the position evolution of four simulated cars, it’s seen that the previous models provokes the collision. In contrast, our proposed approach avoids it. 4.2 Traffic lights scenario: behavior at stopping and approaching traffic signal The traffic lights scenario describes the driving behavior of the vehicle when approaching a traffic signal. First a traf- fic signal is red and a queue of vehicles is waiting which the optimal velocity is 0. When the signal turn to green, at t = 0, vehicles start. For that, the traffic lights signal is represented by virtual obstacles in each lane which is re- moved when the light turns to green. Figure 10 represents the velocity variation of two vehicles using the MVSDM in the case of several changes. At the beginning, vehicle 1 follows vehicle 2 in the same lane 0, after a few mo- ments vehicle 1 change the lane 0 towards the lane 1 that is why two vehicles show themselves in parallel when appro- aching traffic lights at t = 57. In approaching phase, and at t = 72 vehicles should decelerate smoothly which clearly shown that the vehicles stopped completely at a red light, and their velocity goes to 0. 
When the signal changes to green, vehicles begin to accelerate. Figure 11 shows the behavior of vehicle according MOVM, MVSDM, and VSDM. Through these results, we deducted that the velocity of vehicle applying MOVM doesn’t go to 0 that means all vehicles don’t stop at a red light. However, when we simulate applying VSDM and MVSDM, all vehicles behave correctly by stopping at a 124 Informatica 42 (2018) 117–125 H. Lazar et al. Figure 10: Driving behavior of two vehicles according MVSDM in each lane Figure 11: Simulation results according to MOVM, VSDM and MVSDM when approaching traffic lights signal red light and moves when its turn to green. It’s show cle- arly that the MVSD model react the realistic manner than MOVM and VSDM in braking case. Figure 12 represents the snapshot of vehicle motion and their behavior according to MVSD, VSD, OV, and MOV models. Through these results, and when approaching traf- fic lights, it can be observed that the vehicles collide in the previous models. However, the problems of collision in emergency case were solved. Furthermore, the simulation results show that our proposed approach can exactly des- cribe the driver’s behavior when approaching traffic signal, where no crash occurs. Figure 12: Simulation at traffic signal results according to OVM, MOVM, VSDM and MVSDM when approaching and stopping traffic lights signal 5 Conclusion Through introducing the new optimal velocity function which takes into account not only the headway, but also the relative speed parameter into the VSDM, the modified velocity-separation difference model (MVSDM) is presen- ted considering the driving behavior of the vehicle in bra- king case. In addition, to simulate in a realistic manner, we proposed to combine the proposed model with lane change model. The MVSDM can exactly describe the driver beha- vior under two proposed scenarios: when approaching traf- fic signal and an on ramp road, where no collision occurs. We can see that MVSDM is much close to the reality. Ho- wever, the collision and crashes occur in the previous mo- dels. We proposed as a future work, to validate the model in bidirectional road scenario with multilane. Literature [1] Muhammad, J. F., (2015) Modeling and Analysis of Inter-vehicle Communication: A Stochastic Geome- try Approach. Thesis. pp. 1-100. [2] Zhu, W., Liu, Y., (2008) A Total Generalized Op- timal Velocity Model and Its Numerical Tests. J. Shanghai Jiaotong Univ. (Sci.). 13(2), pp. 166170. DOI: 10.1007/s12204-008-0166-9. [3] Sven, M., Bart, D. M., (2005) Transpor- tation Planning and Traffic Flow Models. ArXiv preprint physics/0507127. pp. 1-51. DOI:https://arxiv.org/abs/physics/0507127. [4] Treiber, M. and Kesting, A. (2010) An Open- Source Microscopic Traffic Simulator. IEEE Intelli- gent Transportation Systems Magazine, 2(3), pp.6- 320. DOI:10.1109/mits.2010.939208. [5] Aghabayk, K., Sarvi, M., Young, W. (2015) A State-of the-Art Review of CarFollowing Mo- dels with Particular Considerations of Heavy Vehi- cles. Transport Reviews. 35(1), pp. 82-105. DOI: 10.1080/01441647.2014.997323. [6] Zhi-Peng, L., Yun-Cai, L. (2006) A velocity- difference-separation model for car-following the- ory. Chinese Physics. 15(7), pp. 1570-1576. DOI: 10.1088/1009-1963/15/7/032. [7] Jabeena, M. (2013) Comparative Study of Traffic Flow Models And Data Retrieval Methods From Vi- deo Graphs. International Journal of Engineering Re- search and Applications. 3(6), pp. 1087-1093. [8] Brackstone, M., McDonald, M. (1999) Car- following: a historical review. 
Transportation Rese- arch Part F: Traffic Psychology and Behaviour. 2(4), pp. 181-196. DOI: 10.1016/s1369-8478(00)00005-x. Microscopic Evaluation of Extended Car- following Model in. . . Informatica 42 (2018) 117–125 125 [9] Darbha, S., Rajagopal, K., Tyagi, V. (2008) A re- view of mathematical models for the flow of traffic and some recent results. Nonlinear Analysis: Theory, Methods and Applications. 69(3), pp. 950-970. DOI: 10.1016/j.na.2008.02.123. [10] Bellomo, N., Dogbe, C. (2011) On the Modeling of Traffic and Crowds: A Survey of Models, Speculati- ons, and Perspectives. SIAM Review. 53(3), pp. 409- 463. DOI: 10.1137/090746677. [11] Hoogendoorn, S., Bovy, P. (2001) State-of-the-art of vehicular traffic flow modelling. Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering. 215(4), pp. 283- 303. DOI: 10.1177/095965180121500402. [12] Wilson, R., Ward, J. (2011) Car-following mo- dels: fifty years of linear stability analysis- a mathematical perspective. Transportation Plan- ning and Technology. 34(1), pp. 3-18. DOI: 10.1080/03081060.2011.530826. [13] Papageorgiou, M. (1998) Some remarks on ma- croscopic traffic flow modelling. Transportation Re- search Part A: Policy and Practice. 32(5), pp. 323- 329. DOI: 10.1016/s0965-8564(97)00048-7. [14] Orosz, G., Wilson, R., Stepan, G. (2010) Traffic jams: dynamics and control. Philosophical Transactions of the Royal Society A: Mathematical, Physical and En- gineering Sciences. 368(1928), pp. 4455-4479. DOI: 10.1098/rsta.2010.0205. [15] Lazar, H., Rhoulami, K., Rahmani, M. D. (2016) A Review Analysis of Optimal Velocity Models. Perio- dica Polytechnica Transportation Engineering, 44(2), pp.123-131. DOI : 10.3311/pptr.8753. [16] Helbing, D. (2001) Traffic and related self-driven many-particle systems. Reviews of Modern Physics, 73, pp.10671141. DOI:https://arxiv.org/abs/cond- mat/0012229. [17] Newell, G. F. (1961) Nonlinear effects in the dyna- mics of car-following. Operations Research. 9(2), pp. 209-229. DOI: 10.1287/opre.9.2.209. [18] Bando, M., Hasebe, K., Nakayama, A., Shibata, A., Sugiyama, Y. (1995) Dynamical model of traf- fic congestion and numerical simulation. Physical Re- view E. 51(2), pp. 1035-1042. DOI: 10.1103/phys- reve.51.1035. [19] Helbing, D., Tilch, B. (1998) Generalized force mo- del of traffic dynamics. Physical Review E. 58, pp. 133-138. DOI: 10.1103/physreve.58.133. [20] Jiang, R., Wu, Q., Zhu, Z. (2001) Full velocity diffe- rence model for a carfollowing theory. Physical Re- view E. 64(1). DOI: 10.1103/physreve.64.017101. [21] Mammar, S., Mammar, S., Haj-Salem, H. (2005) A Modified Optimal Velocity Model for vehicle fol- lowing. IFAC Proceedings Volumes, 38(1), pp.120- 125.DOI:10.3182/20050703-6-cz-1902.02043. [22] Gipps, P. G. (1986) A model for the structure of lane-changing decisions. Transportation Research B. 20(5),pp.403-414.DOI:https://doi.org/10.1016/0191- 2615(86)90012-3. [23] Halati, A., Lieu, H., Walker, S. (1997) CORSIMCor- ridor traffic simulation model. in Proc Traffic Congest Traffic Safety 21st Century Conf., pp. 570576 [24] Umer, K., Pavlos, B., Lars, Sc., Alexandros, N., Dimitrios, K. (2014) Analyzing Cooperative Lane Change Models for Connected Vehicles. Internati- onal Conference on Connected Vehicles and Expo (ICCVE). DOI:10.1109/ICCVE.2014.136 [25] Kesting, A., Treiber, M., Helbing, D. (2007) Gene- ral Lane Changing Model MOBIL for Car-Following Models. 
Transportation Research Record: Journal of the Transportation Research Board, 1999, pp.86- 94.DOI:http://dx.doi.org/10.3141/1999-10. [26] Schakel, W.,Knoop, V.,VanArem, B. (2012) Integra- ted lane change model with relaxation and synchro- nization. Transportation Research Record, pp. 4757. DOI: https://doi.org/10.3141/2316-06. [27] Caprani, C. C., Enright, B., Carey, C.(2012) Lane changing control to reduce traffic load effect on long- span bridges. F. Biondini, D.M. Frangopol, Eds, 6th International Conference on Bridge Maintenance, Sa- fety and Management, Stresa, Italy. Taylor and Fran- cis. [28] Hayward, J.C. (1972) Near miss determination through use of a scale of danger (traffic records 384). Highway Research Board, Washington,DC. [29] Behbahani, H., Nadimi, N., Alenoori, H., Sayadi, M. (2014) Developing a New Surrogate Safety Indicator Based on Motion Equations. PROMET - Traffic and Transportation, 26(5).DOI:10.7307/ptt.v26i5.1388. [30] Minderhoud, M., Bovy, P. (2001) Extended time-to- collision measures for road traffic safety assessment. Accident Analysis and Prevention, 33(1), pp.89- 97.DOI:10.1016/s0001-4575(00)00019-1. [31] Vogel, K. (2003) A comparison of headway and time to collision as safety indicators. Ac- cident Analysis and Prevention, 35(3), pp.427- 433.DOI:10.1016/s0001-4575(02)00022-2. 126 Informatica 42 (2018) 117–125 H. Lazar et al. Informatica 42 (2018) 127–136 127 Prediction of Sentiment from Macaronic Reviews Sukhnandan Kaur and Rajni Mohana Department of CSE, JUIT, Waknaghat, 173234, India E-mail: sukhnandan.kaur@mail.juit.ac.in, rajni.mohana@juit.ac.in Technical paper Keywords: macaronic language, sentiment analysis, supervised learning, normalization Received: March 11, 2017 Web-sphere is the vast ocean of data. It allows its users to write their opinion, suggestions over various social platforms. The users often prefer to write in their native language or some hybrid content (i.e., combination of two or more languages). It’s also observed that people use a word or two of their native language in a text of base language. The presence of native words along with base language is known as macaronic languages. For example: Dunglish (Dutch and English), Chinglish (Chinese and English), Hin- glish (Hindi and English) The use of macaronic languages over the web is on the rise these days. This type of text generally doesn’t follow any syntactic structure, thus making processing of the content difficult. This paper deals with extracting meaningful information of a text containing macaronic content. It also facilitates the need of expert analysers for the processing of such content to take effective decisions. The performance of various decision support systems is dependable over these analysers. Therefore, this paper presents an algorithm which initially normalizes the content to its base language; later performs sentiment analysis over it. The experimental results using proposed algorithm indicates a trade-off between various performance aspects. Povzetek: Prispevek predstavi iskanje razumevanja makaronskega besedila, tj. besedila z dodanimi bese- dami drugega jezika. 1 Introduction Online review communities successfully allow its users to write their opinion, suggestions over various social plat- forms. These reviews greatly affect the decision to buy or sell any product and to use any service. It is fruitful to the manufacturer or service provider to enhance the pro- ductivity. 
Automatic decision support systems take these reviews into account for sentiment analysis. However, it is extremely difficult to have reviews in uniform language. During an automatic processing of reviews written online, it is found that 2/3 of the internet users are non-English [5] . The reason behind this is that most of the people have the ability to learn only 2 or 3 languages proficiently. In this technological world, people have equal priority to write over the internet among different languages. People who write reviews belong to different communities from different regions of the world; they have the freedom to use their native language too. When a text contains more than one language, it is called as multilingual text. If a single sentence contains more than one language, then it is called as macaronic text[18]. Example 1: Samsung aQCA cellphone , In the above mentioned text, it is taken as macaronic con- tent containing Hindi and English languages. These irregularities found in the data over the internet make the processing more complex. Due to the scarcity of the language resources over the web, it becomes very diffi- cult to handle all the possible languages over the globe. It is a challenging task of a natural language processing group. The formalism in sentiment analysis limits the system to specific users. The reviews from all the users of a particu- lar entity are valuable. It increases the need of automated systems to handle multilingual content. Derkacz et.al.[12] stated some of the requirements to have a multilingual au- tomated system. These requirements are further taken care by language processors to build a multilingual system. In case multilingual systems, the language of whole document is taken into account whereas for macaronic language pro- cessing, we need to detect the language of each word. This paper proposes a sentiment analyser which deals with the macaronic text. Initially, reviews are to be normalized du- ring pre-processing stage. Later, these reviews are proces- sed through sentiment analyser. This paper is organised as: section 2 describes the state of the art sentiment analysers. In section 3, system design and algorithm is proposed. Experimental analysis using va- rious performance metrics are presented in section 4. Fi- nally, the whole work is concluded in section 5. 128 Informatica 42 (2018) 127–136 S. Kaur et al. 2 Related work Numerous researchers have worked in the field of natural language processing. Kaur et.al.[14] presented sentiment analysis of reviews written in Punjabi language. The rese- archers collected the reviews written in Punjabi which af- terwards segregated into positive or negative reviews. Das et.al.[8] found the need of having SentiWordNet for Ben- gali language. Their work helped the researchers in the field of sentiment analysis. The researchers annotated the required lexicon. Das et.al.[7] worked for sentiment ana- lysis of reviews written in Bengali language. In this paper, the researchers have used support vector machine (SVM) with Bengali SentiWordNet. The paper presents the fea- ture extraction for Bengali language. Das et.al.[6] deve- loped subjectivity clues based on theme detection techni- ques. Bengali corpus is used in their work and later com- pared the results with English subjectivity detection. Das et.al.[9] developed a gaming theory by which researchers can easily build the SentiWordNet in the required language. This work demands the respective linguistic experts. 
Joshi et.al.[13] used supervised learning approach for their work by using Hindi- SentiWordNet for their work. In this pa- per, researchers used standard translation techniques to pre- serve the polarity of each document while translating it. Bakliwal et.al.[2] worked for detecting subjectivity based on graph theory. Researchers explored the effect of syno- nym and antonym over the subjective nature of the docu- ment. The results were good for Hindi and English. The researchers claimed that their strategy will work well in ot- her languages too. Das et.al.[10] developed a system for deducing the emotion and intensity of emotion based on sentiment hidden in the data. In this work, researchers have used supervised learning methods for their work. Ri- cha et.al.[21] presented a survey for sentiment analysis in Hindi language. The results have shown that sentiment analysis in Hindi language is complex as compared to En- glish language. The reason behind this complexity is the non-uniform nature of the Hindi language. Various rese- arch challenges are also discussed. Researchers[21] deve- loped a system which depicts the polarity of the text and tested their system over the Hindi movie reviews. Parul et.al.[1] developed a sentiment analyser for movie reviews written in Punjabi language using various machine learning algorithms. Raksha et.al.[20] used semi-supervised techni- que for polarity detection in Hindi movie reviews. In their work, researchers reported 87% accuracy of the proposed system by using bootstrapping and graph based approach for sentiment analysis. Pooja et.al.[17] used Hindi Senti- WordNet for finding opinion orientation of the reviews. Re- searchers used unsupervised learning for their work. Ker- stin et.al.[11] developed a system for multilingual text for obtaining the polarity of reviews written in language other than resource rich language English. Researchers used a standard translation methodology and supervised learning for sentiment analysis. C. Banea et.al.[3] developed a sy- stem which focused on the sentiment analysis based on translation of input document other than English. In their work, researchers used English as a source language. They used supervised learning approach for their work. For the translation of the text correctly various available translators are used. i.e. Goggle, Moses, Bing translators. The work by different researchers is summarized into ta- ble 1 . It is noticed that researchers are focusing well in the area of multilingual sentiment analysis. Researchers focused in finding document language for translating any document into base language instead of language of indi- vidual word. This sometimes discard the opinion bearing word written in any foreign language. As in example 1, the word aQCA, means good is discarded if the document language is detected as English. The efficient processing of such documents is required to increase the effectiveness of the decision support system. 2.1 Motivation After looking into the scenario, we found that we need Sen- tiWordNet in almost every language all over the global. It is very complex task. The motivation behind the proposed system is that the existing system for multilingual senti- ment analysis is unable to process macaronic data. The rise in the volume of macaronic data over the internet arise the need of proposed system. The reasons for having macaro- nic content over the web in huge volume are as follows: 1. 
Scarcity of Resources: Sentiment analysis task de- mands for the availability of lexicons or data of any particular language. There is huge variation in every language model. This makes the model used for one language cannot be used for other languages. For example: Chinese language model does not con- sider spaces while as other models focus mainly over spaces to tokenize. 2. Lack of uniformity of languages: Most of the lan- guages often follow their own traditional structures. Thus, processing of each language data with the gene- ral structure model gives unsatisfactory results. For example: English language use Subject-Verb- Object(SVO) while Hindi Language model follow Subject-Object-Verb (SOV) 3. Freedom of writing in native language: People these days have number of followers from different coun- tries through various online applications. They are also able to propagate their ideas through it. Someti- mes, few words they prefer writing in their own native language, which may not be understandable by some of the followers. In case of an automated system, during pre-processing through one language model, these native words may be neglected taken as foreign language words. Sometimes, we may lose meaningful information during this type of pre-processing. For example: s{ms\g is on great demand. s{ms\g(Samsung) is negected by English language Prediction of Sentiment from Macaronic Reviews Informatica 42 (2018) 127–136 129 Author Work Level Language Results Technique Corpus Year Danet et.al.[5] Classification of re- views into positive or negative opinion Document level Punjabi Accuracy = 75% Machine Learning Blogs 2014 Derkacz et.al.[18] Classification of re- views into positive, negative, neutral or emotion (sad, happy, etc) Document level Bengali Precision = 70.04%, Recall = 63.02% Machine Learning Custom Lexicon 2010 Das et. al.[14] Document are separa- ted based on Domain independent subjecti- vity and factual con- tent Sentence Level Bengali Precision = 70.04%, Recall = 63.02% Machine Learning Custom Lexicon 2009 Bandyopadhyay et. al.[6] Sentiment analysis of Hindi reviews, English reviews using Hindi SentiWordNet Document Level Hindi, English Precision = 70.04%, Recall = 63.02% Supervised Movie reviews 2012 Joshi et. al.[9] Subjectivity clues ba- sed on antonym and synonym using graph theory Document Level Hindi, English Accuracy = 79% Supervised Movie reviews 2012 Sharma et. al.[10] Polarity detection of movie reviews using unsupervised techniques Sentence Level Punjabi NA Unsupervised Movie reviews 2015 Arora et. al.[21] Sentiment orientation of reviews written in Hindi language Document Level Hindi Precision = 70.04%, Recall = 63.02% Unsupervised Movie reviews 2014 Sharma et. al.[1] Sentiment analy- sis using Semi- Supervised techniques Document Level Hindi Accuracy = 87% Semi- Supervised Movie reviews 2014 Pandey et. al.[20] Opinion orientation of Hindi movie reviews is deduced using Hindi-WordNet Document Level Hindi NA Unsupervised Movie reviews 2015 Denecke et. al.[17] Polarity detection from reviews using standard translation of German reviews in English afterwards find the polarity Document Level German, English Accuracy = 66% Supervised Movie review 2008 Banea et. al.[11] Enabling Multilingual question answering system Document Level French, Ger- man and Spanish NA Supervised Question Answers 2016 Table 1: State of Art Multilingual Sentiment Analysis 130 Informatica 42 (2018) 127–136 S. Kaur et al. model. Thus, it becomes difficult to extract samsung as an entity. 4. 
For getting point of attraction: People use the mul- tilingual content or some fancy words in various applications like product advertisements, shop names, etc. This makes the task of processing such web content complex. For example: samsung (Samsung) is on great demand. samsung (Samsung) is on great demand. m ona(Mona) is feeling so good. Hence, from the above examples, Samsung is hard to detect as it is being neglected by chosen language model. Due to the above mentioned reasons, it is very much ne- cessary to have an efficient system to process macaronic language content. Our contribution is to enhance the per- formance i.e.precision, recall and accuracy using supervi- sed sentiment analysers. The proposed system is with less fallout which shows its high efficiency. 3 System design The proposed system as shown in Figure 1 applies a vari- ant of techniques for normalization of macaronic text and classification of reviews. The system consists three major components: 1. Language Processing 2. Text Processing and 3. Sentiment Analysis A component based on language detection is carried out using algorithm 1. The core idea of this component is to normalize the macaronic content. Other two components are carried out using algorithm 2. It normalizes the content to extract the SentiStrength of each document. Combina- tion of these two algorithms (Algorithm 1 and Algorithm 2) is used to carry out sentiment analysis for multilingual or macaronic language documents. Figure 1: Proposed System Design 1. Language Processing: It is the primary component of the proposed system. In this component tokenization, language detection and conversion of tokens to its base language is carried out. These sub-components are described as follows: (a) Tokenization: It is the basic unit of any language processing task. A sequence of sentences, words or characters are passed as an input to any sy- stem. The output of this phase is tokens. It can be done at different levels depending upon the level of granularity: sentence level, word level, character level as shown in table 2. The proposed system is based on word level tokenization for macaro- nic language. E.g. Samsung has a good market value. Users are happy with its mobile products. (b) Language Detection/ Translation: For language detection, we have used PoS[19] tagging, as shown in table 3. The unrecognized or untagged tokens can be passed through language detection module. The output of this phase is the tokens in the base language of the system. i.e. Taking En- glish as a base language. If the token is found in Hindi WordNet then Hindi to English translator is applied to it. On the other end, if the word be- longs to Punjabi language, it is passed through the Punjabi to English translator. It is a general procedure which can be applied to various other languages too. 2. Text Processing: It is the second important component of the proposed system. It carries various sub-tasks described as follows: (a) Normalization: After filtration of subjective sen- tences, normalization is to be done. The pro- cess of normalization is to regularize or process the grammatical variants present in the sentence. Grammatical variants include past verbs (regu- lar and irregular) / present verbs, classification of noun phrases in singular and plural. In norma- lization, finding the abbreviations, case folding, etc.Normalization is a process having data in a well format as required for appropriate proces- sing. 
It includes: Level of Pro- cessing Number of Tokens Sentence Level 2 Word Level 13 Character Le- vel 74 Table 2: Tokenization at different levels Prediction of Sentiment from Macaronic Reviews Informatica 42 (2018) 127–136 131 i. Handling Slangs: Slangs are playing in- dispensable role in opinion mining. So, it will be worthless to reject all the slangs by counting them as stop words. Various algo- rithms are applied to handle different types of slangs.Types of slangs[5]: – Emoticons: Bad , happy – Interjections:Mmmmm-pleasure, hmmmm-wondering, Mhmmm- confirmation – Intensionally misspelled: cooooooool, goooooooood, nyt, etc – Alphanumeric strings: gr8, 9t, etc. Test sentence: She is flying high by having this cellphone. She is flying high by having this cellphone. Happy ii. Idiomization / Replacement of idioms with their actual meaning: In English literature, idioms play very important role in fixing the opinion from sentence about the particular entity. If the stops words are removed then some words which may or may not the part of the idiom can be rejected. In reality, these words are highly contributed to the opinion. Test sentence: She is flying high by having this cellphone. Happy She is very happy by having this cellphone. Happy (b) Tokenization: In our work, we have used word level tokenizer as mentioned in table 2 . The reason behind this to process each token accor- ding to its own language instead according of language of the document. (c) PoS Tagging: Part of speech tagging plays a vital role in natural Language processing tasks. Initi- ally, we have tried to focus whether the state of art PoS taggers are able to recognise a foreign word. For this purpose, we have used NLTK tagger[15] and Stanford Tagger[16]. We have shown the results of both the taggers for vari- ous test sentences in table3. We have found va- rious untagged tokens which are then processed through language processing phase. 3. Sentiment Analysis: In this module, the potency of each review is calculated. The magnitude of the senti- ment associated with each document is calculated by aggregating all the review’s sentiscore corresponding to that document. SentiWordNet is the base for get- ting the actual magnitude of the sentiment of a do- cument. For our work, we have used SentiWordNet v3.0.0. Sentiscore corresponds to each document is taken as an output as shown in table4 . 4 Evaluation 4.1 Dataset We have extracted a corpus of reviews of 10 movies contai- ning 200 movie reviews i.e.100 positive and 100 negative; 160 reviews were used for training and 40 for testing. Each review has a size ranges from 500 to 1000 words. Initi- ally, classification of the corpus is elaborated according to user′s scoring: reviews are marked between 3 and 5 star rating are classified as positive whereas reviews marked between 0 and 2 are taken as negative. This prior classi- fication is based on the assumption that the star rating is correlated to the sentiment of the review. For experiment evaluation, the data was pre-processed with the TreeTag- ger5, POS tagger and lemmatization tool. We have used Support Vector Machine (SVM), Nave Bayes, kNN and convolutional network as classification models to train the system and classify movie reviews. The reviews are not monolingual. These reviews are macaronic in nature i.e. it consists of more than one language i.e. Hindi and English in a single review. We manually annotate the reviews based on language of each token. 
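Since the reviews are annotated at the level of individual tokens, a minimal sketch of the word-level processing described in Section 3 (tokenization, script-based language detection, mapping to the base language, and SentiWordNet-style aggregation, as in Algorithms 1 and 2 later in the paper) may help fix ideas. The tiny translation table and polarity lexicon below are hypothetical stand-ins for the Hindi-English translator and SentiWordNet v3.0.0, and all function names are ours.

```python
# Toy sketch of the word-level pipeline described above: detect each token's
# script from its Unicode range (the "encoding based on UTF-8" step), map
# non-English tokens to the base language, then aggregate a SentiWordNet-style
# score per review. The mini lexicon and translation table are hypothetical
# placeholders, not resources used by the authors.

DEVANAGARI = (0x0900, 0x097F)                         # Unicode block for Hindi script

HI_TO_EN = {"अच्छा": "good"}                           # hypothetical translation entries
SENTI = {"good": +0.47, "bad": -0.62, "happy": +0.5}  # hypothetical polarity weights

def is_devanagari(token: str) -> bool:
    return any(DEVANAGARI[0] <= ord(ch) <= DEVANAGARI[1] for ch in token)

def normalize(token: str) -> str:
    """Bring a token into the base language (English)."""
    if is_devanagari(token):
        return HI_TO_EN.get(token, token)             # fall back to the raw token
    return token.lower()

def senti_strength(review: str) -> float:
    """Word-level tokenization + per-token scoring, summed per review
    (cf. the weighted SentiStrength aggregation of Algorithm 2)."""
    tokens = [normalize(t) for t in review.split()]
    return sum(SENTI.get(t, 0.0) for t in tokens)

print(senti_strength("Samsung अच्छा cellphone"))       # 0.47 with the toy lexicon
```

With a real SentiWordNet and translator in place of the toy dictionaries, the same structure would yield per-document sentiscores of the kind shown in Table 4.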
The guidelines for annotation are stipulated the need of retaining the semantic structure of tokens. Five different graduate students participated in the reviewing process to formulate Gold Standard. To eva- luate the inter-personnel disagreement, we have used kappa measure[4] and score 0.61 is obtained. 4.2 Performance Formally, the performance of proposed sentiment analyser, PSA is a function of four factors as follows: PSA(l,Ld,t,Es) Where Ld is Language Detection l is a Learning Algorithm t is a Tagger Es is a Experimental Setup The performance of the analyser is directly affected by the choice of optimal parameters for each factors mentio- ned above. In the case of optimal parameters choice for each of the factor, sentiment analyser gives maximum per- formance (PSAmax). On the other end, training consists machine translated data and testing of the learning algorithm is based on the human annotated dataset i.e. Gold Standard. The perfor- mance of sentiment analyser (PSA) is negatively affected by error in language detection phase (ELd) as given in equation 1 . PSA = PSAmax − ELd (1) In case of optimal parameters, ELd → 0, PSA = PSA max 132 Informatica 42 (2018) 127–136 S. Kaur et al. Test Sentence Pos tagging by NLTK tagger Stanford tagger mFEwyA gyAn kA ek ÿ aQCA srot h{\ mFEwyA—NN gyAn—:kA—:ek—:ÿ aQCA—srot—h{\— mFEwyA/VBZ gyAn/NNP kA /NNP ek /NNP ÿ aQCA/NNPsrot /NNP h{\ /NNP media is aQCA source of knowledge media—NNS is—VBZ ÿ aQCA—: source—NN of—IN knowledge—NN media/NNS is/VBZ aQCA/JJ source/NN of/IN knowledge/NN mFEwyA gyAn kA ek good srot h{\ mFEwyA—NN gyAn—:kA—:ek—:good —JJ srot —h{\— mFEwyA/VBZ gyAn/NNP kA /NNP ek /NNP good/JJ srot /NNP h{\ /NNP media gyAn kA ek ÿ aQCA srot h{\ media—NNS gyAn—:kA—:ek—:ÿ aQCA—srot—h{\— media/NNS gyAn/NNP kA /NNP ek /NNP aQCA/NNP srot /NNP h{\ /NNP Table 3: Tagging of various test sentences using NLTK and Stanford Tagger Test Sentence SentiStrength texttt mFEwyA is good source of knowledge 0.47 media is good source of knowledge 0.47 mFEwyA gyAn kA ek ÿ aQCA srot h{\ 0 media is aQCA source of knowledge 0 mFEwyA gyAn kA ek good srot h{\ 0.47 media gyAn kA ek ÿ aQCA srot h{\ 0 Table 4: Sentiscore Associated With Review Metric Target Target Selected tp fn Selected fp tn Table 5: Confusion metric used to evaluate performance 4.3 Performance metric For the analysis of results, the following performance me- trics are used by various natural languages processing task including sentiment analysis. It includes precision, recall, F-measure and accuracy. These measures can be calculated using the confusion metric given in table 5. Precision: It is defined as fraction of retrieved documents that are relevant. It is calculated using equation 2. P = number of correct positive or negative documents detected by the system no. of positive/negative documents detected by the system (2) Recall: It is defined as fraction of relevant documents that are retrieved. It is calculated using equation 3. R = number of positive or negative documents detected by the system no. of positive/negative documents present in the Gold Standard test set (3) F-measure: It is a harmonic mean with takes precision and recall both into account. It is a consecutive average of precision and recall. F-measure with α = 0.5, means ta- king precision and recall at equal weightage.It is calculated using equation 4. F = (α2 + 1)× P ×R α2(P +R) (4) Accuracy: it is the fraction of classifications that is correct. . It is calculated using equation 5. 
A = (tp + tn) / (tp + tn + fn + fp) (5)

Fall-out: It is a measure of the proportion of non-targeted items that are mistakenly selected. It is calculated using equation 6.

FO = fp / (tn + fp) (6)

4.4 Results and analysis

The outcomes of our experimental study are presented in Table 6 and Table 7. We can easily notice that every machine learning approach has its own pros and cons; each is valuable in different aspects, i.e. precision, recall, accuracy, fallout and execution time. To validate our results we used 10-fold cross validation. For the experimental setup, we used Support Vector Machines (SVM), Naïve Bayes (NB), kNN and a convolutional network (deep learning) to analyse the performance of the proposed algorithm. The results are shown in Table 6 and Table 7. Precision, recall, accuracy and fallout are given as percentages and time in seconds. The time taken by each learning technique depends strongly on the data size, data types, number of columns, computer hardware, memory, background processes, number of cores, etc., and may vary with a change in any of these attributes. Hence, the times in Table 6 and Table 7 serve only to deduce the time trend of each learning model; the trends are given in increasing order below, and a reduction of the training time to a marginal level is noticed for the normalized content.

Learning approach       Precision  Recall  Accuracy  Fallout  Time (s)
NB                      51.58      50.4    50.4      92.8     422
SVM                     62.29      62      62        45.6     428
kNN                     52.01      52      52        49.6     421
Convolutional network   54.96      54      54        24       751
Table 6: Un-normalized Macaronic Sentiment Analysis

Learning approach       Precision  Recall  Accuracy  Fallout  Time (s)
NB                      69.46      68.62   68.63     28.79    18
SVM                     71.72      71.69   71.75     20.21    21
kNN                     65.41      65.31   65.47     40.21    29
Convolutional network   58.03      54.56   55.00     13.04    440
Table 7: Proposed normalized Macaronic Sentiment Analysis

Approach   Precision  Recall  Accuracy  Fallout
Baseline   55.21      54.6    54.6      53
Proposed   66.15      65.04   65.21     25.56
Table 8: Comparison with Existing Sentiment Analysis

Figure 2: Comparison of the different learning approaches based on (a) precision, (b) recall, (c) accuracy and (d) fallout
Figure 3: Comparison of the execution time of the machine learning algorithms based on the proposed scheme for normalized and unnormalized data
Figure 4: Comparison of the proposed technique with the state of the art

Order for unnormalized content: kNN < Naïve Bayes < SVM < convolutional network
Order for normalized content: Naïve Bayes < SVM < kNN < convolutional network

The results shown in Figure 2 clearly demonstrate the performance of the proposed system with the various learning approaches and highlight it in several respects. The proposed scheme outperforms the existing system using Naïve Bayes, raising the precision and recall values by 17.88% and 18.22%. The results of the other classifiers, i.e. SVM, kNN and the convolutional network, also show significant improvements in performance: using SVM and kNN, improvements of more than 9% and 13% are noticed in the precision and recall values with the proposed approach. It is also noticeable that there is a trade-off between the various performance aspects.
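For reference, the short Python helper below shows how the measures reported in Tables 6-8 relate to the confusion-matrix counts of Table 5, following equations (2)-(6) above; the function name and the example counts are ours, and the F-measure is written in the equal-weight harmonic-mean form described in the text.

```python
# Small helper computing the measures defined in equations (2)-(6) above from
# the confusion-matrix counts of Table 5 (tp, fp, fn, tn).

def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp)                                   # eq. (2)
    recall = tp / (tp + fn)                                      # eq. (3)
    f_measure = 2 * precision * recall / (precision + recall)    # eq. (4), equal weights
    accuracy = (tp + tn) / (tp + tn + fn + fp)                   # eq. (5)
    fallout = fp / (tn + fp)                                     # eq. (6)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy, "fallout": fallout}

# Illustration with made-up counts (not the paper's data):
print(metrics(tp=62, fp=18, fn=20, tn=60))
```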
The effectiveness of system is shown by convolu- tional network but it takes more time than other classifiers for macaronic sentiment analysis. Through observing Figure 3, we have found that the pro- posed algorithm also greatly affect the time taken by each model. It is noticeable that the normalized content reduces the training time in every learning approach. By observing Table 8, results are compared to the baseline approaches; the average value of precision, recall is increased while the fallout is decreased significantly. Figure 4 shows that how effective the proposed approach is as compared to the state of the art sentiment analysis for macaronic language. 5 Conclusion Over the web where huge user generated content has al- ready existed; the need for sensible computation for de- cision support system is rising. The multilingual online content has led to the increase of web debris, which is in- evitably and negatively affecting information retrieval and extraction for decision support systems. To analyse this negative trend and propose possible solution, this paper fo- cused on the evolution of sentiment analysis based on bag- of-words for macaronic reviews. Different supervised ma- chine learning approaches gave different cross validated re- sults. This is done by borrowing the concept of training and testing from the field of machine learning. After successful evaluation, it is concluded that there is a trade-off between various performance measures. In this study, we have in- vestigated the need to normalize the macaronic text. We have also performed sentiment analysis over the macaronic language text consists English and Hindi. We have found an average of about 11% rise in precision and recall va- lues. It is also noticeable that training time is also reduced significantly using proposed approach. We further plan to develop a system to handle with more than two languages as a macaronic text for sentiment analysis. We also plan to apply our proposed algorithm for entity extraction. References [1] Arora, P. and B. Kaur (2015). ”Sentiment Analysis of Political Reviews in Punjabi Language.” International Journal of Computer Applications 126(14). [2] Bakliwal, A., P. Arora, et al. (2012). Hindi sub- jective lexicon: A lexical resource for hindi polarity classification. Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC). [3] Banea, C., R. Mihalcea, et al. (2008). Multilingual subjectivity analysis using machine translation. Pro- ceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Com- putational Linguistics. [4] Bunt, H., V. Petukhova, et al. (2016). Dialogue Act Annotation with the ISO 24617-2 Standard. Multimo- dal Interaction with W3C Standards, Springer: 109- 135. [5] Danet, B. and S. C. Herring (2003). ”Introduction: The multilingual internet.” Journal of Computer Me- diated Communication 9(1): 0-0. [6] Das, A. and S. Bandyopadhyay (2009). Theme de- tection an exploration of opinion subjectivity. Af- fective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Con- ference on, IEEE. [7] Das, A. and S. Bandyopadhyay (2010). Opinion- Polarity Identification in Bengali. International Con- Prediction of Sentiment from Macaronic Reviews Informatica 42 (2018) 127–136 135 ference on Computer Processing of Oriental Langua- ges. [8] Das, A. and S. Bandyopadhyay (2010). ”SentiWord- Net for Bangla.” Knowledge Sharing Event-4: Task 2. [9] Das, A. and S. 
Bandyopadhyay (2010). ”SentiWord- Net for Indian languages.” Asian Federation for Na- tural Language Processing, China: 56-63. [10] Das, D. and S. Bandyopadhyay (2010). Labeling emotion in Bengali blog corpusa fine grained tagging at sentence level. Proceedings of the 8th Workshop on Asian Language Resources. [11] Denecke, K. (2008). Using sentiwordnet for multilin- gual sentiment analysis. Data Engineering Workshop, 2008. ICDEW 2008. IEEE 24th International Confe- rence on, IEEE. [12] Derkacz, J., M. a. Leszczuk, et al. Definition of Re- quirements for Accessing Multilingual Information and Opinions. Multimedia and Network Information Systems, Springer: 273-282. [13] Joshi, A., A. Balamurali, et al. (2010). ”A fall-back strategy for sentiment analysis in hindi: a case study.” Proceedings of the 8th ICON. [14] Kaur, A. and V. Gupta (2014). ”Proposed Algorithm of Sentiment Analysis for Punjabi Text.” Journal of Emerging Technologies in Web Intelligence 6(2): 180-183. [15] Kothapalli, M., E. Sharifahmadian, et al. ”Data Mi- ning of Social Media for Analysis of Product Re- view.” International Journal of Computer Applicati- ons 156(12). [16] Nguyen, D. Q., D. Q. Nguyen, et al. ”A robust transformation-based learning approach using ripple down rules for part-of-speech tagging.” AI Commu- nications 29(3): 409-422. [17] Pandey, P. and S. Govilkar (2015). ”A Framework for Sentiment Analysis in Hindi using HSWN.” Interna- tional Journal of Computer Applications 119(19). [18] Renduchintala, A., R. Knowles, et al. ”Creating in- teractive macaronic interfaces for language learning.” ACL 2016: 133. [19] Seih, Y.-T., S. Beier, et al. ”Development and Exami- nation of the Linguistic Category Model in a Compu- terized Text Analysis Method.” Journal of Language and Social Psychology: 0261927X16657855. [20] Sharma, R. and P. Bhattacharyya ”A Sentiment Ana- lyzer for Hindi Using Hindi Senti Lexicon.” [21] Sharma, R., S. Nigam, et al. (2014). ”Polarity de- tection movie reviews in hindi language.” arXiv pre- print arXiv:1409.3942. Algorithm 1: Input: Document D where D = d1, d2, d3, ....., dk ’k’ is the total no. of documents ’m’ is the total number of words in a document Ls = language of segment Lb = Base language (English) Output: Ws(weightedSentiStrengthofeachdocuemnt) Begin for k = 1 to k do Tokenization for i = 1 to m do Encoding based on UTF8 end for {Similar category segments are combined} Segmentation based on encoding. Language detection for each segment. if Ls = Lb then goto S1 else Apply translation end if S1 Assemble segments Compute SentiStrength end for 136 Informatica 42 (2018) 127–136 S. Kaur et al. Algorithm 2: Input: Document D where D = d1, d2, d3, ....., dk ’k’ is the total no. 
of documents ’m’ is the total number of words in a document Output: Ws(weightedSentiStrengthofeachdocuemnt) {Token list(TL) = (t1,t2,.....,tn)} {Word List(WL)= (w1,w2,w3,......wx)} {’q’ is the total number of tokens in a document} {P = list of ’positive category words} {N = list of ’negative category words} {Pw = weight assigned to a token belongs to positive category as per SentiWordnet} {Nw = weight assigned to a token belongs to negative category as per SentiWordnet} Begin for d = 1 to k do Tokenization Stemming Normalization for k = 1 to m do if (tk ∈W ) ⋂ (tk ∈ P ) then wpos(k) = Pw(tk) else if (tk ∈W ) ⋂ (tk ∈ N ) then wneg(k) = Nw(tk) else if (tk ∈W ) ⋂ (tk 6∈ N) ⋂ (tk 6∈ N ) then wneu(k) = 0 end if end for Ws = m∑ j=1 wpos(j)± m∑ j=1 wneg(j) (7) end for Informatica 42 (2018) 137–143 137 Application of Distributed Web Crawlers in Information Management System Bo Wen School of Computer Science and Technology, Huaibei Normal University, Huaibei, 235000, China E-mail: bowen1983@yeah.net Technical paper Keywords: web crawlers, Hadoop, information management system Received: February 7, 2018 In the Internet era, cloud data and big data constantly develop, and Internet has become the main platform for enterprises and individuals to release information. As a result, a large amount of data generates, and people spend more energy on finding information that they want. The desire for accurately acquiring information needed becomes increasingly stronger. This study designed a distributed web crawlers system based on Hadoop and used it to do large-scale information management. The simulation experiment verified that the system could operate stably in information management system, which offers a reference for the application of distributed web crawlers in information management systems. Povzetek: Razvit je distribuirani spletni preiskovalnik na osnovi Hadoopa za upravljanje informacij. 1 Introduction Internet rapidly develops in the 21st century, accompanied by data volume increasing in exponential form on Internet. With the diversification of information, the management of information has become more and more difficult. How to timely and accurately search information through search engine and manage the information becomes crucial. Requirements on relevant technologies are also being improved constantly. With the development of computer, information management system has emerged. More efficient and simple information management systems are being developed. Qin [1] designed a SG-UAP development tool based on Eclipse development environment which was applicable to Windows operation system; a database platform was developed based on Oracle to provide Tomcat network information management service; the system managed network information through service-oriented architecture. Gupta et al. [2] analyzed management information service and proposed to manage network information with management information service and found that management information system could optimize network information and accurately collect and manage data. Zhao et al. [3] established a topic-focused crawler based scientific research information system to improve the information management level. Web crawlers can capture webpage information from the network and extracted and stored the key information to solve the urgent problem of information acquisition. But information collection based on web crawlers is facing with difficulties such as information repetition and existence of dynamic pages. 
Therefore distributed technologies are needed to solve the problems and enhance crawling efficiency. In the study of Su et al. [4], single-thread and multi-thread web crawlers were implanted into a distributed system to capture and store data with diversified and personalized operations, which enhanced the capturing speed. In the study of Zhang et al. [5], Hadoop based distributed web crawler system was optimized. The parameters were optimized through analysis on factors influencing crawling efficiency. Distributed web crawlers have great advantages in collecting and storing information; hence it can help establish a practical and high-efficient information management system. In this study, web crawlers were analyzed, and then a Hadoop based distributed web crawlers system was designed to manage network information. The simulation experiment suggested that the system could effectively collect and store network information and enhance the performance of single-node web crawlers, which provides a reference for the application of distributed network crawlers in information management system. 2 System related technologies 2.1 Web crawlers Distributed web crawler is a program which crawls Web resources on the Internet according to some rules and provides the obtained network information to search engine. Therefore it is an indispensable part of search engine [6]. To achieve a high crawling ability, a web crawler should have the five characteristics [7]. (1) High performance A large amount of information involves mass Uniform Resource Locator (URL). Distributed web 138 Informatica 42 (2018) 137–143 B. Wen crawlers should timely and effectively capture useful information in webpage. The more the information in unit time is, the better the performance of web crawlers is. (2) Expandability Expandability should be improved to achieve a high performance of web crawlers. Expandability means that the whole crawler system will not be affect when the current web crawlers are being updated or doing other operations. Better expandability is needed in efficient crawling of information in different sites as the programming language and code editor are different in different websites. (3) Robustness. Facing with a large number of servers, web crawlers may encounter emergencies such as crawler trap in the process of work. Reasonable processing of these conditions is a character of an excellent web crawler. Only when web crawlers have favorable robustness can they get back to work after interruption. Moreover the previously crawled content should be restored after setup. (4) Friendliness: Web crawlers should protect relevant information of websites as per robots protocols. The crawling scope of web crawlers should be defined. Moreover additional burdens to websites should be avoided when web crawlers capture information. (5) Updatability. Web crawlers should be able to perceive the alternation of websites and timely acquire new website content to replace the old one. Information management system needs to collect and store diversified data on the Internet. With the explosive growth of data, the traditional stand-alone web crawlers have gradually been not as good as before. Hence stronger and more comprehensive information management systems are needed. 2.2 Hadoop Hadoop, a basic framework of distributed system developed by Apache Software Foundation, is composed of many ordinary, low-cost single computers. It can rapidly and flexibly process mass data. It has the following advantages. (1) High reliability. 
Its ability in processing data is highly reliable. (2) Strong fault tolerance. Hadoop can automatically replicate many copes and allocate failed tasks. (3) High scalability. Hadoop can process and allocate data between hundreds of servers and easily expand to thousands of nodes. (4) High efficiency. Hadoop can efficiently transfer data between different nodes. (5) Low cost. Compared to other commercial data warehouse, Hadoop is open-source. Hadoop has two core parts. One is distributed file system, i.e. Hadoop Distributed File System (HDFS). HDFS is capable of storing large files, for example, files in a size of more than 100 TB. HDFS is also featured by strong fault tolerance. It can operate on low-cost hardware. The other core is MapReduce computational model which can concurrently calculate mass data and have favorable extensibility and fault tolerance. It has a huge advantage in data processing. 2.3 Application values of distributed web crawlers in information management system In view of the advantages of distributed system and the properties of web crawlers, distributed web crawler is feasible. Distributed web crawler is composed of web crawler and distributed system, which is capable of fulfill different tasks by making the best use of information on the Internet. It effectively makes up the defects of the stand-alone web crawler. It can capture more websites and collect and store more data. Therefore Hadoop based distributed web crawler has high application values in information management system. 3 Design of information management system 3.1 Design of distributed web crawler system architecture 3.1.1 Design of physical architecture To satisfy the aforementioned characteristics, cost of PC server should be saved, and moreover Hadoop based distributed architecture should be extensible [8]. The system should allocate the crawled page data on different nodes using its ability of distributed storage capacity. Moreover a strong fault tolerance was needed to set the number of data copies and reallocate the failed tasks on other nodes. The distributed architecture could enhance the overall performance of crawlers to the large extent. The physical architecture of web crawlers in this study included Hadoop cluster and Storm cluster [9]. To reduce the pressure on Hadoop cluster during operation, separate deployment was adopted. Crawler tasks were divided into multiple tasks and operated on multiple Slave nodes based on the distributed storage and calculation abilities of distributed architecture. The collected data were stored in clusters. Then the data generated when crawlers crawled and analyzed webpage were written into Kafka, and Storm was used to calculate index results in real time. The physical architecture is shown in Figure 1. 3.1.2 Design of logic structure The logic structure of distributed web crawlers is shown in Figure 2. It included batch processing part and real- time calculation part. Batch processing was mainly realized based on Hadoop platform, and it was responsible for achieving crawling tasks and storing data in Hbase. Real-time calculation was realized based on Storm platform, and it was responsible for calculating relevant data generated in system operation and storing the results in iRedis. Application of Distributed Web Crawlers … Informatica 42 (2018) 137–143 139 3.2 Modules of distributed web crawlers The system module of the distributed web crawler was composed of the following parts. 
(1) URL splitting and injection module: firstly read the URL path of user, then obtain URL list, and split it into several parts and allocate to TaskTracker. (2) Webpage access module: acquire webpage according to URL links and download and save it locally. (3) Webpage analysis module: analyze the captured webpage in aspects of structure and content. (4) Link filtering module: filter the acquired URL and eliminate ineffective and repeated links. (5) Data storage module: Save data in the database of HDFS. 3.3 Design of key technology 3.3.1 URL standardization URL is a kind of character which can show information resources on www, and information resource has one and only has one URL [10]. URL standardization meant standardizing URL and transforming a URL to a qualified equivalent URL. Its transformation was realized by replacing /xx/../ with /, /../ with /, /./ with / and xx//yy with /. 3.3.2 Allocation of crawler tasks Before crawling based on the distributed architecture, tasks were allocated to the distributed clusters [11]. When some node failed, tasks should be reallocated. For Hadoop cluster with n nodes, a URL was selected from URL set, Topn URLs were divided into N sets, and the sets were allocated to different nodes of Hadoop set to do crawling tasks. If some node failed, Master would allocate the failed task to other nodes without affecting the crawling speed of the current nodes. The network pressure of websites should be considered in the process of crawling. 3.3.3 Balance politeness Crawling of the same website should follow the principle of balance politeness [12]. URLs were ranked according to score rules; then URL was taken out one by one from the URL set and allocated to N subsets; the number of URLs in one set and the number of URLs from the same Host in one set should be limited. In this way, the pressure of webpage could be reduced when web crawlers were crawling information. 3.3.4 Webpage revisit Network usually has favorable dynamic property. When web crawlers fulfilled a crawling task, then the webpage might change. Therefore web crawlers should update website content at a certain time interval and the content which needed to be crawled. 3.3.5 Data deduplication There are many same data on the network. Therefore network data should be processed by deduplication. (1) Webpage content was separated into words, i.e. Figure 1: Design of the physical architecture of distributed web crawlers. Figure 2: Design of logic architecture of distributed web crawlers. 140 Informatica 42 (2018) 137–143 B. Wen characteristic vectors. The occurrence frequency of every word in documents was taken as weight. (2) The Hash value of every characteristic vector was calculated [13], and moreover those vectors were processed by weighed accumulation. (3) The result larger than 0 was denoted as 1 and otherwise as 0, and the final results were Simhash signature values [14]. (4) The similarity of data was determined according to different Simhash signature values. 4 Concrete implementation of distributed crawler URL initial module was combined with parallel circulation model to analyze the procedures of URL insertion, URL list generation, web crawling and data update in the data crawling experiment of distributed crawler. A module circulation formed from link update in link library, crawl list generation, URL crawling execution, key information analysis to link update in link library. The module composition and flow circulation can benefit the concrete implementation of distributed crawler. 
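Before turning to the implementation flow, the deduplication steps (1)-(4) of Section 3.3.5 deserve a concrete illustration: they describe a standard Simhash scheme. Below is a small, self-contained Python sketch of that scheme under our own assumptions (a 64-bit signature and MD5 as the per-word hash, neither of which is specified in the paper); near-duplicate pages are those whose signatures differ in only a few bits.

```python
# Hedged sketch of the Simhash signature computation summarized in steps
# (1)-(4) of Section 3.3.5: term frequencies as weights, per-term hashes,
# weighted accumulation, sign thresholding, and Hamming-distance comparison.

import hashlib
from collections import Counter

BITS = 64

def simhash(text: str) -> int:
    weights = Counter(text.lower().split())          # (1) words + frequencies as weights
    acc = [0] * BITS
    for word, w in weights.items():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16) & ((1 << BITS) - 1)  # (2)
        for i in range(BITS):
            acc[i] += w if (h >> i) & 1 else -w      # (2) weighted accumulation
    sig = 0
    for i in range(BITS):
        if acc[i] > 0:                               # (3) > 0 -> bit 1, otherwise 0
            sig |= 1 << i
    return sig

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")                     # (4) similarity via differing bits

a = simhash("distributed web crawler for information management")
b = simhash("distributed web crawlers for information management")
print(hamming(a, b))                                 # small distance => near-duplicates
```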
The concrete implementation flow is shown in Figure 3. 5 System test and results analysis Before testing of the network management, the test environment should be adjusted. VMware Workstation was installed in window host. Then Hadoop cluster was established in the virtual machine. In the development of the system, Java codes written by Eclipse IDE were installed in the host. Hadoop Eclipse plug-in units was installed and connected to Hadoop clusters. Data were processed using Hadoop Distributed File System (HDFS) and MapReduce calculation model. 5.1 Functional test 5.1.1 Test content and scheme Functional test included the following content. (1) Webpage crawling test In the initial URL set, 0, 1 and 4 URL link seeds were added. Then three conditions, i.e. effective crawling, partially effective crawling and ineffective crawling, were considered. After crawling, whether the downloaded target data satisfied standards or not were checked. (2) Filter test on URL link The URL link log sheet which was subject to be crawled was checked to determine whether link standardization and deduplication operations should be performed or not. (3) Webpage data extraction test Whether the analysis module was corrected and could effectively extract data on webpage and store the data in relevant documents or not was checked. (4) Test on webpage category classification The system classified webpage into different categories and checked whether the classification was corrected or not. Figure 3: The implementation flow of the distributed crawler. Application of Distributed Web Crawlers … Informatica 42 (2018) 137–143 141 5.1.2 Test results The system could do webpage crawling according to the prescribed initial URL set and added the crawled URLs into URLs which were subject to be crawled. Standardization and deduplication were performed before addition. The extracted data were stored in relevant documents. Moreover it could rapidly classify webpage. 5.2 Performance test 5.2.1 Test content and scheme (1) Test on collection scale After a period of webpage crawling, the size of the collected webpage data was calculated to measure the collection scale. (2) Test on operation speed During crawling, the size of the collected web data, i.e. x, was calculated after n hours of movement. The computational formula for crawling speed v was v = x/n. 5.2.2 Test results Table 1 shows the data collection speed of the clusters based on four nodes. The operation of the system included webpage downloading, web analysis, extraction of record information on the network and classification of web text. This study could basically satisfy the requirements according to the data in Table 1. 5.3 Test on expandability 5.3.1 Test content and scheme Test on expandability: the number of nodes on Hadoop platform was changed. Then test was performed when the number of coordinated nodes was 1, 2 and 3 to determine whether the operation was normal and what were the effects on the performance of the system. 5.3.2 Test results Figure 4 demonstrated the data collection and analysis of the system when the time and number of nodes were different. It could be noted that the operation speed was the highest when there was only one node; the operation speed had remarkable improvement with the increase of nodes, but the speed of each node had no significant changes. Through test, it was concluded that the expandability could satisfy the predetermined requirements. 
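As a quick worked example of the operation-speed formula v = x/n from Section 5.2.1, the snippet below applies it to the per-segment sizes and operation times reported in Table 1 (reproduced just below); the segment names are copied as printed with line-break whitespace removed, and the derived speeds are ours.

```python
# Worked example of the crawling-speed formula v = x / n from Section 5.2.1,
# applied to the segment sizes and operation times reported in Table 1.

segments = {                      # size in MB, operation time in hours (Table 1)
    "Segment20171002093417": (39.21, 0.6),
    "Segment2017100213672":  (82.61, 1.1),
    "Segment2017100360349":  (180.44, 2.4),
    "Segment2017100413725":  (305.14, 4.5),
    "Segment2017100547436":  (400.62, 5.7),
}

for name, (size_mb, hours) in segments.items():
    print(f"{name}: v = {size_mb} / {hours} = {size_mb / hours:.1f} MB/h")
# The first segment gives roughly 65 MB/h; the later, larger segments stay in
# the range of about 68-75 MB/h, consistent with a stable collection speed.
```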
Number  Segment name           Size (MB)  Operation time (h)
1       Segment20171002093417  39.21      0.6
2       Segment2017100213672   82.61      1.1
3       Segment2017100360349   180.44     2.4
4       Segment2017100413725   305.14     4.5
5       Segment2017100547436   400.62     5.7
Table 1: The operation results of the information management system.

Figure 4: The test results of system expandability.

The Internet plays an increasingly important role in people's production and life and has become the main source of information. Distributed web crawlers can grab the key data from mass data, which greatly helps information acquisition. Bal et al. [15] put forward an intelligent distributed crawler based on a client-server architecture, in which the load is managed by the server: every time crawlers are loaded, URLs are dynamically allocated to spread the load among them, which enhances the information-crawling ability. Kumar et al. [16] developed a distributed semantic web crawler and successfully crawled and utilized HTML compiled with OWL/RDF and the semantic web. In an information management system, distributed web crawlers can give full play to their advantages, because they can effectively crawl the information needed from mass data and efficiently collect and manage it. The application of distributed web crawlers can therefore achieve efficient and safe management of information and has high practical value.

6 Conclusion

In conclusion, the distributed web crawler based information management system could precisely satisfy the requirements of web crawling, with high performance and expandability. Moreover, it can effectively reduce repeated visits to and downloads of resources, improving the efficiency of information searching, and it can reduce the time and money spent on resource acquisition because of its low cost. Therefore, it can be applied for extracting network information. This work provides a reference for the application of distributed web crawler based information management systems in data extraction.

7 References

[1] Qin Y., Xuan H., Zhang B. (2016). Intelligent Management System of Power Network Information Collection Under Big Data Storage. 13th Global Congress on Manufacturing and Management (GCMM 2016), MATEC Web of Conferences, Zhengzhou.
[2] Gupta C. L. P., Sharma S., Tripathi S. (2015). Importance of Management Information System in Electronic-Information Era. East Carolina University, 1(2).
[3] Zhao Q. A. (2016). Research and Implementation of Scientific Research Information Management System Based on the Topic Web Crawler. Anhui: Anhui University, pp. 1-46.
[4] Su L., Wang F. (2017). Web crawler model of fetching data speedily based on Hadoop distributed system. IEEE International Conference on Software Engineering and Service Science, Beijing, pp. 927-931.
[5] Zhang X., Xian M. (2015). Optimization of Distributed Crawler under Hadoop. International Conference on Engineering Technology and Application, 22:02029.
[6] Qu X., Hu R., Zhou L., Wang L., Zhu Q. (2015). Expert Achievements Model for Scientific and Technological Based on Association Mining. International Symposium on Distributed Computing and Applications for Business Engineering and Science, Guiyang, pp. 272-275.
[7] Bahrami M., Singhal M., Zhuang Z. (2015). A cloud-based web crawler architecture. International Conference on Intelligence in Next Generation Networks, Paris, pp. 2016-223.
[8] Pu Q. (2016). The Design and Implementation of a High-Efficiency Distributed Web Crawler.
[9] Kim M., Han S., Cui Y., Lee H., Cho H., Hwang S. (2014). CloudDMSS: robust Hadoop-based multimedia streaming service architecture for a cloud computing environment. Cluster Computing, 17(3): 605-628.
[10] Bhagyashree E., Tanuja K. (2015). Phishing URL Detection: A Machine Learning and Web Mining-based Approach. International Journal of Computer Applications, 123.
[11] Santhosh K. D. K., Kamath M. (2014). Design and implementation of competent web crawler and indexer using web services. International Conference on Advanced Communication Control and Computing Technologies, Ramanathapuram, pp. 1672-1677.
[12] Dąbek OSB T. M. (2012). Strengthen the faith as the task of the Pastors of the Church. The Apostles Peter and Paul as examples for the Pastors of the Church for proclaim and, Scriptura Sacra, (16): 19.
[13] Dong C. (2015). Asymmetric color image encryption scheme using discrete-time map and hash value. Optik - International Journal for Light and Electron Optics, 126(20): 2571-2575.
[14] Qiao Y., Yun X., Zhang Y. (2016). Fast Reused Function Retrieval Method Based on Simhash and Inverted Index. Trustcom/BigDataSE/ISPA, Tianjin, pp. 937-944.
[15] Bal S. K., Geetha G. (2016). Smart distributed web crawler. International Conference on Information Communication and Embedded Systems, Chennai, pp. 1-5.
[16] Kumar N., Singh M. (2016). Framework for Distributed Semantic Web Crawler. International Conference on Computational Intelligence and Communication Networks, Jabalpur, pp. 1403-1407.

JOŽEF STEFAN INSTITUTE

Jožef Stefan (1835-1893) was one of the most prominent physicists of the 19th century. Born to Slovene parents, he obtained his Ph.D. at Vienna University, where he was later Director of the Physics Institute, Vice-President of the Vienna Academy of Sciences and a member of several scientific institutions in Europe. Stefan explored many areas in hydrodynamics, optics, acoustics, electricity, magnetism and the kinetic theory of gases. Among other things, he originated the law that the total radiation from a black body is proportional to the 4th power of its absolute temperature, known as the Stefan–Boltzmann law.

The Jožef Stefan Institute (JSI) is the leading independent scientific research institution in Slovenia, covering a broad spectrum of fundamental and applied research in the fields of physics, chemistry and biochemistry, electronics and information science, nuclear science technology, energy research and environmental science.

The Jožef Stefan Institute (JSI) is a research organisation for pure and applied research in the natural sciences and technology. Both are closely interconnected in research departments composed of different task teams. Emphasis in basic research is given to the development and education of young scientists, while applied research and development serve for the transfer of advanced knowledge, contributing to the development of the national economy and society in general.

At present the Institute, with a total of about 900 staff, has 700 researchers, about 250 of whom are postgraduates, around 500 of whom have doctorates (Ph.D.), and around 200 of whom have permanent professorships or temporary teaching assignments at the Universities.
In view of its activities and status, the JSI plays the role of a national institute, complementing the role of the universities and bridging the gap between basic science and applications.

Research at the JSI includes the following major fields: physics; chemistry; electronics, informatics and computer sciences; biochemistry; ecology; reactor technology; applied mathematics. Most of the activities are more or less closely connected to information sciences, in particular computer sciences, artificial intelligence, language and speech technologies, computer-aided design, computer architectures, biocybernetics and robotics, computer automation and control, professional electronics, digital communications and networks, and applied mathematics.

The Institute is located in Ljubljana, the capital of the independent state of Slovenia (or S♥nia). The capital today is considered a crossroad between East, West and Mediterranean Europe, offering excellent productive capabilities and solid business opportunities, with strong international connections. Ljubljana is connected to important centers such as Prague, Budapest, Vienna, Zagreb, Milan, Rome, Monaco, Nice, Bern and Munich, all within a radius of 600 km.

From the Jožef Stefan Institute, the Technology park "Ljubljana" has been proposed as part of the national strategy for technological development to foster synergies between research and industry, to promote joint ventures between university bodies, research institutes and innovative industry, to act as an incubator for high-tech initiatives and to accelerate the development cycle of innovative products.

Part of the Institute was reorganized into several high-tech units supported by and connected within the Technology park at the Jožef Stefan Institute, established as the beginning of a regional Technology park "Ljubljana". The project was developed at a particularly historical moment, characterized by the process of state reorganisation, privatisation and private initiative. The national Technology Park is a shareholding company hosting an independent venture-capital institution.

The promoters and operational entities of the project are the Republic of Slovenia, Ministry of Higher Education, Science and Technology and the Jožef Stefan Institute. The framework of the operation also includes the University of Ljubljana, the National Institute of Chemistry, the Institute for Electronics and Vacuum Technology and the Institute for Materials and Construction Research among others. In addition, the project is supported by the Ministry of the Economy, the National Chamber of Economy and the City of Ljubljana.

Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Tel.: +386 1 4773 900, Fax: +386 1 251 93 85
WWW: http://www.ijs.si
E-mail: matjaz.gams@ijs.si
Public relations: Polona Strnad

INFORMATICA
AN INTERNATIONAL JOURNAL OF COMPUTING AND INFORMATICS

INVITATION, COOPERATION

Submissions and Refereeing

Please register as an author and submit a manuscript at: http://www.informatica.si. At least two referees outside the author's country will examine it, and they are invited to make as many remarks as possible from typing errors to global philosophical disagreements. The chosen editor will send the author the obtained reviews. If the paper is accepted, the editor will also send an email to the managing editor. The executive board will inform the author that the paper has been accepted, and the author will send the paper to the managing editor.
The paper will be pu- blished within one year of receipt of email with the text in Infor- matica MS Word format or Informatica LATEX format and figures in .eps format. Style and examples of papers can be obtained from http://www.informatica.si. Opinions, news, calls for conferences, calls for papers, etc. should be sent directly to the managing edi- tor. SUBSCRIPTION Please, complete the order form and send it to Dr. Drago Tor- kar, Informatica, Institut Jožef Stefan, Jamova 39, 1000 Ljubljana, Slovenia. E-mail: drago.torkar@ijs.si Since 1977, Informatica has been a major Slovenian scientific journal of computing and informatics, including telecommunica- tions, automation and other related areas. In its 16th year (more than twentyfour years ago) it became truly international, although it still remains connected to Central Europe. The basic aim of In- formatica is to impose intellectual values (science, engineering) in a distributed organisation. Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive com- munity; scientific and educational as well as technical, commer- cial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accep- ted by at least two referees outside the author’s country. In addi- tion, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and infor- mation industry are presented through commercial publications as well as through independent evaluations. Editing and refereeing are distributed. Each editor can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author’s country. If new referees are appointed, their names will appear in the Refereeing Board. Informatica web edition is free of charge and accessible at http://www.informatica.si. Informatica print edition is free of charge for major scienti- fic, educational and governmental institutions. Others should subscribe. Informatica WWW: http://www.informatica.si/ Referees from 2008 on: A. Abraham, S. Abraham, R. Accornero, A. Adhikari, R. Ahmad, G. Alvarez, N. Anciaux, R. Arora, I. Awan, J. Azimi, C. Badica, Z. Balogh, S. Banerjee, G. Barbier, A. Baruzzo, B. Batagelj, T. Beaubouef, N. Beaulieu, M. ter Beek, P. Bellavista, K. Bilal, S. Bishop, J. Bodlaj, M. Bohanec, D. Bolme, Z. Bonikowski, B. Bokovi, M. Botta, P. Brazdil, J. Brest, J. Brichau, A. Brodnik, D. Brown, I. Bruha, M. Bruynooghe, W. Buntine, D.D. Burdescu, J. Buys, X. Cai, Y. Cai, J.C. Cano, T. Cao, J.-V. Capella-Hernndez, N. Carver, M. Cavazza, R. Ceylan, A. Chebotko, I. Chekalov, J. Chen, L.-M. Cheng, G. Chiola, Y.-C. Chiou, I. Chorbev, S.R. Choudhary, S.S.M. Chow, K.R. Chowdhury, V. Christlein, W. Chu, L. Chung, M. Ciglari, J.-N. Colin, V. Cortellessa, J. Cui, P. Cui, Z. Cui, D. Cutting, A. Cuzzocrea, V. Cvjetkovic, J. Cypryjanski, L. ehovin, D. erepnalkoski, I. osi, G. Daniele, G. Danoy, M. Dash, S. Datt, A. Datta, M.-Y. Day, F. Debili, C.J. Debono, J. Dedi, P. Degano, A. Dekdouk, H. Demirel, B. Demoen, S. Dendamrongvit, T. Deng, A. Derezinska, J. Dezert, G. Dias, I. Dimitrovski, S. Dobriek, Q. Dou, J. Doumen, E. Dovgan, B. Dragovich, D. Drajic, O. Drbohlav, M. Drole, J. Dujmovi, O. Ebers, J. Eder, S. Elaluf-Calderwood, E. 
Engstrm, U. riza Erturk, A. Farago, C. Fei, L. Feng, Y.X. Feng, B. Filipi, I. Fister, I. Fister Jr., D. Fier, A. Flores, V.A. Fomichov, S. Forli, A. Freitas, J. Fridrich, S. Friedman, C. Fu, X. Fu, T. Fujimoto, G. Fung, S. Gabrielli, D. Galindo, A. Gambarara, M. Gams, M. Ganzha, J. Garbajosa, R. Gennari, G. Georgeson, N. Gligori, S. Goel, G.H. Gonnet, D.S. Goodsell, S. Gordillo, J. Gore, M. Grar, M. Grgurovi, D. Grosse, Z.-H. Guan, D. Gubiani, M. Guid, C. Guo, B. Gupta, M. Gusev, M. Hahsler, Z. Haiping, A. Hameed, C. Hamzaebi, Q.-L. Han, H. Hanping, T. Hrder, J.N. Hatzopoulos, S. Hazelhurst, K. Hempstalk, J.M.G. Hidalgo, J. Hodgson, M. Holbl, M.P. Hong, G. Howells, M. Hu, J. Hyvrinen, D. Ienco, B. Ionescu, R. Irfan, N. Jaisankar, D. Jakobovi, K. Jassem, I. Jawhar, Y. Jia, T. Jin, I. Jureta, . Jurii, S. K, S. Kalajdziski, Y. Kalantidis, B. Kalua, D. Kanellopoulos, R. Kapoor, D. Karapetyan, A. Kassler, D.S. Katz, A. Kaveh, S.U. Khan, M. Khattak, V. Khomenko, E.S. Khorasani, I. Kitanovski, D. Kocev, J. Kocijan, J. Kollr, A. Kontostathis, P. Koroec, A. Koschmider, D. Koir, J. Kova, A. Krajnc, M. Krevs, J. Krogstie, P. Krsek, M. Kubat, M. Kukar, A. Kulis, A.P.S. Kumar, H. Kwanicka, W.K. Lai, C.-S. Laih, K.-Y. Lam, N. Landwehr, J. Lanir, A. Lavrov, M. Layouni, G. Leban, A. Lee, Y.-C. Lee, U. Legat, A. Leonardis, G. Li, G.-Z. Li, J. Li, X. Li, X. Li, Y. Li, Y. Li, S. Lian, L. Liao, C. Lim, J.-C. Lin, H. Liu, J. Liu, P. Liu, X. Liu, X. Liu, F. Logist, S. Loskovska, H. Lu, Z. Lu, X. Luo, M. Lutrek, I.V. Lyustig, S.A. Madani, M. Mahoney, S.U.R. Malik, Y. Marinakis, D. Marini, J. Marques-Silva, A. Martin, D. Marwede, M. Matijaevi, T. Matsui, L. McMillan, A. McPherson, A. McPherson, Z. Meng, M.C. Mihaescu, V. Milea, N. Min-Allah, E. Minisci, V. Mii, A.-H. Mogos, P. Mohapatra, D.D. Monica, A. Montanari, A. Moroni, J. Mosegaard, M. Mokon, L. de M. Mourelle, H. Moustafa, M. Moina, M. Mrak, Y. Mu, J. Mula, D. Nagamalai, M. Di Natale, A. Navarra, P. Navrat, N. Nedjah, R. Nejabati, W. Ng, Z. Ni, E.S. Nielsen, O. Nouali, F. Novak, B. Novikov, P. Nurmi, D. Obrul, B. Oliboni, X. Pan, M. Panur, W. Pang, G. Papa, M. Paprzycki, M. Parali, B.-K. Park, P. Patel, T.B. Pedersen, Z. Peng, R.G. Pensa, J. Per, D. Petcu, B. Petelin, M. Petkovek, D. Pevec, M. Piulin, R. Piltaver, E. Pirogova, V. Podpean, M. Polo, V. Pomponiu, E. Popescu, D. Poshyvanyk, B. Potonik, R.J. Povinelli, S.R.M. Prasanna, K. Pripui, G. Puppis, H. Qian, Y. Qian, L. Qiao, C. Qin, J. Que, J.-J. Quisquater, C. Rafe, S. Rahimi, V. Rajkovi, D. Rakovi, J. Ramaekers, J. Ramon, R. Ravnik, Y. Reddy, W. Reimche, H. Rezankova, D. Rispoli, B. Ristevski, B. Robi, J.A. Rodriguez-Aguilar, P. Rohatgi, W. Rossak, I. Roanc, J. Rupnik, S.B. Sadkhan, K. Saeed, M. Saeki, K.S.M. Sahari, C. Sakharwade, E. Sakkopoulos, P. Sala, M.H. Samadzadeh, J.S. Sandhu, P. Scaglioso, V. Schau, W. Schempp, J. Seberry, A. Senanayake, M. Senobari, T.C. Seong, S. Shamala, c. shi, Z. Shi, L. Shiguo, N. Shilov, Z.-E.H. Slimane, F. Smith, H. Sneed, P. Sokolowski, T. Song, A. Soppera, A. Sorniotti, M. Stajdohar, L. Stanescu, D. Strnad, X. Sun, L. ajn, R. enkek, M.R. ikonja, J. ilc, I. krjanc, T. tajner, B. ter, V. truc, H. Takizawa, C. Talcott, N. Tomasev, D. Torkar, S. Torrente, M. Trampu, C. Tranoris, K. Trojacanec, M. Tschierschke, F. De Turck, J. Twycross, N. Tziritas, W. Vanhoof, P. Vateekul, L.A. Vese, A. Visconti, B. Vlaovi, V. Vojisavljevi, M. Vozalis, P. Vraar, V. Vrani, C.-H. Wang, H. Wang, H. Wang, H. Wang, S. Wang, X.-F. Wang, X. Wang, Y. Wang, A. Wasilewska, S. Wenzel, V. 
Wickramasinghe, J. Wong, S. Wrobel, K. Wrona, B. Wu, L. Xiang, Y. Xiang, D. Xiao, F. Xie, L. Xie, Z. Xing, H. Yang, X. Yang, N.Y. Yen, C. Yong-Sheng, J.J. You, G. Yu, X. Zabulis, A. Zainal, A. Zamuda, M. Zand, Z. Zhang, Z. Zhao, D. Zheng, J. Zheng, X. Zheng, Z.-H. Zhou, F. Zhuang, A. Zimmermann, M.J. Zuo, B. Zupan, M. Zuqiang, B. alik, J. ika

Informatica
An International Journal of Computing and Informatics

Web edition of Informatica may be accessed at: http://www.informatica.si.

Subscription Information

Informatica (ISSN 0350-5596) is published four times a year in Spring, Summer, Autumn, and Winter (4 issues per year) by the Slovene Society Informatika, Litostrojska cesta 54, 1000 Ljubljana, Slovenia.
The subscription rate for 2018 (Volume 42) is
– 60 EUR for institutions,
– 30 EUR for individuals, and
– 15 EUR for students.
Claims for missing issues will be honored free of charge within six months after the publication date of the issue.
Typesetting: Borut Žnidar. Printing: ABO grafika d.o.o., Ob železnici 16, 1000 Ljubljana.
Orders may be placed by email (drago.torkar@ijs.si), telephone (+386 1 477 3900) or fax (+386 1 251 93 85). The payment should be made to our bank account no.: 02083-0013014662 at NLB d.d., 1520 Ljubljana, Trg republike 2, Slovenija, IBAN no.: SI56020830013014662, SWIFT Code: LJBASI2X.

Informatica is published by Slovene Society Informatika (president Niko Schlamberger) in cooperation with the following societies (and contact persons):
Slovene Society for Pattern Recognition (Simon Dobrišek)
Slovenian Artificial Intelligence Society (Mitja Luštrek)
Cognitive Science Society (Olga Markič)
Slovenian Society of Mathematicians, Physicists and Astronomers (Marej Brešar)
Automatic Control Society of Slovenia (Nenad Muškinja)
Slovenian Association of Technical and Natural Sciences / Engineering Academy of Slovenia (Stane Pejovnik)
ACM Slovenia (Matjaž Gams)

Informatica is financially supported by the Slovenian research agency from the Call for co-financing of scientific periodical publications.

Informatica is surveyed by: ACM Digital Library, Citeseer, COBISS, Compendex, Computer & Information Systems Abstracts, Computer Database, Computer Science Index, Current Mathematical Publications, DBLP Computer Science Bibliography, Directory of Open Access Journals, InfoTrac OneFile, Inspec, Linguistic and Language Behaviour Abstracts, Mathematical Reviews, MatSciNet, MatSci on SilverPlatter, Scopus, Zentralblatt Math

Volume 42 Number 1 March 2018 ISSN 0350-5596

Introduction to the Special Anniversary Issue on AI in Slovenia (M. Luštrek, J. Žabkar, M. Grobelnik), p. 1
Early Machine Learning Research in Ljubljana (I. Kononenko), p. 3
AlphaZero - What's Missing? (I. Bratko), p. 7
Explanation of Prediction Models with ExplainPrediction (M. Robnik-Šikonja), p. 13
Semantic Annotation of Documents Based on Wikipedia Concepts (J. Brank, G. Leban, M. Grobelnik), p. 23
Continuous Blood Pressure Estimation from PPG Signal (G. Slapničar, M. Luštrek, M. Marinko), p. 33
Quantitative Score for Assessing the Quality of Feature Rankings (I. Slavkov, M. Petković, D. Kocev, S. Džeroski), p. 43
Arguments in Interactive Machine Learning (M. Možina), p. 53
An Inter-Domain Study for Arousal Recognition from Physiological Signals (M. Gjoreski, M. Luštrek, M. Gams), p. 61
Computational Creativity Conceptualisation Grounded on ICCC Papers (S. Pollak, G.A. Wiggins, M. Žnidaršič, N. Lavrač), p. 69
Towards Creative Software Blending: Computational Infrastructure and Use Cases (M. Martinc, M. Žnidaršič, N. Lavrač, S. Pollak), p. 77
Graph Theoretical View on Text Understanding (J. Zupan), p. 85

End of Special Issue / Start of normal papers

A Segmentation-Recognition Approach with a Fuzzy-Artificial Immune System for Unconstrained Handwritten Connected Digits (H. Merabti, B. Farou, H. Seridi), p. 95
Load Balancing for Virtual Worlds by Splitting and Merging Spatial Regions (U. Farooq, J. Glauert, K. Zia), p. 107
Microscopic Evaluation of Extended Car-following Model in Multi-lane Roads (H. Lazar, K. Rhoulami, M.D. Rahmani), p. 117
Prediction of Sentiment from Macaronic Reviews (S. Kaur, R. Mohana), p. 127
Application of Distributed Web Crawlers in Information Management System (B. Wen), p. 137

Informatica 42 (2018) Number 1, pp. 1–143