Volume 39 Number 1 March 2015 ISSN 0350-5596

Informatica
An International Journal of Computing and Informatics

Editorial Boards

Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations.

Editing and refereeing are distributed. Each editor from the Editorial Board can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the list of referees. Each paper bears the name of the editor who appointed the referees. Each editor can propose new members for the Editorial Board or referees. Editors and referees inactive for a longer period can be automatically replaced. Changes in the Editorial Board are confirmed by the Executive Editors.

The coordination necessary is made through the Executive Editors who examine the reviews, sort the accepted articles and maintain appropriate international distribution. The Executive Board is appointed by the Society Informatika. Informatica is partially supported by the Slovenian Ministry of Higher Education, Science and Technology.

Each author is guaranteed to receive the reviews of his article. When accepted, publication in Informatica is guaranteed in less than one year after the Executive Editors receive the corrected version of the article.

Executive Editor - Editor in Chief
Anton P. Železnikar
Volaričeva 8, Ljubljana, Slovenia
s51em@lea.hamradio.si
http://lea.hamradio.si/~s51em/

Executive Associate Editor - Managing Editor
Matjaž Gams, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 251 93 85
matjaz.gams@ijs.si
http://dis.ijs.si/mezi/matjaz.html

Executive Associate Editor - Deputy Managing Editor
Mitja Luštrek, Jožef Stefan Institute
mitja.lustrek@ijs.si

Executive Associate Editor - Technical Editor
Drago Torkar, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 251 93 85
drago.torkar@ijs.si

Contact Associate Editors
Europe, Africa: Matjaž Gams
N. and S. America: Shahram Rahimi
Asia, Australia: Ling Feng
Overview papers: Maria Ganzha

Editorial Board
Juan Carlos Augusto (Argentina), Vladimir Batagelj (Slovenia), Francesco Bergadano (Italy), Marco Botta (Italy), Pavel Brazdil (Portugal), Andrej Brodnik (Slovenia), Ivan Bruha (Canada), Wray Buntine (Finland), Zhihua Cui (China), Hubert L. Dreyfus (USA), Jozo Dujmovic (USA), Johann Eder (Austria), Ling Feng (China), Vladimir A. Fomichov (Russia), Maria Ganzha (Poland), Sumit Goyal (India), Marjan Gušev (Macedonia),
N. Jaisankar (India), Dariusz Jacek Jakóbczak (Poland), Dimitris Kanellopoulos (Greece), Samee Ullah Khan (USA), Hiroaki Kitano (Japan), Igor Kononenko (Slovenia), Miroslav Kubat (USA), Ante Lauc (Croatia), Jadran Lenarčič (Slovenia), Shiguo Lian (China), Suzana Loskovska (Macedonia), Ramon L. de Mantaras (Spain), Natividad Martinez Madrid (Germany), Sanda Martinčić-Ipšić (Croatia), Angelo Montanari (Italy), Pavol Návrat (Slovakia), Jerzy R. Nawrocki (Poland), Nadia Nedjah (Brasil), Franc Novak (Slovenia), Marcin Paprzycki (USA/Poland), Ivana Podnar Žarko (Croatia), Karl H. Pribram (USA), Luc De Raedt (Belgium), Shahram Rahimi (USA), Dejan Raković (Serbia), Jean Ramaekers (Belgium), Wilhelm Rossak (Germany), Ivan Rozman (Slovenia), Sugata Sanyal (India), Walter Schempp (Germany), Johannes Schwinn (Germany), Zhongzhi Shi (China), Oliviero Stock (Italy), Robert Trappl (Austria), Terry Winograd (USA), Stefan Wrobel (Germany), Konrad Wrona (France), Xindong Wu (USA), Yudong Zhang (China), Rushan Ziatdinov (Russia & Turkey)

Discriminating Between Closely Related Languages on Twitter

Nikola Ljubešić
University of Zagreb, Faculty of Humanities and Social Sciences, Ivana Lučića 3
E-mail: nikola.ljubesic@ffzg.hr, http://nlp.ffzg.hr/

Denis Kranjčić
University of Zagreb, Faculty of Humanities and Social Sciences, Ivana Lučića 3
E-mail: dkranjcic@ffzg.hr

Keywords: microblogging, language identification, closely related languages

Received: October 25, 2014

In this paper we tackle the problem of discriminating Twitter users by the language they tweet in, taking into account the very similar South Slavic languages: Bosnian, Croatian, Montenegrin and Serbian. We apply a supervised machine learning approach by annotating a subset of 500 users from an existing Twitter collection with the language the users primarily tweet in. We show that by using a simple bag-of-words model, univariate feature selection, the 320 strongest features and a standard classifier, we reach a user classification accuracy of ~98%. Annotating the whole 63,160-user-strong Twitter collection with the best performing classifier and visualizing it on a map via tweet geo-information, we produce a Twitter language map which clearly depicts the robustness of the classifier.

Povzetek: In this paper we investigate the problem of discriminating users of the social network Twitter by the language they tweet in, considering the very similar South Slavic languages: Bosnian, Croatian, Serbian and Montenegrin. We apply a supervised machine learning approach in which every user from an existing dataset of 500 users is annotated with the language he or she predominantly tweets in. We show that with a simple bag-of-words model, univariate feature selection, the 320 most important features and a standard classifier we reach a per-user classification accuracy of ~98%. Using the best classifier to annotate our entire collection of 63,160 users and visualizing the result on a map via the geographic information attached to tweets, we produce a Twitter language map which clearly shows the robustness of the developed approach.

1 Introduction
The problem of language identification, which was considered a solved task for some time now, has recently gained in popularity among researchers through the identification of more complex subproblems, such as discriminating between language varieties (very similar languages and dialects), identifying languages in multi-language documents, code-switching (alternating between two or more languages) and identifying language in non-standard user-generated content which often tends to be very short (such as tweets).

In this paper we address the first and the last problem, namely discriminating between very similar languages in Twitter posts, with the relaxation that we do not identify language on the tweet level, but on the user level.

The four languages we focus on here, namely Bosnian, Croatian, Montenegrin and Serbian, belong to the South Slavic group of languages and are all very similar to each other. All the languages, except Montenegrin, use the same phonemic inventory, and they are all based on the write-as-you-speak principle. Croatian is slightly different in this respect, because it does not transcribe foreign words and proper nouns, as the others do. Moreover, due to the fairly recent standardization of Montenegrin, its additional phonemes are extremely rarely represented in writing, especially in informal usage. The Serbian language is the only one where both Ekavian and Ijekavian pronunciation and writing are standardized and widely used, while all the other languages use Ijekavian variants as a standard. The languages share a great deal of the same vocabulary, and some words differ only in a single phoneme / grapheme, because of phonological, morphological and etymological circumstances. There are some grammatical differences regarding phonology, morphology and syntax, but they are arguably scarce and they barely influence mutual intelligibility. The distinction between the four languages is based on the grounds of establishing a national identity, rather than on prominently different linguistic features.

2 Related work

One of the first studies incorporating similar languages in a language identification setting was that of [9] who, among others, discriminate between Spanish and Catalan with an accuracy of up to 99% by using second-order character-level Markov models. In [11] a semi-supervised model is presented to distinguish between Indonesian and Malay by using frequency and rank of character trigrams derived from the most frequent words in each language, lists of exclusive words, and the format of numbers. [3] use a bag-of-words approach to classify Chinese texts from the mainland and Taiwan with results of up to 92% accuracy. [13] propose a log-likelihood estimation method along with Laplace smoothing to identify two varieties of Portuguese (Brazilian and European), obtaining 99.5% accuracy.

In the first attempt at discriminating between the two most distant out of the four languages of interest, namely Croatian and Serbian, [6] have shown that by using a second-order character Markov chain and a list of forbidden words, the two languages can be differentiated with a very high accuracy of ~99%. As a follow-up, [12] add Bosnian to the language list, showing that most off-the-shelf tools are in no way capable of solving this problem, while their approach based on identifying blacklisted words reaches an accuracy of ~97%. [5] have worked with the same three languages as a subtask of producing web corpora of these languages.
They have managed to outperform the best-performing classifier from [12] by training unigram language models on the entire content of the collected web corpora, decreasing the error related to the Croatian-Serbian language pair to a fourth. Recently, as part of the DSL (Discriminating between Similar Languages) 2014 shared task of discriminating between six groups of similar languages on the sentence level [14], language group A consisted of Bosnian, Croatian and Serbian, and the best result in the group yielded 93.6% accuracy, which is not directly comparable to the aforementioned results because classification was performed on the sentence level, and not on the document level as in previous research.

Language identification on Twitter data has become a popular problem in recent years. [1] use language identification to create language-specific Twitter collections of low-resource languages such as Nepali, Urdu, and Ukrainian. [2] use character n-gram distance with additional microblogging characteristics such as the language profile of a user, the content of an attached hyperlink, the language profile of mentioned users and the language profile of a hashtag. [7] review a wide range of off-the-shelf tools for Twitter language identification, and achieve their best results with a simple voting over three systems.

To the best of our knowledge, there have been only two attempts at discriminating between languages of a high level of similarity on Twitter data. The first attempt dealt with Croatian and Serbian [4], where word unigram language models built from Croatian and Serbian web corpora were used in an attempt to divide users from a Twitter collection according to the two languages. An analysis of the annotation results showed that there is substantial Twitter activity of Bosnian and Montenegrin speakers in the collection and that the collected data cannot be described with a two-language classification schema, but rather with a four-class schema that includes the remaining two languages. The second attempt focused on Spanish varieties spoken in five different countries [8] using geo-information as a gold standard, obtaining best results with a voting meta-classifier approach that combines the results of four single classifiers.

Our work builds on top of the research presented in [4] by defining a four-language classification schema, inside which Montenegrin, a language that gained official status in 2007, is present for the first time. Additionally, this is the first focused attempt at discriminating between these languages on Twitter data.

3 Dataset

The dataset we run our experiments on consists of tweets produced by 500 randomly picked users from the Twitter collection obtained with the TweetCaT tool described in [4]. This Twitter collection currently consists of 63,160 users and 42,744,935 tweets. The collection procedure is still running, which opens the possibility of the collection becoming a monitor corpus of user-generated content of the four languages.

For annotating the dataset there was only one annotator available. Annotating a portion of the dataset by multiple annotators and inspecting inter-annotator agreement is considered future work. Having other languages in the dataset (mostly English) was tolerated as long as more than 50% of the text was written in the annotated language. Among the 500 users there were 10 users who did not comply with any of the four classes and were therefore removed from the dataset.
One user, tweeting in Bosnian, had most of the tweets in English, one user was tweeting in Macedonian, and 8 users were tweeting in Serbian but used the Cyrillic script. The users tweeting in Serbian and using the Cyrillic script were discarded from the dataset because we wanted to focus on discriminating between the four languages based on content and not on the script used.

The result of the annotation procedure is summarized in the distribution of users according to their language, presented in Table 1.

language           instance #   percentage
Bosnian (bs)               45        9.18%
Croatian (hr)              42        8.57%
Montenegrin (me)           25        5.10%
Serbian (sr)              378       77.14%

Table 1: Distribution of users by the language they tweet in.

We can observe that Serbian makes up 77% of the dataset. There is a similar amount, around 9%, of Bosnian and Croatian data, while Montenegrin is least represented with around 5% of the data. These results are somewhat surprising because there is a much higher number of speakers of Croatian (around 5 million) than of Bosnian (around 2 million) or Montenegrin (below 1 million). Additionally, Croatia has the highest GDP of all the countries and one would expect the adoption rate of such new technology to be higher, not lower, than in the remaining countries.

Because we plan to discriminate between the four languages on the user level, we are naturally interested in the amount of textual data we have at our disposal for each instance, i.e. user. Figure 1 represents the amount of data available per user, measured in the number of words. The plotted distribution has its minimum at 561 words and its maximum at 29,246 words, whereas the arithmetic mean lies at 6,607 words. This distribution shows that we have quite a large amount of textual data available for the majority of users. We will inspect the impact of the data available for predicting the language in Section 4.5.

Figure 1: Distribution of dataset instances given the size in number of words.

4 Experiments

We perform data preprocessing, feature extraction and data formatting using simple Python scripts. All the machine learning experiments are carried out with scikit-learn [10]. Our evaluation metric, if not stated otherwise, is accuracy calculated via stratified 10-fold cross-validation.

We extract our features only from the text of the tweets. Using geolocation and user metadata (such as name, bio and location) is considered future work.

We experiment with the following preprocessing procedures:
- no preprocessing
- filtering out mentions, hashtags and URLs (making the data more representative of user-generated content in general)
- dediacritizing the text (thereby lowering data sparsity)

and the following sets of features:
- words
- character 3-grams
- character 6-grams
- words and character 6-grams

Because no significant difference in accuracy was observed when using either different preprocessing procedures or sets of features (except for a slight drop when using character 3-grams), in the remainder of this section we present the results obtained by filtering out mentions, hashtags and URLs and using words as features. By skipping dediacritization we keep the preprocessing level to a minimum, while by using words as features we ensure easy understandability of procedures such as feature selection. Finally, by removing textual specificities of Twitter like mentions and hashtags we ensure maximum applicability of the resulting models to other user-generated content besides tweets.
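The tweet-filtering step described above can be sketched in a few lines of Python; the regular expressions and helper names below are illustrative assumptions rather than the authors' actual scripts.

```python
import re

# Illustrative patterns for Twitter-specific tokens; an approximation of
# the filtering described in the text, not the authors' original code.
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+|www\.\S+")

def filter_tweet(text):
    """Remove mentions, hashtags and URLs from a single tweet."""
    for pattern in (MENTION, HASHTAG, URL):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

def user_document(tweets):
    """Concatenate all tweets of one user into a single classification instance."""
    return " ".join(filter_tweet(t) for t in tweets)
```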
4.1 Initial experiment

The aim of the initial experiment was to get a feeling for the problem at hand by experimenting with various classifiers and features. We experiment with traditional classifiers, such as multinomial Naive Bayes (MultinomialNB), K-nearest neighbors (KNeighbors), decision tree (DecisionTree) and linear support-vector machine (LinearSVM). We use the linear SVM because the number of features is much greater than the number of instances. For each classifier we use the default hyperparameter values, except for the linear SVM classifier, for which we tune the C hyperparameter for highest accuracy.

classifier       accuracy ± stdev
DecisionTree     0.896 ± 0.026
KNeighbors       0.772 ± 0.040
LinearSVM        0.884 ± 0.034
MultinomialNB    0.806 ± 0.029

Table 2: Accuracy with standard deviation obtained with different classifiers using all words as features.

In the results presented in Table 2 we can observe that LinearSVM and DecisionTree produce the highest accuracy. The significantly lower accuracy of the MultinomialNB classifier, which normally gives state-of-the-art results on bag-of-words models, but which has no inherent feature selection, leads us to hypothesize that our results could improve if we applied explicit feature selection to our data. This follows our intuition that similar languages can be discriminated through a limited number of features, i.e. words, and not through the whole lexicon, which is normally shared to a great extent among such closely related languages.

Figure 2: Classification accuracy as a function of the number of most informative features used.

4.2 Feature selection

Although there are stronger feature selection algorithms, we opt for a simple univariate feature selection algorithm which calculates a p-value for each feature with respect to the response variable through the ANOVA F-test. It then simply returns the user-specified number (or percentage) of features with the lowest p-values. We use this simple feature selection method because we assume independence of our features, i.e. tokens or character n-grams, which is a reasonable assumption for language identification.

During these experiments we calculate accuracy via 10-fold cross-validation, performing feature selection each time on the 90% of data used for model estimation. The results of experimenting with up to 20% (cca. 42,000) of the strongest word features are shown in Figure 2. Here we can observe a series of properties of the classifiers used. First of all, LinearSVM and DecisionTree, having implicit feature selection / weighting, operate similarly on the whole scale of the number of features available, but still show better performance when using only a few hundred strongest features. On the other hand, MultinomialNB and KNeighbors show significantly better performance when they have to deal with the strongest features only. The best results are obtained with the MultinomialNB classifier at 320 features, reaching an accuracy of 97.97%. A numerical comparison of the best results obtained with the four classifiers is given in Table 3.

classifier      # of feats   acc ± stdev
DecisionTree           100   0.927 ± 0.019
KNeighbors             100   0.911 ± 0.041
LinearSVM              320   0.961 ± 0.025
MultinomialNB          320   0.980 ± 0.016

Table 3: Maximum accuracy obtained with each classifier with the number of strongest features used.
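As a rough sketch of the setup behind Tables 2 and 3, the best-performing configuration (bag-of-words, 320 features selected with the ANOVA F-test, multinomial Naive Bayes, stratified 10-fold cross-validation) can be expressed as a scikit-learn pipeline. The function names and the data loading are assumptions; placing the selector inside the pipeline ensures it is re-fitted on the training folds only, as stated in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def build_pipeline(k=320):
    """Bag-of-words -> univariate ANOVA feature selection -> MultinomialNB."""
    return Pipeline([
        ("bow", CountVectorizer()),               # word features
        ("select", SelectKBest(f_classif, k=k)),  # keep the k strongest features
        ("clf", MultinomialNB()),                 # best-performing classifier
    ])

def evaluate(documents, labels, k=320):
    """Stratified 10-fold cross-validation accuracy, with feature selection
    fitted on the 90% training portion of each fold."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(build_pipeline(k), documents, labels,
                             cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()

# documents: list of per-user concatenated tweet texts (placeholder name);
# labels: corresponding bs/hr/me/sr codes.
# acc, std = evaluate(documents, labels)
```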
We present more detailed results obtained with the best-performing MultinomialNB classifier, trained on 320 features, in Table 4. It contains the confusion matrix of the classification process along with the precision, recall and F1 obtained on each class. We can observe that the classification process is most successful on Serbian and Croatian, while the worst results are obtained on Montenegrin, which gets confused with both Bosnian and Serbian.

      bs   hr   me   sr      P     R     F1
bs    42    0    3    0   0.95  0.93  0.94
hr     1   41    0    0   0.98  0.98  0.98
me     0    0   23    2   0.82  0.92  0.87
sr     1    1    2  374   0.99  0.99  0.99

Table 4: Confusion matrix and precision, recall and F1 per class for the best performing classifier.

4.3 Evaluation on the test set

To perform a final test of our best performing classifier we produced an independent test set consisting of 101 annotated users. The MultinomialNB classifier, trained on all 490 users available from our development set, with the 320 strongest features identified on that dataset, produces an accuracy of 99.0%, having just one Bosnian user identified as Montenegrin. This experiment emphasizes the robustness of our classifier.

4.4 Analysis of the selected features

Using words as features, and not character 6-grams that perform equally well, enables us to easily interpret our final model. In Table 5 we present a systematization of the 320 features selected on the whole development set by language and the linguistic type of the feature.

                      bs      hr     me     sr
yat reflex          40.8    42.2   44.4   11.3
phonological         6.0    29.3    9.0    2.3
lexical              6.0    48.6    9.8    2.6
orthography          7.5     7.5    2.0    0.0
toponym, cultural    5.0    19.0   26.0    0.0
sum                 65.3   146.8   91.3   16.2

Table 5: Feature type distribution across languages.

The features are divided into five categories across the four languages: yat reflex, phonological differences, lexical differences, orthography, and toponym or cultural differences. Each feature contributes one point to the table: if a feature is present in more than one language, this point is divided among the languages, and if a feature belongs to more than one feature type, the point is divided among those feature types.

Almost half of the features belong to the "reflex of yat" category, which is least informative because most of the Ijekavian features are equally present in Croatian, Bosnian and Montenegrin. The exceptions are the words that are distinct both by the "reflex of yat" category and the lexical category, and a few examples of the Montenegrin-specific reflex of yat in words such as "nijesam" or "de" (which also belongs to the "phonological differences" category).

The "phonological differences" category contains a lot of words present only in Croatian, such as "itko", "kava" or "vecer" ("iko", "kafa" and "vece" in the other three languages). On the other hand, words that differ in only one phoneme and are not specific to Croatian are often spread among the remaining three languages. The category of lexical differences is similar in this respect: more than 70 percent of these features are Croatian. This can be explained by the fact that lexical purism is much more pronounced in Croatian than in the other three languages, which can be observed in the names of the months and some everyday words, such as "obitelj" (family), "glazba" (music), "izbornik" (menu) etc.
In place of these words, Bosnian, Montenegrin and Serbian use words of evident foreign origin: "familija" (family), "muzika" (music), "meni" (menu) etc. The category of "orthography" predominantly contains infinitive verb forms without the final "i" letter, which appear in the future tense in Croatian orthography and which are also allowed in Bosnian. Finally, there is the category containing toponyms and culturally specific items, such as country and city acronyms, names for residents, currency, TV stations and even some public figures.

Although the features in the table are divided according to their real distribution among the languages, their distribution in the model sometimes differs. The reason for this is a significant difference between Croatian users and their language on the one side, and the rest on the other. Whereas Bosnian, Montenegrin and Serbian users are predominantly young people who use Twitter for chatting and sharing their everyday experiences, Croatian users are frequently news portals, shops, musicians, politicians etc. Consequently, Croatian language on Twitter is marked by a much more formal register compared to the casual register of the other languages in our model.

4.5 Impact of the amount of data available for prediction

Having a test set at our disposal opened the possibility of performing one additional experiment on the impact of the amount of data available for our language predictions. In our test set the user with the least amount of textual material contains 864 words. We therefore evaluated the classifier trained on the whole development set by using only the first N words from each user in the test set, with N ranging from 10 to 850. In Figure 3 we present the obtained results, representing each language with an F1 curve and all the languages with a micro-F1 curve.

Figure 3: F1 per specific language as a function of the length of the testing instances.

We can observe that the results peak and stabilize once we have 470 words at our disposal for our prediction. This is an interesting result, showing that the large amount of data we have available for each user is actually not necessary. On the other hand, the results show quite clearly that discriminating between these languages on the level of each tweet would, at least with the presented classifier, be impossible given that the average tweet size is 10 words. Having significantly more training data available for each language could make tweet-level classification possible since for Serbian, which covers 77% of the training data, we already obtain a decent F1 of 0.88 on 10 words.
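A minimal sketch of the truncation experiment above, assuming a pipeline already fitted on the development set and the 101-user test set held in the placeholder names test_docs / test_labels; micro-averaged F1 stands in for the micro-F1 curve of Figure 3.

```python
from sklearn.metrics import f1_score

def truncate(document, n_words):
    """Keep only the first n_words tokens of a user's concatenated tweets."""
    return " ".join(document.split()[:n_words])

def f1_by_length(pipeline, test_docs, test_labels, lengths=range(10, 860, 10)):
    """Micro-F1 of a fitted pipeline when only the first N words of each
    test user are available, for every N in `lengths`."""
    results = {}
    for n in lengths:
        predictions = pipeline.predict([truncate(d, n) for d in test_docs])
        results[n] = f1_score(test_labels, predictions, average="micro")
    return results
```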
5 Corpus annotation and visualization

To be able to distribute separate Twitter collections of the four languages, we annotated each of the 63,160 users from our Twitter collection of Bosnian, Croatian, Montenegrin and Serbian. The annotation was performed with the MultinomialNB classifier trained on both the 490 development and the 101 testing instances, again selecting the 320 strongest features on that dataset.

Once we had our collection annotated, we decided to present the result of our language discriminator on a map. We presented each of the 576,786 tweets having geolocation available as a point on the map, encoding the predicted language of the author of the tweet with a corresponding color. We call the map presented in Figure 4 a "Twitter language map".

Figure 4: The Twitter language map.

The figure shows the area of the four countries in which the four languages have official status. We can observe that the tweets follow the country borders quite consistently, which is an additional argument that our classifier works properly. From the plot we can also confirm that Twitter is much more popular and widespread in Serbia than in the remaining countries.

Mixing of the four languages occurs, as one would expect, mostly in big cities, primarily Belgrade, the capital of Serbia. There we can observe a significant number of Montenegrin-speaking Twitter users. To perform a sanity check regarding the correctness of these data, we manually inspected ten random users classified as Montenegrin and tweeting in the wider Belgrade area. The inspection showed that all ten users actually tweet in Montenegrin. Overall, we can observe that Croatia and Serbia have a higher number of foreign-tweeting users, which is easily explained by the well-known migrations from Bosnia to both Croatia and Serbia, and from Montenegro primarily to Serbia.

6 Conclusion

In this paper we have presented a straightforward approach to discriminating between closely related languages of Twitter users by training a classifier on a dataset of 490 manually labeled users. By using the bag-of-words model, the 320 strongest features according to univariate feature selection and the multinomial Naive Bayes classifier, we obtained a very good accuracy of 97.97% on the development set and 99.0% on the test set. The best results were obtained on Croatian and Serbian, while most errors occurred when identifying the Montenegrin language. Analyzing the impact of the data available for classification showed that classification accuracy stabilizes at ~470 words per user, which still does not enable us to use this classifier on the tweet level. Finally, we annotated the whole 63k-user-strong collection of tweets and presented the collection on a map we call the "Twitter language map". The map shows that the language used on Twitter quite precisely follows the country borders, large cities being an exception to this rule.

Future work includes adding more information to our model besides words from tweets. The strongest candidates are the content to which users link and user meta-information such as username, location and bio. Using the geolocation information from tweets when available is surely a good source of information as well. Additionally, using the geolocation information as our response variable, i.e. redefining our task as predicting the location of a Twitter user, is also a very interesting line of research. This surely increases the complexity of the task, but opens the door towards identifying dialects and sociolects.

Acknowledgement

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

References

[1] Bergsma, S., McNamee, P., Bagdouri, M., Fink, C., and Wilson, T. (2012). Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, LSM '12, pages 65-74, Stroudsburg, PA, USA. Association for Computational Linguistics.

[2] Carter, S., Weerkamp, W., and Tsagkias, M. (2013). Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195-215.

[3] Huang, C.-R. and Lee, L.-H. (2008).
Contrastive approach towards text source classification based on top-bag-of-word similarity. In PACLIC, pages 404-410. De La Salle University (DLSU), Manila, Philippines.

[4] Ljubešić, N., Fišer, D., and Erjavec, T. (2014). TweetCaT: a Tool for Building Twitter Corpora of Smaller Languages. In Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and Piperidis, S., editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

[5] Ljubešić, N. and Klubička, F. (2014). {bs,hr,sr}WaC - web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 29-35, Gothenburg, Sweden. Association for Computational Linguistics.

[6] Ljubešić, N., Mikelić, N., and Boras, D. (2007). Language identification: How to distinguish similar languages. In Lužar-Stifter, V. and Hljuz Dobrić, V., editors, Proceedings of the 29th International Conference on Information Technology Interfaces, pages 541-546, Zagreb. SRCE University Computing Centre.

[7] Lui, M. and Baldwin, T. (2014). Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 17-25. Association for Computational Linguistics.

[8] Maier, W. and Gómez-Rodríguez, C. (2014). Language variety identification in Spanish tweets. In Proceedings of the EMNLP'2014 Workshop on Language Technology for Closely Related Languages and Language Variants, pages 25-35, Doha, Qatar. Association for Computational Linguistics.

[9] Padró, L. and Padró, M. (2004). Comparing methods for language identification. Procesamiento del Lenguaje Natural, 33:155-162.

[10] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

[11] Ranaivo-Malançon, B. (2006). Automatic Identification of Close Languages - Case study: Malay and Indonesian. ECTI Transactions on Computer and Information Technology, 2(2):126-134.

[12] Tiedemann, J. and Ljubešić, N. (2012). Efficient discrimination between closely related languages. In Proceedings of COLING 2012, pages 2619-2634, Mumbai, India.

[13] Zampieri, M. and Gebre, B. G. (2012). Automatic identification of language varieties: The case of Portuguese. In Proceedings of KONVENS 2012 - The 11th Conference on Natural Language Processing.

[14] Zampieri, M., Tan, L., Ljubešić, N., and Tiedemann, J. (2014). A Report on the DSL Shared Task 2014. In Proceedings of the VarDial Workshop.

Call Routing Based on a Combination of the Construction-Integration Model and Latent Semantic Analysis: A Full System

Guillermo Jorge-Botana (1,3), Ricardo Olmos (2,3) and Alejandro Barroso (3)
1 Universidad Nacional de Educación a Distancia (UNED), Calle Juan del Rosal, 10, Madrid, Spain
2 Universidad Autónoma de Madrid, Campus de Cantoblanco, 28049 Madrid, Spain
3 Semantia Lab, c/ Bravo Murillo n° 38, 28015 Madrid, Spain
E-mail: gdejorge@psi.uned.es

Keywords: call routing, call steering, natural language, latent semantic analysis, construction-integration framework, psycholinguistically motivated algorithms

Received: June 24, 2014

This study stems from a previous article [1] in which we found that a psycholinguistically motivated mechanism based on the Construction-Integration (C-I) model [2, 3] could be used for call classifiers in systems based on Latent Semantic Analysis (LSA). In it we showed that with this model more robust results were obtained when categorizing call transcriptions. However, this method was not tested in a context of calls in audio format, where a voice recognition application would be involved. The most direct implication of a voice recognition application is that the text to be categorized may be impoverished and subject to noise. This impoverishment normally translates into deletions and insertions which are semantically arbitrary but phonetically similar. The aim of this study is to describe the behavior of a complete system, with calls in audio format that are transcribed by a voice recognition application using a Stochastic Language Model (SLM), and then categorized with an LSA model. This process optionally includes a mechanism based on the C-I model. In this study different parameters were analyzed to assess the automatic router's rate of correct choices. The results show that once again the model based on C-I is significantly better, but the benefits are more remarkable when the utterances are long. The paper describes the system and examines both the full results and the interactions in some scenarios. The economy of resources and flexibility of the system are also discussed.

Povzetek: A speech recognition system is developed with the aim of successfully finding services or entities.

1 Introduction

Interactive Voice Response (henceforth, IVR) systems enjoy extremely widespread use today to provide customer service (Self Service). One of their drawbacks is that many of them consist of menus that limit user options to a few items, meaning that many intentions are not described within these items. This happens, for example, when a person wants something but believes that it is not represented in the menu. This constitutes a findability problem, since the category that represents what the person wants is not found. One of the alternatives that have been used is to employ spontaneous speech recognition techniques and subsequently categorize spontaneous utterances¹ into subject areas. These techniques are usually called "call routing" or "call steering" [4].

¹ Utterance: This is how we habitually refer to the phrase returned by the ASR module. We will also refer to the audio transcription of each call as an utterance.
In them, rather than choosing between several items on a menu, the person simply hears an input such as "what would you like to do?" and responds spontaneously in natural language. What the person says is recognized with the help of an Automatic Speech Recognition (ASR) module, and once converted to text it is categorized and sent to a route where the user's needs are catered for. This is beneficial in terms of busy channels (the call will only be in one of the switchboard channels for a few seconds), and also beneficial in terms of convenience, as callers will be spared the effort associated with linking the representation of what they want to the representation of the categories expressed by the menu items, something which is not always intuitive [5, 6].

Generically speaking, the process of Call-Routing described above involves two steps. The first step, voice recognition, involves phonetic models of the language in which the service is offered, as well as a Stochastic Language Model (henceforth SLM), which is a formal representation of the probabilities of a word occurring if others have occurred beforehand. These SLMs are habitually 3-gram or 4-gram models, frequently calculated using Maximum Likelihood Estimation and corrected by means of some smoothing method (for example Good-Turing). In the end, the ASR module formulates its recognition hypotheses, taking the phonetic models of the language into account, as well as the probabilities provided by the SLM. The relative importance of each element (phonetic model vs. SLM) - in other words, the way of weighting the phonetic model over the SLM or vice-versa - can often be configured in voice recognition devices. Finally, from the scores yielded by the phonetic and SLM models, the ASR module generates a number of recognition hypotheses ordered by the confidence level which it assigns to each of them (a list of possible utterances).

What the user has said is already in a text format, and now it is time to assign it to a destination - this will be the second step. To this end, classification techniques will be used to determine when a text (in this case the user request) belongs to one category or another (in this case to one destination or another). Many classification techniques have been used to this end, with various results, depending also on the text sample and parameters [7]. Some examples are probabilistic models such as Maximum Entropy, artificial neural networks, and high-dimensional spaces, the category to which Latent Semantic Analysis (from now on LSA) would belong. All of the techniques mentioned have been extensively tested, but the current challenge is to achieve a more optimal and less biased representation of the words employed in the utterances, doing so even when data are not labeled or are retrieved and reanalyzed from samples which were not classified due to their difficulty [1, 8, 9].

In this study we have carried out a classification task employing LSA, but, as in these last studies, we have also opted to provide a better vectorial representation of the utterances. To this end, before performing the classification, we pre-processed the utterances using a technique based on a cognitive (or psycholinguistic) model that tries to account for the involvement of prior knowledge in the construction of meaning, and which also perfectly suits the philosophy underlying LSA: the Construction-Integration model [2, 3]. In fact, this study stems from a previous article [1] in which we found that a C-I based technique could be used for call classifiers, but we wanted to test it in a case which involved a voice recognition application, in order to show the performance of the entire system. Both LSA and C-I will be described in more detail in later sections.

2 LSA and Call Routing

LSA was originally described as a method for Information Retrieval [10], although some authors went beyond the original conception and adapted it as a psychological model of the acquisition and representation of lexical knowledge.
In recent years its capacity to simulate aspects of human semantics has been widely demonstrated. For example, it adequately reflects why children learn the meanings of words without the need for massive exposure [11]. It is, to summarize, a technique derived from a field of application, but which models some parts of human linguistic behavior. This in turn can provide benefits for the field of application.

LSA is a vectorial representation of language which is constituted by exploiting word occurrences in different contexts using large linguistic corpora (or text samples). It can be conceived as an automatic sequence of mathematical and computational methods, including pruning, weighting and dimension reduction via Singular Value Decomposition (SVD), to represent the meaning of words and text passages [10, 11]. A key issue is that every term or document (a document is a paragraph or sentence, or in the case of call categorization, a destination cluster or a call) in the corpus is expressed as a k-dimensional vector. Once the whole process is carried out, the cosine of the angle between two vectors is frequently used to evaluate the semantic relationship between two terms or between the two documents corresponding to the vectors (formula 1). A close semantic relationship between two words, or between two documents, is shown by a high cosine, with a value that is close to one, whilst two semantically unrelated words or documents have a cosine that is close to zero (orthogonality) or even slightly negative. In addition, sentences or texts that are not in the document matrix can be introduced as if they actually were included, using a technique commonly known as Folding-in, which projects this new document into the document matrix (formula 2). In the case of categorization, a vector of the user's utterance will be projected, plus one vector for each of the exemplar texts that represent the destinations, in such a way that if they are similar it will be inferred that what the user wants must be routed to this destination. A new vector d can be created by computing an utterance c (a new column vector in the occurrence matrix including all the terms that occur in it) and then multiplying it by the term matrix, usually called U, and the inverse of the diagonal matrix, usually called S; c is also computed by applying the same weights as in the creation of the original space.

\mathrm{Cos}(V_1, V_2) = \frac{V_1 \cdot V_2}{\|V_1\| \, \|V_2\|}    (1) Similarity

d = c^{T} U S^{-1}    (2) Folding-in
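Formulas (1) and (2) translate directly into a few lines of numpy. The sketch below uses random placeholders for the reduced term matrix U and the singular values S, so it only illustrates the projection and the similarity computation, not the implementation used in the system.

```python
import numpy as np

def fold_in(c, U, S):
    """Project a weighted term-count vector c into the reduced document space:
    d = c^T U S^{-1}  (formula 2)."""
    return c @ U @ np.diag(1.0 / S)

def cosine(v1, v2):
    """Cosine similarity between two k-dimensional vectors (formula 1)."""
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Random placeholders: 1,421 terms and k = 270 dimensions as in Section 4.2.1.
rng = np.random.default_rng(0)
U = rng.normal(size=(1421, 270))                # reduced term matrix
S = np.abs(rng.normal(size=270)) + 1e-3         # singular values (diagonal of S)
utterance = rng.integers(0, 3, size=1421).astype(float)    # weighted term counts
destination = rng.integers(0, 3, size=1421).astype(float)  # exemplar term counts

d1 = fold_in(utterance, U, S)
d2 = fold_in(destination, U, S)
print(cosine(d1, d2))
```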
In the field of Call Routing, LSA has been used mainly with two motivations: correction of speech recognition hypotheses and assignment of utterances to destinations (both are occasionally used in the same router). Concerning the correction of speech recognition output, some studies have tested lists of common potential confusions on the part of speech recognition applications (homophones and near-homophones, {affect, effect}, {quiet, quite}), obtaining good results when such mistakes were corrected by checking contextual coherence with indices of semantic similarity taken from an LSA model [12]. These results were better than those of a model that combines a trigram model of the parts of speech with a Bayesian classifier, leading to the conclusion that LSA is a good option in these conditions.

Another study confirmed the usefulness of LSA in conjunction with a model based on n-grams [13]. This study concludes that the analysis by means of n-grams is limited to a text window that is too local (normally 3 or 4 words), and that the LSA model greatly enriches the results from the router if the probabilities of the predicted word given some history sequence are also calculated in terms of the LSA model, that is, by means of the similarity between the word and the text of the history, which is projected as a pseudodocument. This probability, based on the semantic space, will be highest for words whose meaning aligns most closely with the meaning of the history and lowest for words that do not convey any particular information about it, for example function terms. It is argued that this fact is complementary to n-gram models, because it is their exact opposite: n-gram models tend to assign higher probabilities to (frequent) function words than to (rarer) content words. So the added value of this study is to integrate both kinds of probabilities in the generic LM (Language Model) equations. Some authors extend this same procedure, improving the representation of the history sequence that acted as a pseudodocument in the original study [14]. Due to the briefness of the history for the utterances (and hence of its pseudodocument), they replace the history pseudodocument with a more representative one, extracted from its semantic vicinity. They also use the first nearest semantic neighbors to estimate unseen sequences, as is usually done when using smoothing methods. Other studies also obtained good results by interpolating the LSA model and the n-gram model [15], directly correcting speech recognition hypotheses and reassigning new probabilities to them, taking into account the coherence of the lexical context that accompanies each word [16]. Similarly, a newer study performed a series of experiments to correct classification errors [17]. The authors corrected speech recognition outputs using indices of syntactic and semantic coherence (and a combination of both). As for LSA, they reported that it yielded results comparable to those of WordNet [18, 19] (LSA being more economical and flexible), and that, when certain parameters were used, a combination of the two measures led to an improvement in error correction. This same philosophy has also motivated some studies in which the probabilities of the sequences of SLMs - in other words, the likelihood of a word occurring given a history h (the sequence of words that precede it) - are recalculated on the basis of the semantic similarity between the word vector and the vector of its history [20].

As for assigning utterances to destinations, a first study proposed a system in which the user response to the typical "say anything" cue is classified by an LSA module according to candidate destinations [4]. The module will compare the vector representation of what the user has said (the utterance vector) with the vectors that represent each of the destinations, made up of a compilation of all the calls that belong to each of those destinations. This module also has a disambiguation mechanism in the event that the utterance vector is similar to several destinations. In this case, terms will be found that represent the difference vectors between the utterance vector and each of the destination vectors. Once found, only the terms that may form bigrams or trigrams with a term from the original request will be used.
These terms are used to formulate questions for the users in order to disambiguate. One peculiarity of this study is that 4,497 transcriptions from a banking services call-center were used in the LSA training phase. The occurrence matrix that begins the LSA process consists of terms and routes (rather than terms and transcriptions), and as a result few columns are produced - 23, to be precise. This is precisely the criticism made in a later study [21]: the authors pointed out that taking the possible destinations rather than call transcriptions as documents makes the LSA training quite limited. They obtained better results in their laboratory when the documents of the matrix were composed of call transcriptions. Another study introduced a variant in the preprocessing stage [22]. When the corpus was trained, the Information Gain (IG) was calculated in order to identify the terms that actually contributed useful information to the router. The IG index is based on the variations in document entropy (the amount of information carried by a document) with and without the term analyzed. Good results were later also reported when using a training variant in which the labels flagging the transcriptions (the destinations or routes) were entered as terms [23]. This enabled the authors to bring together transcriptions that had been routed to the same places. Other authors have also obtained improvements by introducing an additional step between recognition and Call Routing [24]. These authors did not enter the "utterance" as collected by the ASR module (with a generated SLM) directly; rather, they corrected it by using the LSA model confidence indices. Using this method, calls that were more than eight words long (about 12 words) were routed slightly better, and this improvement is greater if the ASR module has laxer criteria (a lower confidence threshold of acceptance).

As in study [24], the motivation for this study is also to introduce an additional step between recognition and LSA-based Call Routing in order to differentially reassign the importance of each term of each utterance. This is done by means of the Construction-Integration model [2, 3] (henceforth, C-I). C-I is a psycholinguistically motivated model that seeks to simulate working memory and real-time comprehension mechanisms, but it can be applied to categorization tasks such as Call Routing. We will return to this model in a later section.

3 Objectives

There are two main objectives of this study, a general one and a specific one:

1. To implement (and describe in detail) a real full system in which we use LSA and a mechanism based on the Construction-Integration model (C-I).
2. To study the efficiency of the system in a real speech recognition situation by analyzing the percentage of correctly classified utterances under different parameters. The percentage of correctly classified utterances thus reflects the quality of the system. The manipulated parameters were (1) the use or not of an additional step based on C-I (on top of LSA-based Call Routing), (2) the length of utterances, (3) the decrease in recognizer performance, and (4) the number of hypotheses processed from the n-best list returned by the voice recognition application.

4 Functional description of the system

One of the aims of this paper is to present a detailed study of the results obtained in a real call router in which a mechanism based on Kintsch's Construction-Integration model [2] was introduced.
This model had already been partially tested in a previous study [1], but in this study the full process is described, including voice recognition for spontaneous speech and classification according to possible destinations. The system comprises several modules and is domain-independent, as it was trained using texts from different subject areas. For this study, the domain subject area was customer service at a telco (credit balance, top-ups, complaints, phones, etc.).

4.1 The voice recognition phase

Prior to any semantic interpretation, we need a voice recognition stage. The process consists of several layers. The first layer is responsible for recognizing words using the phonetic model of a language. There are numerous packages that perform voice recognition using a phonetic model of a particular language, based on sequences of letters and their pronunciation. These packages usually have standard dictionaries that specify the pronunciation of very common words, as well as dictionaries for the pronunciations that the integrator itself considers to be correct. In general, for recognition of phrases to occur, we need to explicitly specify which phrases we expect will be uttered. They are specified using deterministic grammars in different formats (abnf, grxml, gsl, etc.). But if our aim is the recognition of spontaneous speech, we must generate a statistical language model (SLM), which uses a large linguistic corpus to generate a model where the probabilities of some words appearing are specified, given those that have occurred previously. To calculate these probabilities, Maximum Likelihood Estimation (from now on ML) is commonly used, corrected by a smoothing method that estimates the occurrences of words within some ranges [7]. One of these smoothing methods is Good-Turing.

The package that we use to calculate probabilities in our model is SRILM [25]. It has Good-Turing as the default method (see http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html). It works as follows: by default, the unigrams that occur more than once and the n-grams that occur more than seven times are considered reliable. For these, standard ML is applied to calculate probabilities. But if an n-gram occurs fewer than seven times, a correction is applied to the probability extracted from ML using the Good-Turing smoothing technique. It is also possible to estimate n-grams that do not occur in the reference corpus with the Katz method, using the BOW (back-off weight) of the history of each n-gram and the smoothed probabilities, but for simplicity's sake this is not implemented in our system. Therefore, our model only contains conditional probabilities of n-grams that appear in the training corpus and that have been smoothed (using the Good-Turing method).

As was previously mentioned, all these calculations are carried out by the SRILM package, in which individual scripts are used to program SLM generation - in our case with classes that group words (days of the week, months, countries, cities, etc.). In the end, the complete model is output into a file with the .arpa extension (Advanced Research Projects Agency format), whose main use is for exchanging language models. Using this .arpa file, we generate a .grxml file where the corrected probabilities from the input file are recorded several times in the form of a tree, but this time in SRGS (Speech Recognition Grammar Specification) format, which can be read by many commercially available recognition packages. In our case, we will use the grxml of the generated SLM in a Nuance 9 recognition engine.
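For illustration only, the toy sketch below estimates trigram probabilities by Maximum Likelihood and flags the low-count n-grams that Good-Turing smoothing would adjust; it is a simplified approximation of what SRILM computes, not its implementation, and the seven-occurrence threshold simply mirrors the default mentioned above.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count trigrams and their bigram histories over tokenised sentences."""
    tri, bi = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - 2):
            history = tuple(padded[i:i + 2])
            bi[history] += 1
            tri[history + (padded[i + 2],)] += 1
    return tri, bi

def ml_probability(trigram, tri, bi):
    """Maximum Likelihood estimate P(w | h) = c(h, w) / c(h)."""
    history = trigram[:2]
    return tri[trigram] / bi[history] if bi[history] else 0.0

# Tiny invented corpus, just to show the data flow.
corpus = [["I", "want", "to", "top", "up", "my", "phone"],
          ["I", "want", "to", "cancel", "my", "contract"]]
tri, bi = trigram_counts(corpus)
for trigram, count in tri.items():
    p = ml_probability(trigram, tri, bi)
    # In the real model, counts below seven would receive a Good-Turing correction.
    note = "would be GT-smoothed" if count < 7 else "reliable"
    print(trigram, round(p, 3), note)
```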
4.2 The call-routing phase

Once the recognition hypotheses were established - in other words, what was recognized by the recognition engine - we proceeded with the second process: categorization by destination. In the case of this study, as noted earlier, the categorization procedure is implemented using LSA. So we needed an LSA semantic space in which to project the user's utterance as well as each of the exemplar texts that represent the destinations.

4.2.1 Semantic Space

To obtain that semantic space, we used a training corpus of digitized utterances belonging to several phone companies. The LSA was trained and the semantic space was created using this corpus. These utterances were extracted using the Wizard of Oz procedure (a technique where the users are made to believe that they are interacting with a computer but in fact they interact with a person) and the transcriber labeled each of them with the destination to which it was routed. We should note that the labeling was performed using different criteria, was carried out at different times and at different companies, and is not exhaustive. In any case, to provide cohesion for interrelated words, these labels were regarded as additional words, as in Featured LSA [23]. In the end, the LSA comprises the terms and labels that occur in utterances. In a previous study it was demonstrated that retaining these labels boosts categorization performance and produces a positive interaction with the usage of C-I, even if they are non-exhaustive [1]. It should also be borne in mind that the only utterances that must be exhaustively labeled are those that are part of the destination sample (see section 4.2).

In the LSA training we used Gallito® (see www.elsemantico.es), a tool that has been used on other occasions for the creation of semantic spaces [26, 27, 28]. The words matching a special stop-list for this domain were eliminated, as were all function words. We also eliminated words that did not occur at least three times. Some words which are relevant within the telephony corpus were artificially grouped into a single class, for example countries, mobile phone brands or telecommunications companies, substituting them with the name of the class. In the end, we obtained a matrix of 1,421 terms and 34,905 utterances. In the next step, log-entropy weighting is calculated on this matrix. Log-entropy estimates the amount of information that a word carries in the documents where it appears. In this way, the terms that might contain the most information are given a heavier weight. From this last calculation we obtained a weighted matrix to which SVD is applied, and the three resulting matrices are reduced to 270 dimensions. We chose this dimensionality based on the assumptions made in previous studies [28]. In those studies it was suggested that the optimal number of dimensions for specific domain corpora does not have to be extremely low, sometimes even approaching the 300 dimensions recommended by Landauer et al. [11]. In summary, the result of the entire process, the three reduced matrices (term, diagonal and utterance matrix), is the semantic space of the mobile telephony domain that will be used as a basis for projecting the user's utterance as well as the texts that represent the destinations.
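The weighting and dimension-reduction steps just described can be approximated with numpy/scipy as follows; the toy matrix, the helper names and the use of a truncated SVD are assumptions standing in for the actual space built with Gallito.

```python
import numpy as np
from scipy.sparse.linalg import svds

def log_entropy(counts):
    """Log-entropy weighting of a term-by-document count matrix:
    local log(count + 1) scaled by a global entropy-based term weight."""
    log_counts = np.log(counts + 1.0)
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    n_docs = counts.shape[1]
    entropy_weight = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return log_counts * entropy_weight[:, None]

# Toy matrix standing in for the 1,421 x 34,905 term-by-utterance matrix.
rng = np.random.default_rng(1)
counts = rng.poisson(0.3, size=(500, 2000)).astype(float)
weighted = log_entropy(counts)

k = 270  # number of retained dimensions
U, S, Vt = svds(weighted, k=k)          # truncated SVD
term_vectors = U * S                    # term vectors in the reduced space
print(term_vectors.shape)
```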
Previous LSA-based Call Routing research compared the vector representation of what the user says (what the ASR module returns) in real time with each of the utterances used in training that are labeled with a destination. After the comparison is performed, the label of the most similar exemplar is selected [4]. This label will be the destination selected. The philosophy of our system is similar, although three important points should be highlighted.

The first one is that in our system we do not use all utterances from the training corpus as destination exemplars, but rather focus on a representative sample. In particular, a total of 2,329 calls are part of this sample (they are loaded and transformed into vectors of the semantic space only once, when the router is launched). This is important because it is not necessary for all the calls from the training corpus to be labeled and to participate as exemplars. We manually selected only a few representative calls (an excerpt is shown in Table 1), making the process more economical and the response to changes in the routing model (a change in the definition of the call-center destinations) faster. This is a very important issue for the successive deployment of a Call Routing service [29], since massive annotation is very time-demanding. We made this pre-selection very carefully given that, as was pointed out in some studies [8], system performance depends on the quality of these destination exemplars.

ALTASBAJASC   well I'd like to cancel this phone number
ALTASBAJASC   switch to a monthly contract
ALTASBAJASC   to cancel a monthly contract
ALTASBAJASC   activate a number
ALTASBAJASC   cancel the mobile
ALTASBAJASC   I'd like to get a phone with you

Table 1: Excerpt from the destination exemplars.

The second point is that, to decide which destination or destinations are most credible, we do not simply compare the user utterance individually with each of the utterances in the destination sample. In this system we use a method called average-4, which proved to be better in a previous study [1]. This method averages, for each destination, the cosines of the four exemplars most similar to the user utterance. The chosen destination is the one where the average of these four exemplars is the highest. In this way, any bias that an anomalous exemplar (one seemingly very similar to what the user said) might introduce is eliminated.

The third point is that after converting the user utterance into a vector and selecting the most likely destinations, a list of the first four destinations is returned in descending order of their cosines. When the user is not sent toward the first option, this allows the first destination and the next most likely destinations to be used as candidates, disambiguating with an explicit question. This will depend on the design of the dialogs.

What is common to earlier studies is that both the destination exemplars and the user utterance are converted into vectors that can be interpreted in the LSA space (in other words, use of the semantic vector space is essential). The way the destinations and user utterances are converted into vectors will affect the effectiveness of the system. In the following section we explain different ways of doing so.

4.2.3 Construction-Integration Model vs. direct method

There are two ways of representing each of the utterances vectorially in the router, whether they are destination exemplars or user utterances.
One is Direct routing, where the utterances are projected onto the latent semantic space without any kind of additional algorithm. This is the standard LSA method for constructing new documents in the vector space, and is known as Folding-In [10]. The second is C-I routing, which, in contrast to Direct routing, has an intermediate step in the construction of both destination exemplars and user utterances. This intermediate step is based on a Construction-Integration network (Figure 1). The importance that a Construction-Integration network might have for routers lies in the fact that the words in an utterance are modulated to their correct meaning, taking into account the entire lexical context, which constrains the final meaning. This is particularly relevant for words that are ambiguous (such as "card") - the meaning that best matches the context will be given priority, thus avoiding predominant-sense inundation and other biases frequently observed in vector space models [26, 27].

Figure 1: Graphical view of the Call Routing process for Direct routing and C-I routing.

Computational models based on Construction-Integration [2, 3] are based on the idea that the meaning of a text is constructed and integrated in working memory. The mechanism retrieves the content linked to every word of a text from long-term memory, and, once retrieved, ensures that all this content is integrated in real time using a mutual constraint mechanism in working memory. Therefore, it is a model that seeks to simulate working memory and real-time comprehension mechanisms [30]. Looking at the model details, interpretation of a text is carried out in two phases. In the first phase, all terms related to each of the words that make up the text or phrase are evoked. A network is constructed from all of them, where each one is joined to all the others - this phase is known as the Construction phase. In the second phase, known as the Integration phase, each of these terms receives activation from the others in proportion to their similarity with each other. After this phase, the most activated terms are those related to the main idea of the text [3, 30]. This takes into account the fact that a text is not the sum of the terms included in it, but rather the integration of all of them into a final idea. Each term is constrained by the meaning of the other terms, thus generating the meaning in real time. This type of mechanism has been used on predicate and metaphorical structures [31, 32, 33] and on structures enriched with syntactic dependency relationships, thus demonstrating that the meaning of complex phrases can be modeled [34]. In operational terms, this mechanism makes it possible to differentially reassign the importance of each term. Ultimately, it is an estimate of the relevance of the utterance terms, as was done in previous studies [8]. How is this model implemented in our system? For each utterance, whether it is a user utterance or a destination exemplar, a network is constructed based on Construction-Integration, whose activation leads to the extraction of new terms (see Figure 2). We might say that these terms produce a better definition of the utterance, an idea common to all the words contained in the original utterance. The procedure, in terms of its functionality, is analogous to that used by a study [35] implemented to improve classical methods for evaluating text with LSA. Figure 2 provides a graphic description of the procedure.
Firstly (in the Construction phase), each term of the utterance is compared to all of the terms present in the semantic space, and the 200 neighbors most similar to each of them are extracted (this similarity is calculated using the cosine). A connectionist network is created between all these neighbors (neighbor node layer) and each of the original terms of the utterance (utterance node layer), where the weight of each connection is given by the cosine between the two connected terms (Figure 2). The cosines are calculated using the previously trained semantic space; in other words, each of the terms to be compared is represented by a vector in this space, and any term vector can be compared to another term vector using the cosine. Once the weights of the connections have been assigned, the activation of each node is calculated from the connections it receives, in the second, Integration phase (see formula 3). Thus, the greater the weight of the connections received, the greater the activation. The activation function also favors instances where the source of activation derives from several terms in the utterance, and not just one or two (due to the parameter $\delta_i$ in formula 3):

$A_i = \delta_i \sum_{j=1}^{n} c_{ij}$,   (3)

where $j$ is the sub-index of the utterance layer, $i$ is the sub-index of the neighbors layer, $c_{ij}$ is the strength of the connection that node $i$ receives from node $j$ (the latter node belonging to the first, utterance layer), and $\delta_i$ is a correction factor to avoid unilateral activation (based on the standard deviation of the connection strengths received), as defined in formula 4:

$\delta_i = \dfrac{1}{\sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(c_{ij} - \bar{c}_i\right)^2} + 1}$,   (4)

where $n$ is the total number of connections received by $i$ (i.e. the number of nodes in the utterance layer) and $\bar{c}_i$ is the mean of all the strengths that node $i$ receives.

Finally, the 20 most highly activated neighbors are chosen from all of those activated in the Integration phase, and a new utterance is constructed with them. In other studies an utterance has been replaced by a more representative one [14], but here the aim is also for this new utterance to contain terms which are closer to the meaning originally intended by the user. These 20 terms are used to form a new pseudodocument, this time using Folding-in (see Figure 1), and this pseudodocument is compared with the destination exemplars (created in the same way) in order to assign the utterance a destination. As the reader will realize, there are a few differences between the procedure used to perform C-I in this study and the original mechanism proposed by Kintsch [3]. Because call utterances are shorter and simpler than propositions within colloquial language, the algorithm used here is not exactly the original construction-integration algorithm. The integration part proposed by Kintsch is a spreading-activation algorithm which iterates until the net is stable (the cycle when the change in the mean activation is lower than a parameterized value), whereas our algorithm is a "one-shot" mechanism: the activation of each node is calculated in a single pass from the connections received. Another difference from the C-I algorithm as proposed by Kintsch is that we only consider words, and not propositions or situations.
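To make this pipeline concrete, here is a minimal sketch, under assumed data structures (a dictionary of term vectors standing in for the semantic space, and a centroid standing in for the full folding-in projection), of the one-shot construction/integration step followed by the average-4 decision of section 4.2.2. The parameter values (200 neighbors, 20 output terms, 4 averaged exemplars) follow the text; the function and variable names are hypothetical, not the VB.NET implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def construct_integrate(terms: list[str], space: dict[str, np.ndarray],
                        n_neighbors: int = 200, k_out: int = 20) -> list[str]:
    """One-shot C-I: return the k_out most activated neighbor terms."""
    utter_vecs = [space[t] for t in terms if t in space]
    # Construction phase: gather the nearest neighbors of every utterance term.
    neighbors: set[str] = set()
    for v in utter_vecs:
        ranked = sorted(space, key=lambda w: cosine(space[w], v), reverse=True)
        neighbors.update(ranked[:n_neighbors])
    # Integration phase: activation = delta_i * sum of received strengths,
    # with delta_i = 1 / (std of the received strengths + 1), as in formula 4.
    activation = {}
    for w in neighbors:
        strengths = np.array([cosine(space[w], v) for v in utter_vecs])
        activation[w] = strengths.sum() / (strengths.std() + 1.0)
    return sorted(activation, key=activation.get, reverse=True)[:k_out]

def pseudo_doc(terms: list[str], space: dict[str, np.ndarray]) -> np.ndarray:
    """Centroid of term vectors, standing in here for the folding-in projection."""
    return np.mean([space[t] for t in terms if t in space], axis=0)

def route_average4(utterance_vec: np.ndarray,
                   exemplars: dict[str, list[np.ndarray]]) -> str:
    """average-4: mean of the four highest exemplar cosines per destination."""
    best = {}
    for destination, vecs in exemplars.items():
        sims = sorted((cosine(utterance_vec, v) for v in vecs), reverse=True)
        best[destination] = float(np.mean(sims[:4]))
    return max(best, key=best.get)
```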
In any case, note that the original C-I is more complete and fine-grained, but our mechanism is sufficient for our purposes and may be more flexibly programmed, because an OOP (Object Oriented Programming) paradigm has been used, with classes such as net, layer, node, connection, etc., instead of the iterative vector-by-matrix multiplication of the original (see [39] for details of the original conception).

Figure 2: C-I based net implemented in this study. The K most strongly activated neighbors (marked with a square) will be represented in a pseudodocument.

5 Software and Architecture used

This section describes the application that we have implemented in our lab, and the auxiliary software and technologies that have been used. In the last part, we describe the place of each module in the global architecture.

5.1 IVR Application

The IVR application is implemented using VXML technology (Voice Extensible Markup Language), located on a Tomcat application server (icon D in Figure 3). The server dynamically generates VXML files and sends them to the VXML interpreter, in this case VoxPilot (icon B in Figure 3), as it requests them and according to the application flow. First, the interpreter requests the start and welcome VXML files, then the VXML where the user is asked to request a service ("say anything"). This second VXML has voice recognition and therefore requires a grammar - in this case a grxml grammar generated using the SLM. In this way, VoxPilot sends the user response to the voice recognition application (icon C in Figure 3) along with the path where the grammar is located on the application server. Finally, the ASR module returns the recognized text with its confidence score. This recognized text is sent, in the request for the next VXML located on the application server, to the semantic router (icon E in Figure 3) using the SOAP protocol (Simple Object Access Protocol, a standard protocol that defines how two objects in different processes can communicate by exchanging XML data). The semantic router returns a list with the four most likely destinations. The first destination from the list will be inserted into the VXML that is sent once again to VoxPilot for it to run the definitive Call Routing routine.

5.2 ASR Module

The recognition engine used is Nuance 9® (icon C in Figure 3), located on the Speech Server®, also by Nuance, with a test license that allows the use of 4 channels in non-production environments. This recognition engine accepts grammars programmed in the standard .grxml (SRGS, Speech Recognition Grammar Specification) format, so it is a good match for our SLM.

5.3 Semantic Router

The router is basically dedicated to the task of receiving texts from the application server (icon D in Figure 3) and returning a list with the 4 most likely routes, or simply returning a rejection if a confidence threshold is not reached. As we explained above, the router application is based on the LSA and C-I framework and was programmed in VB.NET using the object-oriented programming (OOP) paradigm. It is accessible as a web service on a Microsoft IIS server, which offers a series of functions and procedures such as loading the destination exemplars and converting them into vectors, and loading the reference semantic space (the semantic space previously generated with the Gallito® tool mentioned above).
It also has a configuration file which specifies the acceptance thresholds of the logistic function, the method of converting utterances into pseudodocuments (C-I Routing and its parameters, or Direct Routing) and the classes that will be used according to the training (months, days, mobile phone brands, countries, etc.). The fact that it is implemented as a web service means it can be integrated into a SaaS (Software as a Service) structure that allows it to route service utterances in a centralized manner. When the destination has been assigned, this module returns a list with the four most likely destinations and their cosines.

Figure 3: Software, technologies, modules and system architecture used in our lab.

6 Evaluation of the system

In this section we describe the procedure that we followed to examine the efficiency of the system in different configurations. First we will present the variables that were manipulated. Second, we will explain the procedure and the method used to test them.

6.1 Dependent variable and independent variables

The dependent variable used to assess the efficiency of the system was whether the destination was correctly classified or not. The first independent variable is called Routing Method and has to do with how LSA vectorially represents the test utterances and the utterances that represent the destinations. There are two methods (see section 4.2.3): Direct routing and C-I routing. The second independent variable is the Number of Captures, the number of utterances captured by the voice recognition application that were passed as input to the categorization module. The voice recognition application can be programmed to return only one text containing what it has recognized of the participant's voice message. However, as this captured text derives from a probability-based model, it can also be programmed to return the second, or the third most probable text, and so on. Bearing in mind that by combining the first two options we can create a text including more correctly recognized words than using only the first option, this independent variable contains two levels: the first recognized utterance and the concatenation of the first and second utterances. We named the two levels Captur1 and Captur2. A third independent variable is Accuracy, which is measured by the value F' (see the method section). This value F' is broken down into two groups: high and low accuracy (see below for a more detailed description). Lastly, another independent variable is the number of tokens, that is, the length in words of the transcriptions of the utterances. Two groups are again formed: short and long utterances.

6.2 Procedure and Method

The evaluation method is as follows: 1,872 (randomized) audio files were obtained from real calls to several telecommunications companies. All of these audio files were then transcribed, to create the ideal condition where the ASR module captures the utterance perfectly. Recognition of each of these audio files was forced using the Nuance® ACC_TEST tool. The outputs from this tool provided the first two captures of what had been recognized (the first two outputs from the n-best list). To obtain objective measures of the third independent variable, Accuracy, measured as the deviation between what is recognized and what is transcribed, Information Retrieval measures were used. Their usage is justified by some studies as a substitute for WER (Word Error Rate) [36].
The measures used for IR were Recall, Precision and the combination of both in F'. Recall is the proportion of the transcription that is present in the recognition; Precision is the proportion of the recognition that is present in the transcription. These two measures range from 0 to 1, and are combined in an index, F', which is the harmonic mean of the precision and recall measures. To calculate them, we used the most popular natural language package in the Python environment, the Natural Language Toolkit (NLTK) [37], available at http://www.nltk.org/. Using this tool we extracted the recall, precision and F' of each recognition compared to its transcription, revealing the recognition loss, and these were introduced as variables in the overall analysis of the router (becoming the Accuracy variable). It is important to note that both precision and recall have been calculated using only relevant terms, meaning those that will be considered by the router. Function words and words from the stop-list, for example, are excluded from the calculations. At the same time, we also counted the number of tokens in each transcription to obtain a measure of the fourth independent variable, which gave us an idea of the length of the phrases uttered by users to make system requests.

In the second phase the router is forced to categorize the text of each audio clip, beginning with the first recognition capture (Captur1), followed by the first two captures concatenated together (Captur2). This operation is repeated twice, once with the router working in Direct Routing mode, and once with C-I Routing. As a result we obtain all the combinations of independent variables (Routing Method x Number of Captures to route). In addition, the router is forced to categorize the transcriptions, simply to have a baseline, but without introducing these results into the analysis matrix. The Call Routing is considered correct if the first destination returned by the router coincides with the ideal destination previously assigned to the utterance by a human. With this data we now have all the necessary conditions and all the grouping variables. A matrix is formed and we proceeded to carry out a Repeated Measures ANOVA (Routing Method x Number of Captures to route) with two grouping variables (Accuracy - high or low, and N° of tokens - long or short); in summary, a 2x2x2x2 ANOVA. The results will be extracted from this analysis, and their implications will be examined in the discussion. At the end of the analysis, a logistic function is proposed to create acceptance zones in the router. There is a first zone where the first hypothesis returned is accepted as valid and the call is routed to this destination; a second, disambiguation zone, where we might disambiguate between the four hypotheses returned by the router, asking the user to choose; and a third, rejection zone, where any hypothesis is assumed to be erroneous and the call is not routed. The results of this mechanism will be shown as the percentage of correct call routings.

7 Results

In this section we will present the results of the study. First we will present the percentages of correctly routed calls under all possible conditions, including those which involve voice recognition. In addition, we also present the results of an ANOVA which displays the interactions between the variables involved: Routing Method, Accuracy, number of tokens and Number of Captures.
Finally, the results of introducing an acceptance criterion by means of a logistic function are presented, as well as the application of various confidence levels for a potential disambiguation strategy.

7.1 Performance of the ASR module (F')

As suggested above, one of the proposed means of measuring the ASR module's performance is applying Information Retrieval indices. To be precise, our system provides the following indices: Recall = .912, Precision = .959, F' = .932. Very often, categories are interrelated or overlap, such that not even a human classifier could be sure which category to assign. Ultimately, whilst both may even be correct, the choice between one or the other is binary and exclusive, so the results may be understated. To correct this, some evaluations of Call Routing have used not only binary coincidence but also ratings of the fitness of the destination returned, or a scale of possible success [38]. We have not used this type of evaluation due to the high cost involved, which lessens the information in our results and implies that they may be understated.

7.2 Decrease in performance caused by voice recognition

Another of the relevant issues to be examined is the lowered performance of the whole system when speech recognition output utterances are routed rather than transcriptions. If we consider only the router's first hypothesis (Table 1), the results show that between the best condition with transcriptions (.75 with C-I) and the best condition with recognition (.67, also with C-I), we lose 8 percentage points in the rate of correct choices. If we consider the first four hypotheses returned (Table 2), between the best condition with transcriptions, also using C-I (.91), and the best with recognition, again using C-I (.85), the difference is reduced to 6 percentage points. We should remember, though, that our application has not undergone any optimization process for either voice recognition or the model of categories, and that the main aim of this study is to check the scenarios in which C-I might behave more productively. We will look at this in the following section.

1st hypothesis
            Trans   Captur1   Captur2
No C-I      .69     .63       .62
C-I         .75     .67       .66

Table 1: Percentages of correctly routed calls, considering only the first hypothesis returned by the router. Captur1 and Captur2 represent the conditions where one or two speech recognition hypotheses respectively are used.

Accumulated first 4 hypotheses
            Trans   Captur1   Captur2
No C-I      .89     .83       .83
C-I         .91     .85       .84

Table 2: Percentages of correctly routed calls, considering the four hypotheses returned by the router. Captur1 and Captur2 represent the conditions in which one or two speech recognition hypotheses respectively are used.

7.3 Results of the ANOVA

Another of the aims of this study was to analyze the performance of each method in different scenarios - specifically bearing in mind the Accuracy of recognition measured with F', and the number of tokens in user utterances. For this purpose the ANOVA described in the method section was carried out, extracting both main effects and interactions.

Figure 4: Interaction between Routing Method and Number of Tokens.

Three significant main effects were in fact found. The first of them relates to Routing Method (F(1,1735) = 18.13, MSE = .087, p < .001): C-I is better than Direct Routing as a general effect. The second is the N° of tokens (F(1,1735) = 34.07, MSE = .637, p < .001): short phrases boost the router's effectiveness. The third is F' (F(1,1735) = 128.14, MSE = .637, p < .001).
Higher Accuracy of voice recognition also increases the router's performance.

Figure 5: Interaction between Routing Method and Accuracy.

Given that we also found significant interaction effects with the variables above, we will focus on these, leaving the main effects as supplementary information. Firstly, we find a two-way interaction effect between the factors Routing Method and N° of tokens (F(1,1735) = 4.45, MSE = .087, p = .035). The loss of effectiveness observed when the user utters long phrases is greater with Direct Routing than with C-I (Figure 4). Both methods show reduced effectiveness with long utterances, but C-I less so, hence we could say it has a corrective effect. We also found a significant effect for the two-way interaction between Routing Method and Accuracy (F(1,1735) = 15.41, MSE = .087, p < .001). C-I shows beneficial effects if the quality of the recognition (measured by F') is good (Figure 5); otherwise, C-I works the same as Direct Routing. Lastly, we found a significant effect for the two-way interaction between Number of Captures and Accuracy (F(1,1735) = 8.60, MSE = .087, p = .003). The results of this last interaction show how concatenating the first two speech recognition captures is advantageous compared to using only one, if the quality of the recognition is low (Figure 6).

Figure 6: Interaction between Number of Captures and Accuracy.

7.4 Acceptance levels

Beyond straight performance data - in other words, the percentage of occasions on which a returned hypothesis is correct - we must introduce acceptance levels into the process. These maximize the correct choices and minimize errors by using a confidence threshold above which the route proposed by the system is accepted (formula 5). Previous studies have often introduced objective acceptance criteria into the process, such as logistic functions of acceptance [4]. In our case we will use the same parameters used by a similar study [1] in the logistic function, as they gave good results. In this function two parameters act as predictor variables, and the correct choice / error rate as our dependent variable. The parameters are as follows: firstly, we use the cosine of the first hypothesis (the average of the 4 largest cosines with the exemplars of this destination). As the second parameter, we use the difference between this first cosine and the cosine of the second hypothesis. The latter parameter takes into account the level of certainty about whether the first hypothesis should be returned rather than the second:

$P(Y = 1) = \dfrac{1}{1 + e^{\,2.17 - 3.02\,d\mathrm{cos} - 2.97\,\mathrm{cos}}}$,   (5)

where cos is the cosine of the first hypothesis and dcos is the difference between the cosines of the first and second hypotheses. With the output from the logistic function, we define three zones of acceptance: the first one (0.5 - 1), above which we accept the label directly; the second one (0.4 - 0.5), where we then disambiguate (using a question) between the four destinations proposed by the router (those that have the greatest cosine); and the third one (0 - 0.4), where the hypothesis is directly rejected. In this way, not only do we obtain the percentage of correct choices within the acceptance zone, but calls that fall in the intermediate zone can also be retrieved. Disambiguation mechanisms, such as preparing questions using the four hypotheses returned, can be built in this zone. For example, if the logistic function returned 0.46, the hypothesis returned would not be rejected, but rather disambiguated using the four hypotheses returned by the router, in the hope that the correct route would be among them, which is very likely.
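A minimal sketch of this acceptance mechanism is shown below, reusing the coefficients of formula 5; the zone boundaries follow the text, while the function names and the example cosines are illustrative only.

```python
import math

def acceptance_probability(first_cos: float, second_cos: float) -> float:
    """Logistic function of formula 5: first-hypothesis cosine and its margin."""
    d_cos = first_cos - second_cos
    return 1.0 / (1.0 + math.exp(2.17 - 3.02 * d_cos - 2.97 * first_cos))

def routing_decision(first_cos: float, second_cos: float) -> str:
    p = acceptance_probability(first_cos, second_cos)
    if p >= 0.5:
        return "accept"        # route directly to the first hypothesis
    if p >= 0.4:
        return "disambiguate"  # ask the user to choose among the four hypotheses
    return "reject"            # do not route the call

print(routing_decision(0.82, 0.55))   # illustrative cosines only
```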
The results (Table 3) show that this disambiguation mechanism would be productive. In the acceptance zone (0.5 - 1), the percentage of correct destinations would be 76.54%. If we also disambiguated in the next zone of acceptance (0.4 - 0.5), we would guarantee that 81% of the time the correct destination would be among the four hypotheses used to ask the question. In addition, our data would contain 215 calls that would be rejected directly as they are in the rejection zone (0 - 0.4). This amounts to only 12% of the calls.

                                  Error            Correct choice
                                  N       %        N       %
0.4 to 0.5 (disambiguation)       39      18.14    176     81.86
0.5 to 1 (1st hypothesis)         308     23.46    1005    76.54

Table 3: Correct choice rates in the two zones of acceptance.

8 Discussion

The overall results of this system are encouraging. Considering that it is a pilot study performed without optimizing voice recognition, and using a hypothetical customer service operation, the results are very satisfactory. To offer some raw data, 67% of utterances were assigned a destination on which router and human agreed (with spontaneous speech recognition and categorization). As explained above, this figure is conservative if we consider the possibility that the destinations overlap or that there is ambiguity in the human assignment of an utterance to a destination. In addition, by introducing an acceptance criterion and using disambiguation mechanisms, a large proportion of the remaining utterances (81% of the total utterances) can be assigned to a correct destination. In any case, since this is an academic study, our aim was to focus on checking the performance of some methods in certain scenarios and not so much on the overall results, which could be improved upon.

Our system brings several features together. One of them is that the router is based on LSA, a technique that is independent of the subject domain and is also fairly economical in terms of implementation. We have inserted an additional layer in this router, based on the assumption that the meaning of an utterance is not the sum of its words; rather, it is important to take into account how these words activate others that are not explicit, but participate in the final meaning. Whilst it is not identical, this layer is based on the Construction-Integration model, which makes the same assumption. We have also decided to summarize the final routing destinations in a file of sample utterances that represent them (golden destinations), thus avoiding the need to reclassify all utterances in the training corpus whenever the organization of destinations changes. This provides added flexibility to the system and reduces the response times to changes. All this has been integrated with a voice recognition engine (supported by a Statistical Language Model or SLM), whose outputs are passed to the router in order to assign destinations. In order to maximize the number of correctly recognized words, not only the first hypothesis output from the ASR module is routed: there is an option where the concatenated first and second hypotheses can also be routed. This was tested on audio clips of real calls, with an experimental design that allowed us to evaluate performance in scenarios depending on the length of utterances and the accuracy of voice recognition. The analysis performed yielded various findings. Firstly, C-I is better than the direct method, in particular when it comes to cushioning the drop in performance in certain scenarios.
It is true that in conditions where recognition has greatly declined, the contribution of C-I is not important, but when recognition is not bad, the C-I method seems to behave best with long utterances. Whilst this is not the case in a model like C-I, some previous studies have found benefits in long utterances if the speech recognition outputs are corrected by means of similarities extracted from LSA [24]. It should come as no surprise that in our system the C-I method performed best for long utterances, given that the original C-I model was created to account for longer propositions [2] and that the presence of a number of terms in the phrase facilitates the building of a context. This helps to over-weight the words that are within this context and to under-weight those that are not - for example, substitution or insertion errors. It also biases the meaning of ambiguous terms toward a meaning coherent with this context. The great contribution of this type of model is that it objectively mimics the process carried out in working memory while processing texts. As a text is being read or listened to, it is available in working memory, which retrieves content related to each word in the text from long-term memory. This is the construction phase. In the integration phase, a mutual constraint mechanism is applied to this linked content [34] in order to extract the key idea. Therefore it is only to be expected that the higher the number of words in working memory (up to a threshold for simultaneous processing), the more data are available to carry out this integration correctly.

Secondly, although this is approximate, complementary data, we have also tried using two voice recognition captures in the Call Routing process. The results are modest, although they show a subtle trend. When recognition is expected to decline, there is a tendency for taking two hypotheses rather than one to improve the results. We sense that occasionally the errors committed in the first hypothesis are not committed in the second, and vice versa.

Thirdly, we have seen that introducing a logistic function with some parameters helps to form acceptance criteria, above which correct choices are maximized, either by correctly rejecting the label or by accepting a label that later proves to be correct. By doing so, the results rise to 76.54% accuracy (either correct choices or correct rejections). We have also shown that improved performance results from setting up three zones of acceptance: the first one (0.5 - 1), above which we accept the label directly; the second one (0.4 - 0.5), where we then disambiguate (in the form of a question) between the four destinations proposed by the router; and the third one (0 - 0.4), where the label is rejected directly. In this way, we achieve 79.3% correct Call Routing, and the correct destination of the utterances that remain in the intermediate zone for disambiguation would be among the four labels proposed in the question (the four returned by the router) in 81.56% of cases. Thus we have not only the percentage of correct choices in the acceptance zone, but also those that arise from disambiguation. It is clear that this requires a cost in terms of disambiguation question design, and a cost in terms of satisfaction, but it might be a good way to gradually implement the system. There is one last thing to note: this system's adaptability to change.
On the one hand, although in this study we have used labels to identify the destinations for utterances in the training corpus, acceptable results can be obtained without them. We could even extend the training corpus, combining labeled with unlabeled parts, or parts labeled using different criteria. If we did these things, we would find a small, controlled reduction in performance, but not an abrupt drop [1]. On the other hand, the only labels that need to be exhaustive are those that identify the utterances that act as destination exemplars. These exemplars form part of a chosen sample of training utterances, but represent only a small percentage of them. Since there are few of them, they can be examined, changed or expanded quickly. Thus any change in the routing model (for example, a re-dimensioning of skills) can be dealt with within acceptable response times, and with no need to retrain all utterances, which would slow down the process.

9 Conclusion

In view of these results, the possibility of using psycholinguistic models in information retrieval or utterance categorization systems is encouraging. Simulating human processing is not an easy task. It is not even easy to describe, but it is useful to reflect upon it in order to find possible improvements to current systems. In summary, LSA represents a very flexible and economical means of implementing Call Routing, and also allows us to explore algorithms and methods derived from research in Cognitive Science that may prove very promising. In this case, we have presented a system that combines LSA with a network based on Construction-Integration. The results obtained are good, although fine tuning is needed to optimize the voice recognition process, as well as a more coherent organization of the destinations (which would be the case in a working production system). In any case, what has proved most interesting about this study is not the overall results in absolute terms, but rather testing the C-I layer and how it works. This has offered quite promising results, demonstrating superiority in some scenarios and stability in others. We believe that establishing links between computational science and psycholinguistics can help to find ways to optimize current categorization systems.

Acknowledgements

The authors wish to thank Airenas Vaiciunas from the Department of Applied Informatics at Vytautas Magnus University (Lithuania) for his useful comments about SLMs, and also Semantia Lab (www.semantialab.es) for supporting the logistics of this research.

References

[1] Jorge-Botana, G., Olmos, R. & Barroso, A. (2012). The Construction-Integration algorithm: A means to diminish bias in LSA-based Call Routing. International Journal of Speech Technologies, 15(2): pp. 151-164.
[2] Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95: pp. 163-182.
[3] Kintsch, W., & Welsch, D. The construction-integration model: A framework for studying memory for text. In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock. Hillsdale, NJ: Erlbaum. 1991: pp. 367-385.
[4] Chu-Carroll, J. & Carpenter, B. (1999). Vector-based natural language call routing. Computational Linguistics, 25(3): pp. 361-388.
[5] Brumby, D. & Howes, A. Good enough but I'll just check: Web page search as attentional refocusing. Proceedings of the 6th International Conference on Cognitive Modelling, pp. 46-51. 2004.
[6] Brumby, D. & Howes, A.
Interdependence and past experience in menu choice assessment. In: Alterman, R., Kirsh, D. (Eds.), Proceedings of the 25th Annual Conference of the Cognitive Science Society. Lawrence Erlbaum Associates, Mahwah, NJ, p. 1320. 2003.
[7] Jurafsky, D. & Martin, J. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall. 2009.
[8] Gasanova, T., Zhukov, E., Sergienko, R., Semenkin, E. & Minker, W. A Semi-supervised Approach for Natural Language Call Routing. Proceedings of the 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). August 2013, Metz, France.
[9] Sarikaya, R., Hinton, G. & Ramabhadran, B. Deep Belief Nets for Natural Language Call-Routing. In ICASSP-2011.
[10] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41: pp. 391-407.
[11] Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, pp. 211-240.
[12] Jones, M. P., & Martin, J. H. Contextual spelling correction using latent semantic analysis. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 163-176. 1997.
[13] Bellegarda, J. R. Exploiting latent semantic information in statistical language modeling. In Proceedings of the IEEE, 88(8), pp. 1279-1296. 2000.
[14] Jen-Tzung Chien, Meng-Sung Wu and Hua-Jui Peng. (2004). Latent semantic language modeling and smoothing. International Journal of Computational Linguistics and Chinese Language Processing, vol. 9, no. 2, pp. 29-44.
[15] Pucher, M. & Huang, Y. Combination of latent semantic analysis based language models for meeting recognition. In Proceedings of the Second IASTED International Conference on Computational Intelligence, San Francisco, California, USA, November 20-22, 2006.
[16] Lim, B. P., Ma, B., & Li, H. Using Semantic Context to Improve Voice Keyword Mining. In Proceedings of the International Conference on Chinese Computing (ICCC 2005), 21-23 March 2005, Singapore.
[17] Shi, Y. An Investigation of Linguistic Information for Speech Recognition Error Detection. Ph.D. thesis, University of Maryland, Baltimore County, Baltimore. 2008.
[18] Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: pp. 39-41.
[19] Fellbaum, C. (1998, ed.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[20] Wandmacher, T. & Antoine, J. Methods to Integrate a Language Model with Semantic Information for a Word Prediction Component. In Proceedings of EMNLP-CoNLL, pp. 506-513. 2007.
[21] Cox, S. & Shahshahani, B. A Comparison of some Different Techniques for Vector Based Call-Routing. In Proceedings of the 7th European Conference on Speech Communication and Technology, Aalborg. 2001.
[22] Li, L. & Chou, W. Improving latent semantic indexing based classifier with information gain. In Proceedings of the International Conference on Spoken Language Processing, ICSLP-2002, pp. 1141-1144. September 16-20, 2002, Denver, Colorado, USA.
[23] Serafin, R. & Di Eugenio, B. (2004). FLSA: Extending Latent Semantic Analysis with features for dialogue act classification. In Proceedings of ACL04, 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July.
[24] Tyson, N. & Matula, V. C.
Improved LSI-Based Natural Language Call Routing Using Speech Recognition Confidence Scores. In Proceedings of EMNLP. 2004.
[25] Stolcke, A. SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado, September 2002.
[26] Jorge-Botana, G., Leon, J. A., Olmos, R. & Hassan-Montero, Y. (2010). Visualizing polysemy using LSA and the predication algorithm. Journal of the American Society for Information Science and Technology, 61(8), pp. 1706-1724.
[27] Jorge-Botana, G., Olmos, R. & Leon, J. A. (2009). Using LSA and the predication algorithm to improve extraction of meanings from a diagnostic corpus. Spanish Journal of Psychology, 12(2), pp. 424-440.
[28] Jorge-Botana, G., Leon, J. A., Olmos, R., & Escudero, I. (2010). Latent Semantic Analysis Parameters for Essay Evaluation using Small-Scale Corpora. Journal of Quantitative Linguistics, 17(1), pp. 1-29.
[29] Suendermann, D., Hunter, P. & Pieraccini, R. Call Classification with Hundreds of Classes and Hundred Thousands of Training Utterances ... and No Target Domain Data. In Perception in Multimodal Dialog Systems, Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008, Springer-Verlag, pp. 81-87.
[30] Kintsch, W., Patel, V., & Ericsson, K. A. (1999). The role of long-term working memory in text comprehension. Psychologia, 42: pp. 186-198.
[31] Kintsch, W. (2000). Metaphor comprehension: A computational theory. Psychonomic Bulletin & Review, 7, pp. 257-266.
[32] Kintsch, W. (2001). Predication. Cognitive Science, 25, pp. 173-202.
[33] Kintsch, W., & Bowles, A. (2002). Metaphor comprehension: What makes a metaphor difficult to understand? Metaphor and Symbol, 17, pp. 249-262.
[34] Kintsch, W. Meaning in context. In Landauer, T. K., McNamara, D., Dennis, S. & Kintsch, W. (Eds.), Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum. 2007: pp. 89-105.
[35] Olmos, R., Leon, J. A., Jorge-Botana, G., & Escudero, I. (2009). New algorithms assessing short summaries in expository texts using Latent Semantic Analysis. Behavior Research Methods, 41, pp. 944-950.
[36] McCowan, Moore, Dines, Gatica-Perez, Flynn, Wellner & Bourlard. On the use of information retrieval measures for speech recognition evaluation. Technical Report. IDIAP. 2005.
[37] Bird, S. and Loper, E. NLTK: The Natural Language Toolkit. In Proceedings of ACL04, 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, July 2004.
[38] Malmström, P. E. Methods for Evaluating a Natural Language Call Routing Application: A Case Study. Master's thesis, Uppsala University, Department of Linguistics and Philology. 2010.

An Exact Analytical Grossing-Up Algorithm for Tax-Benefit Models

Miroslav Verbič, Mitja Čok and Tomaž Turk
University of Ljubljana, Faculty of Economics, Kardeljeva ploščad 17, 1000 Ljubljana, Slovenia

Keywords: deductive data imputation, household budget survey, microsimulation, tax-benefit models

Received: November 27, 2014

In this paper, we propose a grossing-up algorithm that allows for gross income calculation based on tax rules and observed variables in the sample. The algorithm is applicable in tax-benefit microsimulation models, which are mostly used by taxation policy makers to support government legislative processes.
Typically, tax-benefit microsimulation models are based on datasets where only the net income is known, though data about gross income are needed to successfully simulate the impact of taxation policies on the economy. The algorithm that we propose allows for an exact reproduction of a missing variable by applying a set of taxation rules that are known to refer to the variable in question and to other variables in the dataset during the data generation process. Researchers and policy makers can adapt the proposed algorithm with respect to the rules and variables in their legislative environment, which allows for complete and exact restoration of the missing variable. The algorithm incorporates an estimation of partial analytical solutions and a trial-and-error approach to find the initial true value. Its validity was proven on a set of tax rule combinations at different levels of income that are used in contemporary tax systems. The algorithm is generally applicable, with some modifications, for data imputation on datasets derived from various tax systems around the world.

Povzetek: Članek predstavlja algoritem obrutenja, ki omogoča izračunavanje bruto dohodkov iz neto dohodkov ob širokem naboru davčnih pravil različnih davčnih sistemov. Algoritem omogoča reproduciranje manjkajočih spremenljivk in je široko uporaben pri mikrosimulacijskem modeliranju.

1 Introduction

There are various techniques for data imputation, which give the researcher an opportunity to remedy the situation when the dataset is not complete. This does not come without costs, since data imputation can easily introduce biased parameter estimates in statistical applications [1, 2], or in other domains [3, 4]. Imputation techniques rely on deterministic and stochastic approaches, mostly under the assumption that the variable in question is in some way related to other variables under investigation. In this paper, we explore a case of the deductive approach [5]: the possibility of estimating a missing variable by applying a set of rules that are known to refer to the variable in question and to other variables in the dataset. This set of rules might be enforced in various contexts, for instance by legislation, government policy, or other institutional or social constraints. If there is a consistent set of rules, which are enforced in practice, and the rules are comprehensive, the researcher can develop a formal algorithm with respect to the rules and variables in the dataset, which allows for the data imputation of the missing variable.

Let us consider the case of household budget survey (hereinafter HBS) datasets. HBSs are implemented at the national level in EU member states [6]; taxpayers report their net income for different income sources (e.g. wages, rents, pensions), as well as socio-economic data, which enable estimations of tax allowances and tax credits. HBS datasets are highly valuable for many microsimulation tax-benefit models. Such models are standard tools in academia and in the financial industry, and they underpin everyday policy decisions and government legislative processes [7, 8, 9, 10]. Gross income represents the starting point for any tax simulation (including tax-benefit modelling), but HBS datasets usually report only net amounts. As noted in [7], one possibility to generate gross income is the statistical approach based on information on both net and gross income.
Using this information, a statistical model can be developed that yields estimates of net/gross ratios. These estimates are then applied to net incomes in order to compute gross amounts. The second known technique is the iterative algorithm that exploits the tax and contribution rules already built into tax-benefit models to convert gross income into net income [8, 9]. The procedure takes different levels of gross income for each observation, applies the tax rules, calculates the net income, and compares it with the actual net income, iterating until the implied gross income fits the actual net income within approximation limits. Both techniques, namely the statistical approach and the iterative algorithm, give gross income values that are estimates and not the actual gross income values.

The task is not trivial, since modern tax systems include various rules for taxation and their combinations, and they usually involve a bracketing system for one or more parameters. The involvement of bracketing systems (and especially their combinations) means that the calculation of net income from gross income is not an analytically invertible function. In this paper, we present a solution to this problem, namely an algorithm that enables a full restoration of the gross income value. The algorithm includes a set of analytical inversions combined with a trial-and-error approach to deal with bracketing system combinations. The proposed algorithm allows for the calculation of gross income from net income for a broad set of taxation possibilities, where only information on net income is available, along with information on tax reliefs. The algorithm is feasible in cases of proportional and progressive tax schedules of personal income tax (hereinafter PIT) and social security contributions. It also covers tax allowances as well as tax credits. It is thereby generally applicable to contemporary tax systems around the world. The validity and accuracy of the proposed algorithm was tested by applying it to a synthetic sample of taxpayers using an artificial system of PIT and social security contributions. A comparison of gross income, calculated from net income using the proposed technique, with the initial gross income demonstrates the complete accuracy of the algorithm.

The rest of the paper is organized as follows. In Section 2, we analyse taxation rules that are used in contemporary taxation systems. The analysis is a basis for the formalization of the imputation algorithm, which is explained in Section 3, including detailed solutions and proofs for various combinations of tax rules and bracketing systems. A test of the validity and accuracy of the proposed algorithm is presented in Section 4. In the Conclusion, the proposed algorithm is presented in its full form, which can be directly applied in practice.

2 Analysis of taxation rules

Gross income is the starting point for the taxation of personal income, for which we can distinguish three basic approaches [11, 12]: comprehensive income tax, dual income tax, and flat tax. Under a comprehensive income tax system, all types of labour and capital income are taxed uniformly by the same progressive tax schedule. A dual income tax system retains progressive rates on labour income, while introducing a proportional tax rate on capital income, e.g. the Scandinavian dual income tax [13].
The third option, which has been dominating income tax reforms in Eastern Europe [14, 15], is the flat-tax concept, although it is noted that this concept has not been implemented in any country in Western Europe [6]. Here we follow the most comprehensive procedure for the taxation of gross income, which is presented in Table 1 and includes a combination of progressive tax schedules and flat rates, with the addition of tax allowances and tax credits.

  Gross income
  - Social security contributions (a)
      ■ determined by the social security contributions schedule
      ■ set as a proportion of gross income
      ■ given in absolute amounts
  - Other costs related to the acquisition of net income
      ■ determined by a schedule
      ■ set as a proportion of gross income
      ■ given in absolute amounts
  - Tax allowances
  = Personal income tax base
  × Personal income tax rate (b)
  = Initial personal income tax
  - Personal income tax credit
  = Final personal income tax

  Net income = Gross income - Social security contributions - Final personal income tax

  (a) Employee social security contributions
  (b) Either a single (flat) tax rate or set by the tax schedule

Table 1: General procedure for taxing gross income.

Table 1 contains the general procedure for the taxation of gross income. From gross income, employee social security contributions and other costs related to the acquisition of income (e.g. travel allowances or standardized costs set as a proportion of gross income) are deducted. Further, the tax allowances are subtracted and the tax base is obtained, which is subject to a PIT calculation using the tax schedule or a proportional (flat) tax rate. In this way, the initial PIT is calculated, which may be further reduced by a tax credit in order to calculate the final PIT and the net income. From the taxation point of view, other costs related to the acquisition of income have consequences identical to social security contributions or tax allowances. Therefore, our further development implicitly incorporates these costs into the concepts of social security contributions and tax allowances. When schedules are applied, the PIT schedule or social security contributions schedule consists of a number of tax brackets with different marginal tax rates. The amount of PIT is calculated from the tax base according to the PIT schedule. Likewise, the amount of social security contributions is determined by gross income and the social security contributions schedule. In general, at the annual level tax bases from different income sources are summed up into a single tax base, which is subject to a single-rate schedule, and then the final annual PIT is calculated. An alternative option is a dual-tax system, where the PIT is calculated separately for different income sources (multiple-rate schedule).

The procedure from Table 1 covers the existing tax systems to a great extent. In several OECD countries [16], the employee social security contributions are determined by a schedule (e.g. Austria, France) or set as a proportion of gross income (e.g. Spain, Norway), while social security contributions set as absolute amounts are not very common and can be found, e.g., in Slovenia for certain categories of the self-employed. The algorithm applies the logic of social security contributions to the other costs related to the acquisition of net income (e.g. costs connected with real estate maintenance in the case of taxing income from rents) and tax allowances (e.g. allowances for children or for housing loan interest), which are found across tax systems.
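Before turning to these variations, a minimal sketch may help to make the procedure in Table 1 concrete: the forward calculation from gross to net income with a contributions schedule, a PIT schedule, an allowance and a credit. The bracket margins, rates, allowance and credit below are invented for illustration and do not correspond to any actual tax system.

```python
# Brackets are (lower_margin, upper_margin, marginal_rate); all values are
# purely illustrative, not any actual tax system.
SSC_SCHEDULE = [(0, 10_000, 0.15), (10_000, float("inf"), 0.10)]
PIT_SCHEDULE = [(0, 8_000, 0.16), (8_000, 20_000, 0.27), (20_000, float("inf"), 0.41)]

def schedule_amount(base: float, schedule) -> float:
    """Apply a bracketed schedule to a base amount."""
    total = 0.0
    for lower, upper, rate in schedule:
        if base > lower:
            total += rate * (min(base, upper) - lower)
    return total

def net_from_gross(gross: float, allowance: float = 3_000.0,
                   credit: float = 200.0) -> float:
    """Table 1, top to bottom: contributions, tax base, initial PIT, credit."""
    ssc = schedule_amount(gross, SSC_SCHEDULE)
    tax_base = max(gross - ssc - allowance, 0.0)
    pit = max(schedule_amount(tax_base, PIT_SCHEDULE) - credit, 0.0)
    return gross - ssc - pit

print(net_from_gross(25_000.0))
```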
The algorithm also covers the case of a social security contributions ceiling (e.g. in Austria or Germany). Regarding the calculation of PIT, the algorithm covers the prevailing progressive PIT schedule, as well as flat-tax systems (e.g. in Hungary or Bulgaria). However, the algorithm (more precisely, the equations that cover specific combinations of tax parameters) has to be adapted to certain country specifics, which are not explicitly set out by the procedure from Table 1. For example, if social security contributions are not included in the PIT base, the gross income should be calculated by the algorithm assuming that social security contributions are zero. Another example refers to the above-mentioned social security contributions ceiling. In this case, a zero-rate bracket of the social security contributions schedule above the set ceiling should be applied in the equation that is suitable for the specific combination of tax parameters of the particular country.

The algorithm hereinafter is derived for the case where there is only one PIT instrument and one social security contributions (SSC) instrument, thus two instruments in total. However, in actual fiscal systems there are cases where two or more PIT or SSC instruments are applied to a single income source at the same time. In these cases, the net-to-gross conversion is more complicated, since two (or more) PIT or SSC instruments have to be "compressed" into a single PIT or SSC instrument.

During the year, when a particular income source is paid out, the advance (in-year) PIT is usually paid at the time of disbursing the income source. This advance PIT is taken into account once the final annual PIT is calculated (i.e. the advance PIT is consolidated with the annual PIT). This procedure is called withholding. Understanding the mechanism of (in-year) advance PIT is important when we are dealing with survey data, such as HBS datasets. In a typical survey, respondents report their net income from different income sources for a certain period of the year, when their income sources are only subject to (in-year) advance PIT. In order to calculate the overall annual gross income, the reported net income from different income sources should first be grossed up using algorithms that take account of the various rules of advance (in-year) PIT and social security contributions for each income source separately. The focus of our paper is the development of these grossing-up algorithms for different income sources. Once the grossed-up amounts from different income sources are calculated, they can be summed up into the overall taxpayer's annual gross income, which is the starting point for building a microsimulation model.

Thus, the calculation of gross income from the net income of a single income source (e.g. the calculation of gross wages from given net wages) depends on different combinations of tax parameters from Table 1, which are described in detail as an algorithm in this paper. For example, when gross wages are subject to (a) a progressive PIT schedule, (b) social security contributions set as a proportion of the gross wage with a ceiling, and (c) tax allowances, but (d) no tax credits (e.g. the in-year taxation of wages in Croatia), equation (9) should be applied. Since a ceiling on social security contributions is set, the applied social security contributions rate above the ceiling should be zero.
Table 1 can be transformed into the following expression:

$N = G - S - PIT$,  (1)

where N and G represent net and gross income, respectively, S is the sum of social security contributions, and PIT is the personal income tax. Social security contributions S are a function of gross income. Similarly, PIT is a function of the tax base, which is the difference between gross income (reduced by social security contributions) and tax allowances, TA. This can be generalized as follows:

$N = G - f_S(G) - f_{PIT}(G - f_S(G) - TA)$.  (2)

Function $f_S(G)$ can be defined in practice in different ways. A common approach is to use a schedule system, but it can also be defined as a proportion of gross income or as an absolute amount. In practice, function $f_{PIT}(G - f_S(G) - TA)$ is usually defined by a schedule system (different from the schedule system for social security contributions). As mentioned, function $f_{PIT}$ can also incorporate the concept of a tax credit. Our task is to estimate gross income G from expression (2), given the known values of N and TA and a set of constraints that are usually given by the social security contributions and PIT schedule systems, or by other legislative rules. The combination of two schedule systems makes solving equation (2) for G particularly challenging. The solution we propose in this paper has a trial-and-error nature. The idea is to prepare the set of all possible (PIT and social security contributions) bracket combinations. Then, for each taxpayer, we calculate 'candidate' gross income values for each bracket combination, calculate net incomes from these candidate gross incomes, and compare the results to the starting value of net income. The gross income candidate whose net income fits (equals) the starting net income is the true gross income value. The fit is exact, i.e. we find the actual gross income in a non-iterative way. The following section describes the construction and design of the procedures we propose to deal with different income sources taxed by different rules. First, the general setup of the grossing-up algorithm is explained. Sections 3.1 and 3.2 then set out a detailed examination of various taxation rules for social security contributions and tax crediting, together with the proposed grossing-up procedures for specific combinations of tax rules.

3 Data imputation algorithm

In this section, we explore a general setup where the tax system involves a combination of the following elements: (1) a social security contributions schedule, (2) a PIT schedule, and (3) tax allowances. This general setup forms the basis for the development of the proposed algorithm. In the next steps, we incorporate other tax complexities, i.e. other rules for calculating social security contributions and various rules for determining tax credits. Function $f_S(G)$ can be expanded by the rules of the social security contributions schedule to:

$f_S(G) = S = Sr_s (G - L_s) + \sum_{j=1}^{s-1} Sr_j (H_j - L_j)$,  (3)

i.e. for each bracket, social security contributions are equal to the marginal social security contributions rate $Sr_s$, multiplied by the difference between gross income G and the lower bracket margin (s denotes the social security contributions bracket into which G falls). This amount is added to the social security contributions collected in full in all 'lower' brackets (i.e. brackets 1 to s-1). $H_s$ and $L_s$ denote the upper and lower margins of social security contributions bracket s.
Similarly, function $f_{PIT}$ can be expanded by the rules of the schedule system for PIT to:

$f_{PIT}(G) = PIT = Tr_b (G - f_S(G) - TA - L_b) + \sum_{i=1}^{b-1} Tr_i (H_i - L_i)$,  (4)

where $Tr_b$ is the marginal tax rate for the PIT bracket b into which the tax base falls, $L_b$ is the lower margin of bracket b, $Tr_i$ is the marginal rate for bracket i, and $H_i$ and $L_i$ are the upper and lower margins of bracket i, respectively. By combining (3) and (4), we obtain:

$N_{sb} = G_{sb} - \left[ Sr_s (G_{sb} - L_s) + \sum_{j=1}^{s-1} Sr_j (H_j - L_j) \right] - \left[ Tr_b \left( G_{sb} - Sr_s (G_{sb} - L_s) - \sum_{j=1}^{s-1} Sr_j (H_j - L_j) - TA - L_b \right) + \sum_{i=1}^{b-1} Tr_i (H_i - L_i) \right]$.  (5)

The above equation holds for an individual taxpayer when PIT was calculated by the tax authorities in such a way that social security bracket s corresponds to gross income G, and PIT bracket b corresponds to (G - S - TA). Since we do not know the actual G and S, we cannot directly establish which PIT and social security contributions brackets (and corresponding marginal rates) were actually used for each individual taxpayer by the tax authorities. By reordering expression (5), we can express gross income as follows:

$G_{sb} = \dfrac{N_{sb} - Sr_s L_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA - Tr_b \Sigma_s + \Sigma_s + \Sigma_b}{(Sr_s - 1)(Tr_b - 1)}$,  (6)

where

$\Sigma_b = \sum_{i=1}^{b-1} Tr_i (H_i - L_i)$ and $\Sigma_s = \sum_{j=1}^{s-1} Sr_j (H_j - L_j)$.

Following our general trial-and-error scheme, the grossing-up algorithm is as follows:

1. For each statistical unit, calculate the matrix with K · B candidate gross incomes as its elements, according to equation (6):

$G = \begin{bmatrix} G_{11} & \cdots & G_{1B} \\ \vdots & & \vdots \\ G_{K1} & \cdots & G_{KB} \end{bmatrix}$,

where K and B are the number of social security contributions brackets and the number of PIT brackets, respectively, both defined by the social security contributions and PIT systems, and where k = 1, …, K and l = 1, …, B.

2. Calculate the net incomes from the matrix of candidate gross incomes according to the tax rules:

$N = \begin{bmatrix} N_{11} & \cdots & N_{1B} \\ \vdots & & \vdots \\ N_{K1} & \cdots & N_{KB} \end{bmatrix}$.

3. In the above matrix, find the net income $N_{kl}$ which is equal to the starting net income for this individual taxpayer: $N = N_{kl}$.

4. The actual gross income G for this individual taxpayer is then $G = G_{kl}$.

In the next subsections, we discuss the following extensions to this general setup: (1) social security contributions are not determined by the schedule system, but as a proportion of gross income or as an absolute amount (Section 3.1), and (2) tax credits are included according to various rules for their determination (Section 3.2).

3.1 Variations of social security contributions

In this section, we extend the general setup to include cases where social security contributions are not determined by a schedule, but as a proportion of gross income or as an absolute amount.

3.1.1 Social security contributions as a proportion of gross income

When social security contributions are set as a proportion of gross income, equation (2) can be rewritten as:

$N = (1 - Sr) G - f_{PIT}((1 - Sr) G - TA)$,  (7)

where Sr is the rate of social security contributions, expressed as a proportion of gross income. By simplifying equation (5), we obtain:

$N_b = (1 - Sr) G_b - \left[ Tr_b ((1 - Sr) G_b - TA - L_b) + \Sigma_b \right]$,  (8)

which holds for each PIT bracket b. By reordering, we can express the gross income with the equation:

$G_b = \dfrac{N_b - Tr_b TA - Tr_b L_b + \Sigma_b}{(1 - Sr)(1 - Tr_b)}$.  (9)

From here, we can proceed according to the general setup outlined above.
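The following is a minimal, self-contained sketch of the trial-and-error search (steps 1-4) for the schedule-SSC / schedule-PIT case without tax credits, using equation (6) for the candidates; the bracket encoding, function names and the numeric tolerance are assumptions of this sketch, not the authors' code, and the example figures are those of the synthetic test in Section 4.

```python
# Minimal sketch of the general candidate-matrix search (steps 1-4),
# schedule-based SSC and PIT, no tax credit. Illustrative only.
from typing import List, Tuple

Bracket = Tuple[float, float, float]   # (lower margin L, upper margin H, marginal rate)
INF = float("inf")

def schedule_amount(base: float, brackets: List[Bracket]) -> float:
    """Amount due under a progressive schedule for a given base."""
    return sum(rate * (min(base, upper) - lower)
               for lower, upper, rate in brackets if base > lower)

def cumulated(brackets: List[Bracket], k: int) -> float:
    """Sigma term of equation (6): amounts collected in full in brackets below k."""
    return sum(rate * (upper - lower) for lower, upper, rate in brackets[:k])

def net_from_gross(gross: float, ssc: List[Bracket], pit: List[Bracket],
                   allowance: float) -> float:
    """Forward rules of Table 1 without a tax credit."""
    s = schedule_amount(gross, ssc)
    return gross - s - schedule_amount(gross - s - allowance, pit)

def candidate_gross(net: float, allowance: float, ssc: List[Bracket],
                    pit: List[Bracket], s: int, b: int) -> float:
    """Equation (6) for SSC bracket s and PIT bracket b."""
    Ls, _, Srs = ssc[s]
    Lb, _, Trb = pit[b]
    sig_s, sig_b = cumulated(ssc, s), cumulated(pit, b)
    numerator = (net - Srs * Ls - Trb * Lb + Srs * Trb * Ls
                 - Trb * allowance - Trb * sig_s + sig_s + sig_b)
    return numerator / ((Srs - 1.0) * (Trb - 1.0))

def gross_up(net: float, allowance: float, ssc: List[Bracket],
             pit: List[Bracket], tol: float = 0.01) -> float:
    """Keep the candidate whose recomputed net income equals the input (steps 2-4)."""
    for s in range(len(ssc)):                      # step 1: K x B candidates
        for b in range(len(pit)):
            g = candidate_gross(net, allowance, ssc, pit, s, b)
            if abs(net_from_gross(g, ssc, pit, allowance) - net) < tol:
                return g
    raise ValueError("no bracket combination reproduces the reported net income")

if __name__ == "__main__":
    ssc = [(0, 10_000, 0.17), (10_000, 40_000, 0.20), (40_000, INF, 0.00)]
    pit = [(0, 20_000, 0.15), (20_000, 50_000, 0.25), (50_000, INF, 0.45)]
    g0 = 49_433.10                                  # example taxpayer from Section 4
    n0 = net_from_gross(g0, ssc, pit, allowance=2_000.0)
    print(round(gross_up(n0, 2_000.0, ssc, pit), 2))   # recovers the initial gross income
```

The search is non-iterative in the sense of the paper: every candidate is a closed-form expression, and the forward recomputation merely identifies which bracket combination the tax authorities actually used.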
3.1.2 Social security contributions as an absolute amount

When social security contributions are set as an absolute amount, we can simplify equation (5) to:

$N_b = G_b - S - \left[ Tr_b (G_b - S - TA - L_b) + \Sigma_b \right]$,  (10)

and by reordering we obtain:

$G_b = \dfrac{N_b + S - Tr_b S - Tr_b TA - Tr_b L_b + \Sigma_b}{1 - Tr_b}$.  (11)

3.2 Grossing-up procedure when PIT is subject to a tax credit

A tax credit means that PIT is reduced by a certain amount (called a tax credit), so that the gross income source is effectively taxed not with the full PIT (the 'initial PIT') but with the PIT reduced by the amount of the tax credit (the 'final PIT'). In practice, if a tax credit is calculated to be greater than the initial PIT, the final PIT is zero and net income N equals gross income G reduced by social security contributions, as the net income cannot exceed that amount (i.e. a tax credit can be at most as high as the initial PIT). In various tax systems, a tax credit can be defined in three ways: (1) as a proportion of the initial PIT, (2) as a proportion of the gross income, or (3) as an absolute amount.

3.2.1 Tax credit as a proportion of the initial PIT

In general, we can express a tax credit given as a proportion of the initial PIT as:

$N = G - f_S(G) - f_{PIT}(G - f_S(G) - TA) + C_{PIT} \cdot f_{PIT}(G - f_S(G) - TA)$,  (12)

where $C_{PIT}$ is the share of the tax credit in the initial PIT. Following the above, we can write:

$N_{sb} = G_{sb} - \left[ Sr_s (G_{sb} - L_s) + \Sigma_s \right] - (1 - C_{PIT}) \left[ Tr_b \left( G_{sb} - Sr_s (G_{sb} - L_s) - \Sigma_s - TA - L_b \right) + \Sigma_b \right]$,  (13)

which holds for a specific combination of social security contributions and PIT brackets. Solving (13) for G, we obtain:

$G_{sb} = \dfrac{N_{sb} - Sr_s L_s - Tr_b L_b + C_{PIT} Tr_b L_b + Tr_b Sr_s L_s - C_{PIT} Tr_b Sr_s L_s - Tr_b TA + C_{PIT} Tr_b TA - \Sigma_s \left( Tr_b (1 - C_{PIT}) - 1 \right) - (C_{PIT} - 1) \Sigma_b}{(1 - Sr_s)(1 - Tr_b + C_{PIT} Tr_b)}$.  (14)

From here, we can proceed according to the general setup outlined above. When the tax credit is set as a proportion of the initial PIT and social security contributions are defined by a schedule, the above equation should be used instead of expression (6) in the general setup.

Tax credit as a proportion of the initial PIT and social security contributions as a proportion of the gross income

Where social security contributions are set as a proportion of the gross income, the general procedure can be simplified. In this case, the net income can be expressed as:

$N = (1 - Sr) G - f_{PIT}((1 - Sr) G - TA) + C_{PIT} \cdot f_{PIT}((1 - Sr) G - TA)$.  (15)

The following equation then holds for a particular PIT bracket b:

$N_b = (1 - Sr) G_b - (1 - C_{PIT}) \left[ Tr_b ((1 - Sr) G_b - TA - L_b) + \Sigma_b \right]$.  (16)

By reordering we obtain:

$G_b = \dfrac{N_b - (1 - C_{PIT})(Tr_b L_b + Tr_b TA - \Sigma_b)}{(1 - Sr)(1 - Tr_b + C_{PIT} Tr_b)}$.  (17)

Thus, when a tax credit is set as a proportion of the initial PIT and social security contributions are set as a proportion of the gross income, the above equation should be used instead of expression (6) in the general setup.

Tax credit as a proportion of the initial PIT and social security contributions as an absolute amount

If social security contributions are set as an absolute amount, we can redefine equation (10) to incorporate a tax credit given as a proportion of the initial PIT:

$N_b = G_b - S - (1 - C_{PIT}) \left[ Tr_b (G_b - S - TA - L_b) + \Sigma_b \right]$,  (18)

and by reordering we obtain:

$G_b = \dfrac{N_b + S + (C_{PIT} - 1)(Tr_b L_b + Tr_b S + Tr_b TA - \Sigma_b)}{1 - Tr_b + C_{PIT} Tr_b}$.  (19)

Thus, when a tax credit is set as a proportion of the initial PIT and social security contributions are set as an absolute amount, the above equation should be used instead of expression (6) in the general setup.
3.2.2 Tax credit as a proportion of the gross income

If the amount of a tax credit is defined as a proportion of the gross income, the net income calculation can be formalized as:

$N = G - f_S(G) - f_{PIT}(G - f_S(G) - TA) + C_G \cdot G$,  (20)

where $C_G$ is the tax credit share of the gross income. For clarity, we can denote the initial PIT as:

$PIT_i = f_{PIT}(G - f_S(G) - TA)$  (21)

and the final PIT as:

$PIT_f = f_{PIT}(G - f_S(G) - TA) - C_G \cdot G$.  (22)

Since a tax credit can be at most as high as the initial PIT, the following rule applies:

$N = \begin{cases} G - f_S(G) - PIT_f & \text{if } C_G \cdot G \le PIT_i; \\ G - f_S(G) & \text{if } C_G \cdot G > PIT_i. \end{cases}$  (23)

Due to this rule, the gross income cannot be easily estimated from net income N and tax allowances TA, as $PIT_i$ and $C_G \cdot G$ are not known at this stage. The rule implies that the actual calculation of net income N for each taxpayer was done by the tax authorities either by:

$N = G - f_S(G) - f_{PIT}(G - f_S(G) - TA) + C_G \cdot G$  (24)

when $C_G \cdot G \le f_{PIT}(G - f_S(G) - TA)$, or by:

$N = G - f_S(G)$  (25)

when $C_G \cdot G > f_{PIT}(G - f_S(G) - TA)$. When we are interested in G, we can use these two approaches in reverse fashion (calculating G and not N), but we do not know which of the two, (24) or (25), is the correct one. Let us consider the case when we calculate G for a particular taxpayer from the known values of N, TA and the PIT schedule (as in Table 1), once by using the rule expressed in equation (24) and once by using the rule expressed in (25). We obtain two estimates of the taxpayer's gross income G:

$G' = N + f_S(G) + f_{PIT}(G - f_S(G) - TA) - C_G \cdot G$  (26)

and

$G'' = N + f_S(G)$.  (27)

If the net income N for this particular taxpayer was actually calculated according to expression (24), the following inequality holds:

$\left( N + f_S(G) + f_{PIT}(G - f_S(G) - TA) - C_G \cdot G \right) \ge \left( N + f_S(G) \right)$,  (28)

since $C_G \cdot G \le f_{PIT}(G - f_S(G) - TA)$ must hold. By using (26) and (27), we obtain:

$G' \ge G''$.  (29)

The proper value of gross income G is G', since net income N for this particular taxpayer was actually calculated according to expression (24). Let us consider the opposite case, where net income N for our taxpayer was actually calculated (by the tax authorities) according to (25). In this case, we can write:

$\left( N + f_S(G) + f_{PIT}(G - f_S(G) - TA) - C_G \cdot G \right) < \left( N + f_S(G) \right)$,  (30)

since $C_G \cdot G > f_{PIT}(G - f_S(G) - TA)$ must hold. By using (26) and (27), we obtain:

$G' < G''$.  (31)

The proper value of gross income G in this case is G''. Following (29) and (31), we can conclude that in both cases the higher of the two values G' and G'' is the one that actually holds:

$G = \max(G', G'')$  (32)

or

$G = \max\left( \left( N + f_S(G) + f_{PIT}(G - f_S(G) - TA) - C_G \cdot G \right), \left( N + f_S(G) \right) \right)$.  (33)
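A minimal sketch of this decision rule, using the proportional-SSC closed forms (43) and (44) derived below; parameter and function names are this sketch's own and do not appear in the paper.

```python
# Minimal sketch of rule (32) for a tax credit proportional to gross income,
# with social security contributions as a proportion of gross income,
# using the closed forms (43) and (44) derived below. Illustrative only.
def gross_from_net_prop_credit(net: float, allowance: float, Sr: float,
                               Trb: float, Lb: float, sigma_b: float,
                               Cg: float) -> float:
    """Return G = max(G', G'') for a single PIT bracket b."""
    # G': the credit was granted in full, equation (43)
    g_credit = (net - Trb * Lb - Trb * allowance + sigma_b) / \
               (1.0 + Cg - Sr - Trb + Sr * Trb)
    # G'': the credit exceeded the initial PIT, so no PIT was due, equation (44)
    g_no_pit = net / (1.0 - Sr)
    return max(g_credit, g_no_pit)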
For the construction of a general setup in the case of tax credits given as a proportion of the gross income, where social security contributions and PIT are calculated according to their schedules, we need to express equation (33) in a more exact way, for a specific combination of social security contributions and PIT brackets. The specific form of equation (26) is then:

$N_{sb} = G_{sb} - \left[ Sr_s (G_{sb} - L_s) + \Sigma_s \right] - \left[ Tr_b \left( G_{sb} - Sr_s (G_{sb} - L_s) - \Sigma_s - TA - L_b \right) + \Sigma_b \right] + C_G G_{sb}$,  (34)

and the specific form of equation (27) is:

$N_{sb} = G_{sb} - \left[ Sr_s (G_{sb} - L_s) + \Sigma_s \right]$.  (35)

From expression (34) we obtain:

$G'_{sb} = \dfrac{N_{sb} - Sr_s L_s - Tr_b \Sigma_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA + \Sigma_s + \Sigma_b}{1 + C_G - Sr_s - Tr_b + Sr_s Tr_b}$,  (36)

and from expression (35):

$G''_{sb} = \dfrac{N_{sb} - Sr_s L_s + \Sigma_s}{1 - Sr_s}$.  (37)

According to expression (32), we can establish the right value of gross income $G_{sb}$:

$G_{sb} = \max(G'_{sb}, G''_{sb})$.  (38)

Tax credit as a proportion of the gross income and social security contributions as a proportion of the gross income

Where social security contributions are set as a proportion of the gross income, the calculation of net income N for each taxpayer was done by the tax authorities either by:

$N = (1 - Sr) G - f_{PIT}((1 - Sr) G - TA) + C_G \cdot G$  (39)

when $C_G \cdot G \le f_{PIT}((1 - Sr) G - TA)$, or by:

$N = (1 - Sr) G$  (40)

when $C_G \cdot G > f_{PIT}((1 - Sr) G - TA)$. The reasoning is similar to that above, where we constructed equations (34) and (35). These two equations can be simplified, since there is only one social security contributions rate Sr, and we obtain:

$N_b = (1 - Sr) G_b - \left[ Tr_b ((1 - Sr) G_b - TA - L_b) + \Sigma_b \right] + C_G G_b$  (41)

and

$N_b = (1 - Sr) G_b$.  (42)

From this, we obtain two solutions for $G_b$:

$G'_b = \dfrac{N_b - Tr_b L_b - Tr_b TA + \Sigma_b}{1 + C_G - Sr - Tr_b + Sr\,Tr_b}$  (43)

and

$G''_b = \dfrac{N_b}{1 - Sr}$,  (44)

which should be used in the general setup instead of (36) and (37), respectively. In this case, the matrix of candidate solutions is one-dimensional (a vector of gross income candidates, i.e. one value for each PIT bracket), since there is only one social security contributions rate.

Tax credit as a proportion of the gross income and social security contributions as an absolute amount

In this case, the procedure can follow the same principles we used to construct equations (34) and (35). Since social security contributions are now set as an absolute amount, these two equations can be simplified to:

$N_b = G_b - S - \left[ Tr_b (G_b - S - TA - L_b) + \Sigma_b \right] + C_G G_b$  (45)

and

$N_b = G_b - S$.  (46)

The gross income for both cases can then be calculated from:

$G'_b = \dfrac{N_b + S - Tr_b S - Tr_b L_b - Tr_b TA + \Sigma_b}{1 + C_G - Tr_b}$  (47)

and

$G''_b = N_b + S$,  (48)

which should be used in the general setup instead of (36) and (37), respectively. Again, the matrix of candidate solutions is one-dimensional (a vector of gross income candidates, i.e. one value for each PIT bracket), since there is only one social security contributions amount.

3.2.3 Tax credit as an absolute amount

If the amount of a tax credit is defined as an absolute amount, the procedure is similar to the one described in Section 3.2.2. The net income can be expressed as:

$N = G - f_S(G) - f_{PIT}(G - f_S(G) - TA) + C$,  (49)

where C is the amount of the tax credit. The initial PIT is the same as in Section 3.2.2, equation (21), and the final PIT is:

$PIT_f = f_{PIT}(G - f_S(G) - TA) - C$.  (50)

The following rule applies:

$N = \begin{cases} G - f_S(G) - PIT_f & \text{if } C \le PIT_i; \\ G - f_S(G) & \text{if } C > PIT_i. \end{cases}$  (51)

If net income N for a particular taxpayer was actually calculated (by the tax authorities) according to the first branch of (51), the following inequality holds:

$\left( N + f_S(G) + f_{PIT}(G - f_S(G) - TA) - C \right) \ge \left( N + f_S(G) \right)$,  (52)

since $C \le f_{PIT}(G - f_S(G) - TA)$ must hold. In the opposite case, i.e. if net income N was calculated according to the second branch of (51), then:

$\left( N + f_S(G) + f_{PIT}(G - f_S(G) - TA) - C \right) < \left( N + f_S(G) \right)$,  (53)

since $C > f_{PIT}(G - f_S(G) - TA)$ must hold. Following a reasoning similar to that in Section 3.2.2, we can conclude that the actual gross income G for a particular taxpayer must be:

$G = \max\left( \left( N + f_S(G) + f_{PIT}(G - f_S(G) - TA) - C \right), \left( N + f_S(G) \right) \right)$.  (54)

Quantity G' can be estimated from:

$N_{sb} = G_{sb} - \left[ Sr_s (G_{sb} - L_s) + \Sigma_s \right] - \left[ Tr_b \left( G_{sb} - Sr_s (G_{sb} - L_s) - \Sigma_s - TA - L_b \right) + \Sigma_b \right] + C$,  (55)

and solving for $G_{sb}$:

$G'_{sb} = \dfrac{N_{sb} - C - Sr_s L_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA - Tr_b \Sigma_s + \Sigma_s + \Sigma_b}{(1 - Sr_s)(1 - Tr_b)}$,  (56)

whereas $G''_{sb}$ and the selection of $G_{sb}$ are obtained as in (37) and (38).
We can conclude that in cases where the amount of a tax credit is defined as an absolute amount, the general setup is the same as that described in Section 3.2.2, except for equation (36), which should be substituted by equation (56).

Tax credit as an absolute amount and social security contributions as a proportion of the gross income

When social security contributions are set as a proportion of the gross income, the calculation of net income N for each taxpayer was done by the tax authorities either by:

$N = (1 - Sr) G - f_{PIT}((1 - Sr) G - TA) + C$  (57)

when $C \le f_{PIT}((1 - Sr) G - TA)$, or by:

$N = (1 - Sr) G$  (58)

when $C > f_{PIT}((1 - Sr) G - TA)$. The reasoning is similar to that in Section 3.2.2. Equation (41) can be rewritten in the following form:

$N_b = (1 - Sr) G_b - \left[ Tr_b ((1 - Sr) G_b - TA - L_b) + \Sigma_b \right] + C$,  (59)

and from this we can obtain:

$G'_b = \dfrac{N_b - C - Tr_b L_b - Tr_b TA + \Sigma_b}{(1 - Sr)(1 - Tr_b)}$,  (60)

which should be used in the general setup instead of (43), whereas equation (44) also applies in this case for obtaining G''. Again, the matrix of candidate solutions is one-dimensional (a vector of gross income candidates, i.e. one value for each PIT bracket).

Tax credit as an absolute amount and social security contributions as an absolute amount

In this case, the procedure can follow the same principle we introduced in Section 3.2.2. Since social security contributions are given as an absolute amount, equation (34) can be written in this way:

$N_b = G_b - S - \left[ Tr_b (G_b - S - TA - L_b) + \Sigma_b \right] + C$,  (61)

whereas equation (46) also holds in the case when social security contributions are set as an absolute amount. The gross income can then be calculated from (61) as:

$G'_b = \dfrac{N_b - C + S - Tr_b S - Tr_b L_b - Tr_b TA + \Sigma_b}{1 - Tr_b}$,  (62)

which should be used in the general setup instead of (36), together with (48), which was derived from (46).

3.3 Algorithm in its full form

For clarity, the grossing-up procedure developed above is summarized below, including all combinations of the taxation rules described in the subsections following the basic setup at the beginning of Section 3.

1. For each statistical unit, calculate the matrix of K · B candidate gross incomes according to the appropriate equation:

$G = \begin{bmatrix} G_{11} & \cdots & G_{1B} \\ \vdots & & \vdots \\ G_{K1} & \cdots & G_{KB} \end{bmatrix}$,

where K and B are the number of social security contributions brackets and the number of PIT brackets, respectively, both defined by the social security contributions and PIT systems, and where k = 1, …, K and l = 1, …, B. Formulas for the specific combinations of taxation rules can be found in Table 2. In cases where only the PIT schedule system is used and the social security contributions related to the acquisition of the income are set by a single parameter, the above matrix of candidate gross incomes becomes a vector $\{G_1, \ldots, G_l, \ldots, G_B\}$.

2. Calculate the net incomes from the matrix of candidate gross incomes according to the tax rules:

$N = \begin{bmatrix} N_{11} & \cdots & N_{1B} \\ \vdots & & \vdots \\ N_{K1} & \cdots & N_{KB} \end{bmatrix}$ or $\{N_1, \ldots, N_l, \ldots, N_B\}$.

3. In the above matrix, find the net income $N_{kl}$ (or $N_l$) which is equal to the starting net income: $N = N_{kl}$ (or $N = N_l$).

4. The actual gross income G is then: $G = G_{kl}$ (or $G = G_l$).
System without tax credits:
  I   Schedule for social security contributions, equation (6):
      G_sb = [N_sb - Sr_s L_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA - Tr_b Σ_s + Σ_s + Σ_b] / [(1 - Sr_s)(1 - Tr_b)]
  II  Social security contributions as a proportion of gross income, equation (9):
      G_b = [N_b - Tr_b TA - Tr_b L_b + Σ_b] / [(1 - Sr)(1 - Tr_b)]
  III Social security contributions as an absolute amount, equation (11):
      G_b = [N_b + S - Tr_b S - Tr_b TA - Tr_b L_b + Σ_b] / (1 - Tr_b)

Tax credit as a proportion of the initial PIT:
  IV  Schedule for social security contributions, equation (14):
      G_sb = [N_sb - Sr_s L_s - Tr_b L_b + C_PIT Tr_b L_b + Tr_b Sr_s L_s - C_PIT Tr_b Sr_s L_s - Tr_b TA + C_PIT Tr_b TA - Σ_s(Tr_b(1 - C_PIT) - 1) - (C_PIT - 1)Σ_b] / [(1 - Sr_s)(1 - Tr_b + C_PIT Tr_b)]
  V   Social security contributions as a proportion of gross income, equation (17):
      G_b = [N_b - (1 - C_PIT)(Tr_b L_b + Tr_b TA - Σ_b)] / [(1 - Sr)(1 - Tr_b + C_PIT Tr_b)]
  VI  Social security contributions as an absolute amount, equation (19):
      G_b = [N_b + S + (C_PIT - 1)(Tr_b L_b + Tr_b S + Tr_b TA - Σ_b)] / (1 - Tr_b + C_PIT Tr_b)

Tax credit as a proportion of gross income, with G = max(G', G''), equation (38):
  VII  Schedule for social security contributions, equations (36) and (37):
       G'_sb = [N_sb - Sr_s L_s - Tr_b Σ_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA + Σ_s + Σ_b] / (1 + C_G - Sr_s - Tr_b + Sr_s Tr_b)
       G''_sb = [N_sb - Sr_s L_s + Σ_s] / (1 - Sr_s)
  VIII Social security contributions as a proportion of gross income, equations (43) and (44):
       G'_b = [N_b - Tr_b L_b - Tr_b TA + Σ_b] / (1 + C_G - Sr - Tr_b + Sr Tr_b),   G''_b = N_b / (1 - Sr)
  IX   Social security contributions as an absolute amount, equations (47) and (48):
       G'_b = [N_b + S - Tr_b S - Tr_b L_b - Tr_b TA + Σ_b] / (1 + C_G - Tr_b),   G''_b = N_b + S

Tax credit as an absolute amount, with G = max(G', G''), equation (38):
  X   Schedule for social security contributions, equations (56) and (37):
      G'_sb = [N_sb - C - Sr_s L_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA - Tr_b Σ_s + Σ_s + Σ_b] / [(1 - Sr_s)(1 - Tr_b)]
      G''_sb = [N_sb - Sr_s L_s + Σ_s] / (1 - Sr_s)
  XI  Social security contributions as a proportion of gross income, equations (60) and (44):
      G'_b = [N_b - C - Tr_b L_b - Tr_b TA + Σ_b] / [(1 - Sr)(1 - Tr_b)],   G''_b = N_b / (1 - Sr)
  XII Social security contributions as an absolute amount, equations (62) and (48):
      G'_b = [N_b - C + S - Tr_b S - Tr_b L_b - Tr_b TA + Σ_b] / (1 - Tr_b),   G''_b = N_b + S

Table 2: Equations for specific combinations of taxation rules.

4 Results and discussion

Table 3 presents a summary of all possible combinations of social security contributions and tax credit rules explored in Section 3. In reality, one of these combinations is applicable to any income source. In parallel, the PIT schedule system and tax allowances in absolute amounts are assumed. Our approach can also be applied to flat PIT systems (i.e. with a single proportional PIT rate); in this case, we apply only one PIT bracket with a positive marginal PIT rate. Where tax allowances are not set as absolute amounts, they can be expressed as an 'additional layer' of social security contributions. To test the validity and accuracy of the proposed algorithm, we created a synthetic sample of 10,000 taxpayers with normally distributed gross income, where the mean gross income was 50,000 mu (monetary units) and the standard deviation was 11,500 mu. We assumed the following tax parameters:

1. The PIT schedule includes three brackets:
   • 0 - 20,000 mu, a 15% marginal PIT rate;
   • 20,000 - 50,000 mu, a 25% marginal PIT rate;
   • over 50,000 mu, a 45% marginal PIT rate.

2. The social security contributions schedule includes three brackets:
   • 0 - 10,000 mu, a 17% marginal rate;
   • 10,000 - 40,000 mu, a 20% marginal rate;
   • over 40,000 mu, a 0% marginal rate.
3. Social security contributions as a proportion of the gross income were set at 22%.

4. Social security contributions as an absolute amount were set at 500 mu.

5. Tax allowances were set at an absolute amount of 2,000 mu.

6. The amounts of tax credits were given as follows: 13% of the gross income, 6% of the initial PIT, or 200 mu.

                                            Schedule     SSC as a proportion   SSC as an
                                            for SSC      of gross income       absolute amount
System without tax credits                  I            II                    III
Tax credit as a proportion of initial PIT   IV           V                     VI
Tax credit as a proportion of gross income  VII          VIII                  IX
Tax credit as an absolute amount            X            XI                    XII

Table 3: Summary of tax rule combinations (the corresponding equations are given in Table 2).

These parameters were applied to the entire population of taxpayers according to the general procedure for taxing gross income (Table 1) and the specific combinations of tax rules from Table 3. In the first step, we generated the amount of gross income for each taxpayer. In the second step, we calculated the net income according to combinations I to XII (from Table 3) of the tax rules, as is done in practice by the tax authorities. In the third step, we applied the proposed grossing-up algorithm to combinations I-XII for each taxpayer. Finally, we compared the grossed-up income with the initial gross income. The comparison of the gross income, calculated from the net income using the grossing-up algorithm, with the initial gross income demonstrates the complete accuracy of the algorithm for all income types. As an example, we can repeat the steps for an individual taxpayer with a gross income equal to 49,433.10 mu. In the second step, we calculated the net income amounts for all 12 combinations of the tax rules (see Table 4).

                                            Schedule          SSC as a proportion   SSC as an
                                            for SSC           of gross income       absolute amount
System without tax credits                  33,799.80 (I)     31,418.30 (II)        39,199.80 (III)
Tax credit as a proportion of initial PIT   34,275.80 (IV)    31,846.70 (V)         39,783.80 (VI)
Tax credit as a proportion of gross income  40,226.10 (VII)   37,844.60 (VIII)      45,626.10 (IX)
Tax credit as an absolute amount            33,999.80 (X)     31,618.30 (XI)        39,399.80 (XII)

Table 4: Net income for a chosen taxpayer with G = 49,433.10 mu.

For each net income from Table 4, we applied the grossing-up algorithm (i.e. the equations from Table 2). According to the technique, several gross income candidates were calculated for each of these net incomes. Due to space limitations, we (arbitrarily) present here the gross income candidates for net income VII:

G^VII =
  48,465.20   50,134.40   48,465.20
  49,907.60   51,371.40   49,907.60
  47,926.10   49,433.10   47,926.10

To each of these gross income candidates we applied the taxation rules (in this case the combination of taxation rules VII) and calculated the net income:

N^VII =
  39,374.40   40,843.20   39,374.40
  40,643.70   41,931.80   40,643.70
  38,900.00   40,226.10   38,900.00

By comparing the elements of matrix N^VII with the net income for the combination of tax rules VII from Table 4, which equals 40,226.10, we identify the matching element in the third row and the second column. The corresponding gross income in matrix G^VII equals 49,433.10, which is identical to the initial gross income of this particular taxpayer. In other words, for this combination of tax rules (VII), the proposed grossing-up algorithm is accurate.
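The following is a minimal sketch of such a round-trip test for combination II (social security contributions as 22% of gross income, no tax credit), using the closed form (9) and the synthetic parameters listed above. Function and variable names are this sketch's own, and draws are clipped at 5,000 mu only so that every synthetic taxpayer has a positive tax base.

```python
# Minimal round-trip validation sketch for combination II, equation (9).
# Illustrative only; not the authors' test code.
import random

PIT = [(0, 20_000, 0.15), (20_000, 50_000, 0.25), (50_000, float("inf"), 0.45)]
SR, TA = 0.22, 2_000.0          # proportional SSC rate and tax allowance

def pit_amount(base: float) -> float:
    """PIT due on a given tax base under the progressive schedule."""
    return sum(r * (min(base, h) - l) for l, h, r in PIT if base > l)

def net_from_gross(gross: float) -> float:
    """Forward calculation, combination II: N = (1 - Sr)G - PIT((1 - Sr)G - TA)."""
    return (1 - SR) * gross - pit_amount((1 - SR) * gross - TA)

def gross_up(net: float) -> float:
    """Equation (9) for every PIT bracket; keep the candidate that reproduces net."""
    for b, (lb, _, trb) in enumerate(PIT):
        sigma_b = sum(r * (h - l) for l, h, r in PIT[:b])
        candidate = (net - trb * TA - trb * lb + sigma_b) / ((1 - SR) * (1 - trb))
        if abs(net_from_gross(candidate) - net) < 0.01:
            return candidate
    raise ValueError("no PIT bracket reproduces the reported net income")

random.seed(0)
sample = [max(random.gauss(50_000, 11_500), 5_000.0) for _ in range(10_000)]
assert all(abs(gross_up(net_from_gross(g)) - g) < 0.01 for g in sample)
print("round-trip exact for all", len(sample), "synthetic taxpayers")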
We repeated such tests for all 12 tax rule combinations and for 10,000 individual cases.

5 Conclusion

In this paper, we presented a detailed construction of a deterministic data imputation algorithm. In particular, we described an exact grossing-up algorithm for calculating pre-tax income from data which are only available in net (after-tax) form, and demonstrated its validity, since it leads to complete data reconstruction. Contemporary tax systems are rich in complexity, and some combinations of tax rules might not be covered by our technique. However, we believe that the general architecture of our proposal is sound and flexible enough to incorporate (with some modifications) additional, locally specific tax rules. In general, if a set of rules that relate to the variables under investigation can be assembled, researchers and policy makers can perform data imputation in a deterministic fashion and construct an algorithm for the exact analytical generation of the missing values. In future research, a framework for assessing the feasibility of such an approach could be envisioned, employing estimates of the rules' consistency and complexity on the one hand, and measures of the quality of the replicated data on the other.

References

[1] Rancourt, E. (2007). Assessing and dealing with the impact of imputation through variance estimation. Statistical Data Editing: Impact on Data Quality. New York: United Nations.
[2] Rueda, M. M., Gonzalez, S. & Arcos, A. (2005). Indirect methods of imputation of missing data based on available units. Applied Mathematics and Computation 164: 249-261.
[3] Smirlis, Y. G., Maragos, E. K. & Despotis, D. K. (2006). Data envelopment analysis with missing values: An interval DEA approach. Applied Mathematics and Computation 177: 1-10.
[4] Raghunathan, T. E., Lepkowski, J. M., van Hoewyk, J. & Solenberger, P. (2001). A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models. Survey Methodology 27: 85-95.
[5] Franklin, S. & Walker, C. (2003). Survey Methods and Practices. Ottawa: Statistics Canada.
[6] Fuest, C., Peichl, A. & Schaefer, T. (2008). Is a flat tax reform feasible in a grown-up democracy of Western Europe? A simulation study for Germany. International Tax and Public Finance 15: 620-636.
[7] Immervoll, H. & O'Donoghue, C. (2001). Imputation of gross amounts from net incomes in household surveys: an application using EUROMOD. EUROMOD Working Paper EM1/01. Colchester: ISER - Institute for Social and Economic Research.
[8] D'Amuri, F. & Fiorio, C. V. (2009). Grossing-Up and Validation Issues in an Italian Tax-Benefit Microsimulation Model. Econpubblica Working Paper 117. Milano: University of Milan.
[9] Betti, G., Donatiello, G. & Verma, V. (2011). The Siena microsimulation model (SM2) for net-gross conversion of EU-SILC income variables. International Journal of Microsimulation 4: 35-53.
[10] ISER - Institute for Social and Economic Research, https://www.iser.essex.ac.uk/euromod (April 16th, 2012).
[11] OECD - Organisation for Economic Co-operation and Development (2006). Reforming Personal Income Tax. Policy Brief, March. Paris: OECD.
[12] Zee, H. H. (2005). Personal income tax reform: Concepts, issues, and comparative country developments. IMF Working Paper 87. Washington: International Monetary Fund.
[13] Sørensen, P. B. (2005). Dual income tax: Why and how? FinanzArchiv 61: 559-586.
[14] Ivanova, A., Keen, M. & Klemm, A. (2005). The Russian 'flat tax' reform. Economic Policy 20: 397-444.
[15] Moore, D. (2005). Slovakia's 2004 tax and welfare reforms. IMF Working Paper 133. Washington: International Monetary Fund.
[16] OECD - Organisation for Economic Co-operation and Development (2013). Taxing Wages 2011-2012. Paris: OECD.

The slWaC Corpus of the Slovene Web

Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana
E-mail: tomaz.erjavec@ijs.si

Nikola Ljubešić
Faculty of Humanities and Social Sciences, University of Zagreb, Ivana Lučića 3, Zagreb, Croatia
E-mail: nikola.ljubesic@ffzg.hr

Nataša Logar
Faculty of Social Sciences, University of Ljubljana, Kardeljeva ploščad 5, Ljubljana
E-mail: natasa.logar@fdv.uni-lj.si

Keywords: language technologies, corpus linguistics, World Wide Web, Slovene language

Received: February 25, 2015

The availability of large collections of text (language corpora) is crucial for empirically supported linguistic investigations of various languages; however, such corpora are complicated and expensive to collect. In recent years corpora made from texts on the World Wide Web have become an attractive alternative to traditional corpora, as they can be made automatically, contain varied text types of contemporary language, and are quite large. The paper describes version 2 of slWaC, a Web corpus of Slovene containing 1.2 billion tokens. The corpus extends the first version of slWaC with new material and updates the corpus compilation pipeline. The paper describes the process of corpus compilation with a focus on near-duplicate removal, and presents the linguistic annotation, format and accessibility of the corpus via Web concordancers. It then investigates the content of the corpus using the method of frequency profiling, comparing its lemma and part-of-speech annotations with three corpora: the first version of slWaC, Gigafida, the one-billion-word reference corpus of Slovene, and KRES, the hundred-million-word balanced reference corpus of Slovene.

Povzetek: Dostopnost velikih zbirk besedil (jezikovnih korpusov) je nujna za empirično podprte jezikoslovne raziskave posameznih jezikov, vendar pa je izdelava takih korpusov draga in zamudna. Korpusi besedil, zajetih s spleta, so v zadnjem času postali popularen vir jezikovnih vsebin, saj jih lahko zgradimo avtomatsko, vsebujejo pester nabor sodobnih besedilnih zvrsti in so zelo veliki. Prispevek predstavlja drugo različico korpusa slWaC, spletnega korpusa slovenščine, ki vsebuje 1,2 milijardi pojavnic. Korpus dopolnjuje prvo različico slWaC z novimi besedili, pridobljenimi z izboljšanimi orodji za zajem. V prispevku opišemo izdelavo korpusa s poudarkom na odstranjevanju podobnih vsebin ter jezikoslovno označevanje, format korpusa in njegovo dostopnost prek konkordančnika. Nato raziščemo vsebino korpusa z uporabo metode frekvenčnega profila, pri katerem leme in oblikoskladenjske oznake druge različice korpusa slWaC primerjamo s tremi korpusi: s prvo različico korpusa slWaC, z referenčnim korpusom Gigafida, ki vsebuje milijardo besed, in s stomilijonskim referenčnim uravnoteženim korpusom KRES.

1 Introduction

Large collections of digitally stored and uniformly encoded texts - language corpora - have for a number of years been the basic data resources that linguists, including lexicographers, have used for their investigations into language and for making dictionaries. However, the traditional way of compiling corpora, which involved acquiring texts, existing in many disparate formats, from authors and publishers, was very expensive in terms of time and labour.
With the advent of the Web, a vast new source of linguistic information has emerged. The exploitation of this resource has especially gained momentum with the WaCky initiative [1], which has popularised the concept of "Web as Corpus". It has also made available tools for compiling such corpora and produced large WaC corpora for a number of major European languages. Now such corpora are also being built for the so-called smaller languages, such as Norwegian [8], Czech [18] and Serbian [11], moving the concept of a "large corpus" for smaller languages up to the 1 billion token frontier. As Web corpus acquisition is much less controlled than that for traditional corpora, the necessity of analysing their content gains in significance. The linguistic quality of the content is mostly explored through word lists and collocates [1], while the content itself is explored using unsupervised methods, such as clustering and topic modelling [17]. For Slovene, a Web corpus has already been built [12]. However, the first version of slWaC (hereafter slWaC1) was rather small, as it contained only 380 million words. Furthermore, it contained domains from the Slovene top-level domain only, i.e. only URLs ending in ".si" were harvested. In the meantime, hrWaC, the Croatian Web corpus, had already moved to version 2, touching the 2 billion token mark, and Web corpora for Serbian and Bosnian were built as well [11], all of them passing the size of slWaC1, making it high time to move forward also with slWaC. This paper presents version 2 of slWaC (hereafter slWaC2), which tries to overcome the limitations of slWaC1: it extends it with a new crawl, which also includes well-known Slovene Web domains from other top-level domains, and introduces a new pipeline for corpus collection and cleaning, resulting in a corpus of 1.2 billion tokens with removed near-duplicate documents and flagged near-duplicate paragraphs. The rest of the paper is structured as follows: Section 2 presents the corpus construction pipeline, Section 3 introduces the linguistic annotation of the corpus, its format and its availability for on-line concordancing, Section 4 investigates the content of the corpus by comparing it to slWaC1, to the balanced corpus of Slovene KRES, and to the reference corpus of Slovene Gigafida, while Section 5 gives some conclusions and directions for future work.

2 Corpus construction

2.1 Crawling

For performing the new crawl we used the SpiderLing crawler1 with its associated tools for guessing the character encoding of a Web page, content extraction (boilerplate removal), language identification and near-duplicate removal [19]. The SpiderLing crawler uses the notion of yield rate to optimize the crawling process with regard to the amount of unique textual material retrieved given the overall amount of data retrieved. The yield rate is calculated for each Web domain as the ratio of the bytes of text contributed to the final corpus to the bytes retrieved from that domain. Web domains with a yield rate under a predefined threshold are discarded from further crawling, thereby focusing the remaining crawl on the domains where more unique textual material is to be found.
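A minimal sketch of such yield-rate-based filtering is given below; the threshold value and the data passed in are illustrative assumptions of this sketch, not SpiderLing's actual implementation.

```python
# Minimal sketch of yield-rate-based domain filtering as described above.
# The threshold and parameter names are illustrative assumptions.
def should_blacklist(text_bytes_kept: int, bytes_downloaded: int,
                     threshold: float = 0.01) -> bool:
    """Stop crawling a domain whose yield rate falls below the threshold."""
    if bytes_downloaded == 0:
        return False
    yield_rate = text_bytes_kept / bytes_downloaded
    return yield_rate < threshold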
SpiderLing has two predefined yield rates that control when a low-yield-rate Web domain is blacklisted; we used the lower one which is recommended for smaller languages. As seed URLs we used the home pages of Web domains obtained during the construction of slWaCi and additionally 30 well known Slovene Web domains, which are outside the .si top-level domain. The crawl was run for 21 days, with 8 cores used for document processing, which includes guessing the text encoding, text extraction, language identification and physical duplicate removal, i.e. removing copies of identical pages which appear under different URLs. After the first 14 days there was a significant decrease in computational load, showing that most of the domains had been already harvested and that the process of exhaustively collecting textual data from the extended Slovene top-level domain was almost finished. After completing the crawling process, which already included document preprocessing, we merged the new crawl with slWaCi . We added the old dataset to the end of the new one, thereby giving priority to new data in the following process of near-duplicate removal. It should be noted that the corpus can, in cases when the content has changed, contain two texts with the same URL but with different crawl dates. 2.2 Near-duplicate removal We performed near-duplicate identification both on the document and the paragraph level using the onion tool2 with its default settings, i.e. by calculating 5-gram overlap and using the 0.5 duplicate content threshold. We removed the document-level near-duplicates entirely from the corpus, while keeping paragraph-level near-duplicates, labelling them with a binary attribute on the
paragraph
element. This means that the corpus still contains the (near-)duplicate paragraphs, which is advantageous for showing contiguous text from Web pages; but if, say, language modelling for statistical machine translation were to be performed [10], near-duplicate paragraphs can easily be removed. The resulting size of the corpus (in millions of tokens) after each of the three duplicate removal stages is given in Table 1. We compare these numbers to the ones obtained on the Croatian, Bosnian and Serbian domains [11], showing that the second versions of the corpora (hrWaC and slWaC), which merge two crawls obtained with different tools and collected three years apart, show a smaller level of reduction (around 30%) at each step of near-duplicate removal, while the first versions of the corpora (bsWaC and srWaC), obtained with SpiderLing only and in one crawl, suffer more data loss in this process (around 35-40%).

1 http://nlp.fi.muni.cz/trac/spiderling
2 https://code.google.com/p/onion/

          PHY     DND     PND     R1     R2
slWaC2    1,806   1,258   895     0.31   0.29
hrWaC2    2,686   1,910   1,340   0.29   0.30
bsWaC1    722     429     288     0.41   0.33
srWaC1    1,554   894     557     0.42   0.37

Table 1: Sizes of the Web corpora in millions of tokens after removing physical duplicates (PHY), document near-duplicates (DND) and paragraph near-duplicates (PND), with the reduction ratios (R1 and R2) after the DND and subsequent PND steps.

2.3 Linguistic annotation

slWaC2 was tagged and lemmatised with ToTaLe [4], trained on JOS corpus data [5]. However, it should be noted that ToTaLe has been slightly updated in the meantime, so in particular the tokenisation of slWaC1 and slWaC2 at times differs. The morphosyntactic descriptions (MSDs) with which the words of the corpus are annotated follow the JOS MSD specifications; however, these do not define a tag for punctuation. As practical experience has shown this to be a problem, we have introduced a punctuation category and MSD, named "Z" in English and "U" in Slovene.

3 Overview of the corpus

3.1 Size of the corpus

Table 2 gives the size of slWaC2, showing separately the amount of information from the 2011 crawl, from the 2014 crawl, and overall. For each of the counted elements we give the size of the corpus after removing document near-duplicates (DND from Table 1), and for the corpus which also has paragraph near-duplicates removed (PND). Starting with the number of domains, it can be seen that the new crawl produced fewer domains than the first one, due to a large number (of the complete space of URLs) of static domains being removed in the physical deduplication stage (PHY). Nevertheless, the complete corpus has, in comparison to slWaC1, about 12,000 new domains. Observing the URLs, we note that the new crawl gave somewhat fewer URLs than the old one, and that there is little overlap between the two, i.e. about 1%: 28,315 URLs are the same in both crawls, which means that their content has changed in the last three years (these texts are then distinguished in the corpus by a different crawl date). Regarding the number of paragraphs, we give the numbers for both DND and PND, with the reduction being very similar to the reduction on the token level already given in Table 1, i.e. 29%. For paragraphs, sentences, words and tokens, the complete corpus is simply the sum of the items from each of the two crawls. The most important numbers are the sizes of the complete corpus in tokens, i.e.
1.25 billion words for the DND and 900 million for PND, which makes the corpus almost as large as Gigafida [13], the largest corpus of Slovene to date. 3.2 Corpus format The annotated corpus is stored in the so called vertical format, used by many concordancing engines. This is an XML-like format in that it has opening and closing or empty (structural) XML tags, but the tokens themselves are written one per line, with the first (tab separated) column giving the token (word or punctuation) itself, the second (in our case) its lemma (or, for punctuation, again the token), the third its MSD in English and the fourth the MSD in Slovene, as illustrated by Figure 1.

Izmed      izmed    Sg         Dr
vseh       ves      Pg-mpg     Zc-mmr
,          ,        Z          U
ki         ki       Cs         Vd
boste      biti     Va-f2p-n   Gp-pdm-n
delili     deliti   Vmpp-pm    Ggnd-mm
video      video    Ncmsan     Sometn

Figure 1: Vertical format of the annotated slWaC2 (columns: token, lemma, English MSD, Slovene MSD).

The example also shows a few other features of the encoding. Each text is given its URL, the domain of this URL and the year (2011 or 2014) in which it was crawled. Boilerplate removal often deletes linguistically uninteresting texts from the start (and end) of the document, which is marked by the empty gap element, which also gives the approximate extent of the text removed. The paragraphs are marked by their type, which can be "heading" or "text", while the "duplicate" attribute tells whether the paragraph is a (near-)duplicate of some other paragraph in the corpus, in which case its value is "1", and "0" otherwise. Finally, we also have the empty "glue" element g, which can be used to suppress the space between two adjacent tokens when displaying the corpus.

3.3 Availability

The corpus is mounted under the noSketchEngine concordancer [15] installed at nl.ijs.si/noske. The concordancer allows for complex searches in the corpus, from concordances taking into account various filters, to frequency lexica over regular expressions.

                      2011          2014          All
Domains               25,536        22,062        37,759
URLs (DND)            1,528,352     1,295,349     2,795,386
Paragraphs (DND)      7,535,453     18,303,123    25,838,576
           (PND)      6,325,075     10,329,692    16,654,767
Sentences (DND)       22,615,610    50,693,747    73,309,357
          (PND)       19,001,653    31,560,289    50,561,942
Words (DND)           360,273,022   718,332,186   1,078,605,208
      (PND)           301,547,669   465,780,456   767,328,125
Tokens (DND)          421,178,853   837,727,874   1,258,906,727
       (PND)          352,474,874   542,912,192   895,387,066

Table 2: Size of the slWaC 2.0 corpus.

We also make the corpus available for download, but not directly, mainly due to questions of personal data protection. Namely, the corpus contains most of the Slovene Web, at least in the .si domain, so it also contains a lot of personal names with accompanying text. This is not such a problem with the concordancer, as similar results on Web-accessible personal names can also be obtained by searching through Google or the Slovene search engine Najdi.si. However, being able to analyse the complete downloaded corpus enables much more powerful information extraction methods to be used, potentially leading to abuse of personal data. This is why we make the corpus available for research only, and require a short explanation of the use it will be put to. However, we make available the metadata of the corpus, in particular the list of URLs included in it, which enables others to make their own corpus on this basis.

4 Comparative corpus analysis

This section investigates how different the slWaC2 corpus is from its predecessor slWaC1 and from two other corpora of Slovene [13]: the balanced reference corpus KRES, which contains 100 million words, and the reference corpus Gigafida, which contains 1.2 billion words, mostly (77%) from printed periodicals created between 1990 and 2011. The KRES corpus was sampled from Gigafida and has roughly the following structure: 35% books, 40% periodicals and 20% Internet. To establish how different these corpora are we used the method of frequency profiling [14]. We first made a frequency lexicon of the annotation under investigation (lemma or grammatical description) for slWaC2 and the corpus it was compared with, and then for each item in this lexicon computed its log-likelihood (LL).
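Under the usual two-corpus formulation of this statistic [14], the score can be computed as in the following minimal sketch; the expected frequencies are assumed proportional to the corpus sizes, and the function and variable names, as well as the example counts, are this sketch's own.

```python
# Minimal sketch of the two-corpus log-likelihood (LL) score used in
# frequency profiling. Illustrative only.
import math

def log_likelihood(freq1: int, size1: int, freq2: int, size2: int) -> float:
    """LL of an item with frequencies freq1/freq2 in corpora of size1/size2 tokens."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2.0 * ll

# Illustrative call with made-up counts: an item seen 19 times in a
# 380-million-token corpus and 9,000 times in a 900-million-token corpus.
print(round(log_likelihood(19, 380_000_000, 9_000, 900_000_000)))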
The formula takes into account the two frequencies of the item as well as the sizes of the two corpora being compared; the greater the LL, the more specific the item is to one of the corpora. To illustrate, we give in Table 3 the first 15 lemmas with their LL score and their frequency per million words in slWaC1 and slWaC2, with the larger frequency in bold.

Lemma       LL       slWaC1 pm   slWaC2 pm
člen        30,366   0.131       0.282
foto        23,092   0.018       0.081
m2          22,826   0           0.033
biti        22,767   76,984      74,493
O           21,447   0.001       0.036
3d          17,738   0           0.026
spoštovan   11,177   0.019       0.059
2x          11,092   0           0.016
tožnik      9,909    0.008       0.036
odstotek    9,265    0.515       0.393
co2         9,090    0           0.013
amandma     8,992    0.007       0.031
hvala       8,954    0.106       0.173
1x          8,505    0           0.012
ekspr       8,373    0           0.012

Table 3: The first 15 lemmas with the highest log-likelihood scores and their frequency per million words for the comparison of the old and new version of slWaC.

As can be noted, most of these highest-LL lemmas are more prominent in slWaC2; only "biti" (to be) and "odstotek" (percent) are more frequent in slWaC1. Furthermore, quite a few lemmas have frequency 0 in slWaC1. This is indicative of a difference in annotation between the two corpora: as mentioned, the tokenisation module of ToTaLe has been somewhat improved lately, which is evidenced in the fact that strings such as "m2" and "3d" were wrongly split into two tokens in slWaC1 but are kept as one token in slWaC2. It is a characteristic of LL scores that they reveal such divergences, which should ideally be fixed in order to arrive at a uniform annotation of the resources.

4.1 Lemma comparison with slWaC1

The motivation behind comparing the previous and current version of slWaC was primarily to investigate what kinds of text types are better represented in the new (or old) version of the corpus. Apart from the already mentioned differences in tokenisation, slWaC2 is more prominent in three types of lemmas (texts). First, there are legal texts, characterised by lemmas such as "člen" (article), "odstavek" (paragraph), "amandma" (amendment) and "tožnik" (plaintiff), which come predominantly from governmental domains, e.g. for "člen" mostly from uradni-list.si (official gazette), dz-rs.si (parliament) and sodisce.si (courts). Second are texts that address the reader (or, say, the parliamentary speaker) directly, with lemmas such as "spoštovan" (honoured), "pozdravljen" (hello) and "hvala" (thank you). For "spoštovan", the most highly ranked domains are, again, the parliament, i.e. dz-rs.si, followed by vizita.si (the medical help page of the commercial POP.TV) and delo.si (the main Slovene daily newspaper), while "pozdravljen" and "hvala" come mostly from user forums. The corpus slWaC2 is thus more representative of text-rich domains whose content changes rapidly and which contain user-generated content. Third, the list contains two interesting "lemmas" with very high LL scores. The first is "ekspr" (only 19 occurrences in slWaC1 but more than 9,000 in slWaC2), which is the (badly tokenised) abbreviation "ekspr." meaning "expressive". It turns out that practically the only domain that uses this abbreviation is bos.zrc-sazu.si, i.e. the portal serving the monolingual Slovene dictionary SSKJ, which was newly harvested in slWaC2. Similarly, the word "ino" (fewer than 500 occurrences in slWaC1 but more than 7,000 in slWaC2) turns out to be the historical form of "in" (and). Practically the only domain containing this word (6,000x) is nl.ijs.si, which now hosts a large library of old Slovene books.
The new slWaC thus contains some extensive new types of texts coming from previously unharvested domains or domains that have had large amounts of new content added. Finally, it is worth mentioning that the first proper noun in slWaC2 appears only at position 36 in the LL list, and is "bratušek" with almost 6,000 occurrences, referring to Alenka Bratušek, the former (2013 - 2014) PM of Slovenia. It is also instructive to see which lemmas are now less specific against slWaC 1. Among function words, there is less conjunction "pa" used either as an informal version of "in" (and) or as an adversary conjunction but, and there is less of "da" (that), used to introduce relative clauses. The drop in the frequency of the conjunction "pa" seems to have a link in the increase of the conjunction "in" (and) which now demonstrates more than 22 million occurrences. Significantly lower appearance of "da" can be explained by the fact that verbs such as "dejati" (to say), "porocati" (to report), "pojasniti" (to explain), "povedati" (to tell), "sporočiti" (to communicate), and "napovedati" (to predict), which are usually followed by the conjunction "da" are now much less used in slWaC2. Those verbs are typical for news reporting and the drop in their usage indicates a drop in the proportion of news items in the corpus. Most of the bottom part of the LL list, of course, consists of nouns and adjectives - and all of them again confirm that harvesting of texts for slWaC2 was much less focused on news portals than for the previous one. Namely, as a previous frequency profiling of Gigafida and KRES shows [2] lemmas like "odstotek" (percent); "milijon" (milion), "evro" (euro), "dolar" (dollar), "tolar" (former Slovene currency); "predsednik" (president), "premier" (prime minister), "država" (state), "minister" (minster); "ameriški" (American), "britanski" (British), "hrvaški" (Croatian), "nekdanji" (former), "leto" (year), "lani" (last year), and "zdaj" (now) all typically appear in daily newspapers (or, in our case, on news portals) reporting on interior and international affairs - and, as mentioned, we found all of them at the bottom of the LL list, indicating less news in slWaC2 than in slWaCi and also the shifting of major news topics (for instance from Kosovo and Iraq). 4.2 Lemma comparison with KRES With slWaC2, as with Web corpora in general, it is an interesting question of how representative and balanced they are. The easiest approach towards an answer to this question is a comparison with "traditional" reference corpora, and such experiments have been already performed, e.g. between the British Web corpus ukWaC and BNC, the British National Corpus [1]. The comparisons have shown that while Web corpora are different from classical corpora, which contain mostly printed sources, the differences are in general not great and so they can function as modern-day reference corpora. We made a comparison between slWaC2 and KRES [13], the balanced reference corpus of Slovene with 100 million words. The comparison shows that, as with slWaC 1, some of the differences are due to the different linguistic analyses. As mentioned, slWaC2 was processed with ToTaLe, while KRES used the Obeliks tokeniser, tagger and lemmatiser [7], and the two disagree in some lemmatisations, the most prominent being "veliko/vec" (much), "mogoče/mogoč" (possible), "edini/edin" (only), "desni/desen" (right), "levi/lev" (left), "volitve/volitev" (elections), as well as some differences in tokenisation, e.g. "le-ta" and "d.o.o." 
as one token or three. Real linguistic differences concern mostly two types of lemmas. The first are highly ranked non-content words such as "pa, tudi, sicer, ter, naš" (but, also, otherwise, and, our), which most likely show the bias of slWaC2 texts to antithetical and intensifying sentences, sentences with binding clause elements (adducing), and sentences which either (a) describe characteristics of the institution representing itself on the Web - "naši programi, naša spletna stran" (our programmes, our Web page); (b) establish a common communication circle [9, 6] - "naša dežela, naši plezalci" (our country, our climbers), or (c) include readers into a text - "naša duhovna rast, naša pot" (our spiritual growth, our path). The second are content lemmas, which fall into several groups: "spleten" (Web), "podjetje" (company), "tekma, ekipa" (match, team), "sistem, uporabnik, aplikacija" (system, user, application), and "blog" (blog), i.e. slWaC2 has more commercial, sports, and computer related texts, and, of course, text specific to the Web (blogs). Conversely, KRES shows more lemmas to do with legal texts, such as "člen, odstavek, zakon" (article, paragraph, law), so that even with slWaC2 having more texts of this type than slWaCi, it still has much less than KRES. KRES also has a specific group of lemmas, thematis-ing a person in relation to another person, e.g. "mama, oCe, mož, žena" (mother, father, husband, wife), and verbs characteristic for interpersonal communication - "vprašati, nasmehniti se, prikimati, zasmejati se" (to ask, to smile, to nod, to laugh). All these mostly come from fiction books in KRES. Two more specific lemmas are worth mentioning: "tolar" (former Slovene currency) shows that KRES, unlike slWaC2, contains texts dating before 2007 (the changeover year to the euro in Slovenia), while "wallander", the hero of a series of detective novels, shows that KRES - at least in this instance - has too much text from a single source, here a book series. 4.3 Lemma comparison with Gigafida Not surprisingly, a comparison between slWaC2 and the Gigafida corpus showed rather similar results to the comparison between slWaC2 and KRES. The top part of the list again contains content lemmas like "spleten" (Web), "aplikacija" (application), "blog", "uporabnik" (user), "facebook", "sistem" (system), etc., indicating slWaC2 has more computer and Web related texts. However, the interesting part is the part where the two LL lists differ. First, it is obvious the Gigafida korpus has more sport related texts than KRES, therefore lemmas like "tekma" (game), "ekipa" (team), "rezultat" (result), "sezona" (season), "trening" (training) and "zmaga" (victory) are less prominent in slWaC2and in KRES. The lemma "podjetje" (company) has a much lower LL score now as well, showing it is thematised in Gigafida in a larger proportion of texts than in KRES. Lemmas that are specific to slWaC2 when we compare it to Gigafida (and not, when we compare it to KRES) are mostly non-content words, such as conjunctions "in, ali, ce" (and, or, if), personal pronouns "jaz, ti" (I, you), and possessive personal pronouns "moj, tvoj, vaš" (my, yoursg, your pi), which show slWaC2 contains more first and second person related contents most likely coming from user generated texts. 
The lowest part of the LL list shows lemmas specific to Gigafida, indicating Gigafida's bias towards news reporting texts thematising internal affairs, economy, and crime: "predsednik" (president), "minister" (minister), "vlada" (government), "občina" (municipality), "prodati" (to sell), "direktor" (manager), "milijon" (million), and "policist" (police officer), cf. [2].
4.4 Grammatical comparison with KRES
Apart from lemmas, it is also interesting to compare how the distribution of morphosyntactic categories of slWaC2 differs from that of KRES. To this end we calculated six LL comparison scores, for uni-, bi- and trigrams of part-of-speech (PoS) and of complete morphosyntactic descriptions (MSDs). The unigram PoS LL scores show that slWaC2 has significantly more adjectives, unknown words, conjunctions, prepositions and particles, in this order. However, it has much less punctuation, far fewer numerals, and slightly fewer interjections. Especially with unknown words and punctuation the differences might be, at least partially, an artefact of different annotation programs. For the others, the results show that slWaC2 tends more towards informal, user generated language (a typical lemma for this register is "lp", meaning "lep pozdrav" (best regards), placed at position 20 in the LL list), although this conclusion is somewhat offset by the fact that it has fewer interjections. However, tagging interjections is notoriously imprecise, and the difference here might also be due to the different taggers used. Conversely, KRES with its numerals shows a preponderance of newspaper texts, which tend to use lots of dates, times, amounts, and sports scores. PoS bigrams again highlight the different annotation tools used. The most prominent combination in slWaC2 is a numeral followed by an abbreviation, e.g. "90 EUR, 206 kW, 298,80 m2", but this difference is due to the fact that in slWaC2 "EUR", "kW" etc. are treated as abbreviations, whereas they are common nouns in KRES. The same reasoning applies to combinations with punctuation. However, there are also legitimate combinations in the top scoring LL PoS bigrams: slWaC2 has more noun + verb, adjective + noun and verb + adjective combinations, while KRES has more numeral + numeral, numeral + noun and verb + verb combinations. Scores for PoS trigrams give little new information: apart from annotation differences, the most prominent slWaC2 combination is noun + noun + verb, which is mostly name + surname + predicate, e.g. "Oto Pestner naredil", while the most prominent for KRES is a sequence of three numerals. As for MSDs, the differences in unigrams in favour of slWaC2 are greatest for the three unknown word types that KRES does not use (Xf: foreign word, Xp: program mistake and Xt: typo), followed by general adverbs in the positive degree, coordinating conjunctions, the present tense first person plural auxiliary verb ("smo") and the animate common masculine singular noun in the accusative, i.e. the object of a sentence, e.g. "otroka". Conversely, KRES has much more punctuation, more digits, more common masculine and feminine singular nouns in the nominative (i.e. subjects) and more general adverbs in the comparative and superlative degrees. Bigrams show that slWaC2 has many more general adjective + common noun combinations in various genders and cases, while KRES has many more combinations with digits. The space of MSD trigrams is very large and, if we discount the combinations appearing as a result of different annotations, it does not show very interesting differences.
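All of the comparisons in this section rely on frequency profiling with the log-likelihood (LL) statistic [14]. Purely as an illustration, a minimal sketch of how such an LL score might be computed for a single lemma is given below; the function name and the example counts are assumptions for the sake of the example, not values taken from the corpora discussed here.

```python
import math

def log_likelihood(freq1: int, freq2: int, size1: int, size2: int) -> float:
    """Log-likelihood (G2) score for one lemma observed freq1 times in a
    corpus of size1 tokens and freq2 times in a corpus of size2 tokens,
    in the spirit of frequency profiling (Rayson and Garside)."""
    # Expected frequencies under the assumption of equal relative frequency.
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # 0 * log(0) is treated as 0
            ll += observed * math.log(observed / expected)
    return 2.0 * ll

# Illustrative numbers only: a lemma seen 6,000 times in a 900M-token corpus
# and 1,200 times in a 600M-token corpus.
print(round(log_likelihood(6000, 1200, 900_000_000, 600_000_000), 2))
```

A high score signals that the lemma's relative frequency differs strongly between the two corpora, which is exactly what the top and bottom parts of the LL lists above capture.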
5 Conclusion The paper presented a new version of the Slovene Web corpus, which is almost three times larger than its initial version and is made available through a powerful and freely accessible concordancer. During the construction process we focused on the content reductions obtained through near-duplicate removal, showing that both reductions to document and paragraph level remove a similar amount of content. We also compared the content of the slWaC2 corpus to three other Slovene corpora (the slWaCi corpus, the balanced reference corpus KRES and the reference corpus Gigafida) with frequency profiling on lemmas and grammatical descriptions. This comparison showed that the new version of the corpus has significantly more legal texts and specific text types, such as a dictionary and a library of historical books and (comparatively) less news. In the lemma comparison with KRES it has less legal texts but more user generated content and more commercial, sports, political and computer related texts. The comparison with Gigafida again showed slWaC2 has more computer and Web related texts, while in this case sports and commercial news were no longer slWaC2 specific. A larger proportion of several personal pronouns indicated a significant difference in the extent of the user generated content between the two corpora as well. The comparison of grammatical categories also shows a bias to informal writing and against newspaper items. But maybe the most surprising (although, in retrospect, quite logical) insight of the comparison using frequency profiling is that it is a very good tool to detect even slight differences in the processing pipelines used for the compared corpora, which then lead to significant differences in the (token, lemma and MSD) vocabularies. There are several directions that our future work could take. First, by constructing the second version of two out of four existing Web corpora of South Slavic languages, two ideas have emerged: one is to build a multilingual corpus consisting of all South Slavic languages, and the second to develop a monitor corpus which would be automatically extended with new crawls in predefined time frames. The second direction is in the annotation of the corpus, where more effort should be invested in developing a gold standard processing pipeline, which could then be used to re-annotate the Slovene corpora in a unified manner. In addition, given that the Web contains a significant portion of user generated content containing non-standard language, the annotation pipeline should be extended by introducing a standardisation (normalisation) step on word-forms, similar to our approach to modernisation of historical Slovene words [16], which would then give better lemmas and MSDs, allowing for easier exploration of Web corpora. As to the slWaC2 functioning as a modern-day reference corpus of Slovene, the analysis showed considerable differences in the three corpora. In the future we therefore intend to supplement the results of the lemma comparison with the results of the topic modelling method [3, 17, 2]. From the assembled data of both methods we will be able to estimate more precisely which texts each corpus contains and, perhaps even more importantly, which texts each corpus misses. We believe the building of the next generation reference corpus of Slovene could in this way greatly benefit from the slWaC2 corpus - its contents as well as its construction methodology. 
References [1] Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. The WaCky wide Web: a collection of very large linguistically processed Web-crawled corpora. Language Resources and Evaluation, 43(3):209-226, 2009. [2] Nataša Logar Berginc and Nikola Ljubešic. Gigafida in slWaC: tematska primerjava. Slovenščina 2.0, 1(1):78-110, 2013. [3] Michael I. Jordan David M. Blei, Andrew Y. Ng and John Lafferty. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003. [4] Tomaž Erjavec, Camelia Ignat, Bruno Pouliquen, and Ralf Steinberger. Massive multilingual corpus compilation: Acquis Communautaire and ToTaLe. Archives of Control Sciences, 15(3):253-264, 2005. [5] Tomaž Erjavec and Simon Krek. The JOS Mor-phosyntactically Tagged Corpus of Slovene. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, 2008. [6] Monika Kalin Golob and Nataša Logar. Prostor v porocevalskem skupnem sporocanjskem krogu. Slavistična revija, 62(3):363-373, 2014. [7] Miha Grcar, Simon Krek, and Kaja Dobrovoljc. Obeliks: statisticni oblikoskladenjski oznacevalnik in lematizator za slovenski jezik. In Zbornik Osme konference Jezikovne tehnologije, Ljubljana, 2012. Jožef Stefan Institute. [8] Emiliano Guevara. NoWaC: A Large Web-based Corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web As Corpus Workshop, WAC-6 '10, pages 1-7, 2010. [9] Tomo Korošec. Stilistika slovenskega porocevalstva. Kmecki glas, Ljubljana, 1998. [10] Nikola Ljubešic and Antonio Toral. caWaC - a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, may 2014. European Language Resources Association (ELRA). [11] Nikola Ljubešic. {bs,hr,sr}WaC: Web corpora of Bosnian, Croatian and Serbian. In Proceedings of the WAC-9 Workshop, 2014. [12] Nikola Ljubešic and Tomaž Erjavec. hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene. In Ivan Habernal and Väclav Matousek, editors, Text, Speech and Dialogue - 14th International Conference, TSD 2011, Pilsen, Czech Republic, September 1-5, 2011. Proceedings, Lecture Notes in Computer Science, pages 395-402. Springer, 2011. [13] Nataša Logar, Miha Grcar, Marko Brakus, Tomaž Erjavec, Špela Arhar Holdt, and Simon Krek. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Zbirka Sporazumevanje. Trojina, zavod za uporabno slovenis-tiko: Fakulteta za družbene vede, Ljubljana, 2012. [14] Paul Rayson and Roger Garside. Comparing Corpora Using Frequency Profiling. In Proceedings of the Workshop on Comparing Corpora, pages 1-6. Association for Computational Linguistics, 2000. [15] Pavel Rychly. Manatee/bonito - a modular corpus manager. 1st Workshop on Recent Advances in Slavonic Natural Language Processing, pages 65-70, 2007. [16] Yves Scherrer and Tomaž Erjavec. Modernizing historical Slovene words with character-based SMT. In BSNLP 2013 - 4th Biennial Workshop on Balto-Slavic Natural Language Processing, Sofia, Bulgaria, 2013. [17] Serge Sharoff. Analysing Similarities and Differences between Corpora. In Proceedings of the Seventh Conference on Language Technologies, pages 511, Ljubljana, 2010. Jožef Stefan Institute. [18] Drahomfra Spoustovä, Miroslav Spousta, and Pavel Pecina. Building a Web Corpus of Czech. 
In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010. European Language Resources Association (ELRA). [19] Vit Suchomel and Jan Pomikálek. Efficient Web Crawling for Large Text Corpora. In Serge Sharoff and Adam Kilgarriff, editors, Proceedings of the Seventh Web as Corpus Workshop (WAC7), pages 39-43, Lyon, 2012.
A Rule-Based System for Automatic De-identification of Medical Narrative Texts
Jelena Jacimovic 1,2, Cvetana Krstev 1 and Drago Jelovac 2
1 University of Belgrade, Faculty of Philology, Studentski trg 3, 11000 Belgrade, Serbia
2 University of Belgrade, School of Dental Medicine, Dr. Subotica 8, 11000 Belgrade, Serbia
E-mail: jjacimovic@rcub.bg.ac.rs, cvetana@matf.bg.ac.rs, drago.jelovac@stomf.bg.ac.rs
Keywords: named entity recognition, finite-state transducers, rule-based system, de-identification, medical narrative texts, Serbian
Received: May 7, 2014
This paper presents an automatic de-identification system for Serbian, based on the adaptation of an existing rule-based named entity recognition system. Built on a finite-state methodology and lexical resources, the system is designed to detect and replace all the explicit personal protected health information present in medical narrative texts, while still preserving all the relevant medical concepts. The results of a preliminary evaluation demonstrate the usefulness of this method both in preserving patient privacy and in preserving the interoperability of the de-identified documents.
Povzetek: Razvit je nov sistem za de-identifikacijo besedil v srbskem jeziku.
1 Introduction
Current advances in health information technology enable health care providers and organizations to automate most aspects of patient care management, facilitating the collection, storage and usage of patient information. Such information, stored in the form of electronic medical records (EMRs), represents accurate and comprehensive clinical data valuable as a vital resource for secondary uses such as quality improvement, research, and teaching. Besides the vast amount of useful information, the narrative clinical texts of an EMR also include many items of patient identifying information. For both ethical and legal reasons, when confidential clinical data are shared and used for research purposes, it is necessary to protect patient privacy and remove patient-specific identifiers through a process of de-identification. De-identification is focused on detecting and removing or modifying all explicit personal Protected Health Information (PHI) present in medical or other records, while still preserving all the medically relevant information about the patient. Various standards and regulations for health data protection define multiple directions to achieve de-identification, but the most frequently referenced regulation is the US Health Insurance Portability and Accountability Act (HIPAA) [1]. According to the HIPAA "Safe Harbor" approach, clinical records are considered de-identified when 18 categories (17 textual and one regarding images) of PHI are removed, and the remaining information cannot be used alone or in combination with other information to identify an individual. These PHI categories include names, geographic locations, elements of dates (except year), telephone and fax numbers, and medical record numbers or any other unique identifying numbers, among others.
Since the manual removal of PHI by medical professionals proved to be prohibitively time-consuming, tedious, costly and unreliable [2, 3, 4], extracting PHI requires more reliable, faster and cheaper automatic de-identification systems based on Natural Language Processing (NLP) methods [5]. The extraction of PHI can be viewed as a Named Entity Recognition (NER) problem applied in the medical domain for de-identification [6]. However, even though both traditional NER and de-identification involve the automatic recognition of particular phrases in text (persons, organizations, locations, dates, etc.), de-identification differs importantly from traditional NER [7]. In contrast to general NER, which focuses on newspaper texts, de-identification deals with clinical narratives characterized by fragmented and incomplete utterances, the lack of punctuation marks and formatting, many spelling and grammatical errors, as well as domain specific terminology and abbreviations. Since de-identification is the first step towards the identification and extraction of other relevant clinical information, it is extremely important to overcome the problem of the significant number of eponyms and other non-PHI erroneously categorized as PHI. For instance, anatomic locations, devices, diseases and procedures could be erroneously recognized as PHI and removed (e.g. "the Zvezdara method"1 vs. Clinical Center "Zvezdara"), reducing the usability and the overall meaning of clinical notes, and thus the accuracy of subsequent automatic processes performed on the de-identified documents. In this paper we introduce our automatic clinical narrative text de-identification system, based on the adaptation of the existing rule-based NER system for Serbian. The aim of this study is to evaluate the accuracy of PHI removal and replacement while preserving all the medically relevant information about a patient and keeping the resulting de-identified document usable for subsequent information extraction processes.
1 The original surgical 2-step arteriovenous loop graft procedure developed in the Clinical Center "Zvezdara", Belgrade, Serbia. Zvezdara is a municipality of Belgrade.
2 Related work
Over the past twenty years, various text de-identification approaches have been developed, but relatively few published reports focus only on unstructured medical data. An extensive review of recent research in automatic de-identification of narrative medical texts is given in [5]. However, most of these systems are highly specialized for specific document types or a subset of identifiers. Regarding the general nature of the applied de-identification methods, the majority of the systems used only one or two specific clinical document types (pathology reports, discharge summaries or nursing progress notes) for the evaluation [3, 8, 9, 10], while only a few of them were evaluated on a larger scale, with a more heterogeneous document corpus [11, 12, 13, 14]. The selection of targeted PHI varied from patient names only [12] to all 17 textual HIPAA PHI categories [3, 7, 15, 16, 17, 18, 19], or even everything but valid medical concepts [20, 21]. The de-identification approaches applied in the medical domain are mostly classified into rule-based or machine learning methods, while some hybrid approaches [14] efficiently take advantage of both. The rule-based methods [3, 15, 17, 19, 21] make use of dictionaries and hand-crafted rules to identify mentions of PHI, with no annotated training data.
Although these systems are often characterized with the limited generalizability that depends on the quality of the patterns and rules, they can be easily and quickly modified by adding rules, dictionary terms or regular expressions in order to improve the overall performance [22]. On the other hand, the machine-learning methods [7, 8, 9, 16, 18, 23], proved to be more easily generalized, automatically learn from training examples to detect and predict PHI. However, these methods require large amounts of annotated data and the adaptation of the system might be difficult due to the often unpredictable effects of a change. Extensive review of the published strategies and techniques specifically developed for de-identification of EMRs is given in [24]. In 2006, within the Informatics for Integrating Biology and the Bedside (i2b2) project and organized de-identification challenge, a small annotated corpus of hospital discharge summaries were shared among the interested participants, providing the basis for the system development and evaluation. Detailed overview and evaluation of the state-of-the-art systems that participated in the i2b2 de-identification challenge is given in [25]. Aside from systems specifically designed for the purpose of de-identification, some NER tools trained on newspaper texts also obtained respectable performance with certain PHI categories [7, 26]. 3 Data and methods This section provides an overview of our rule-based de-identification approach for narrative medical texts. 3.1 Training and text corpus The training corpus for our system development consisted of 200 randomly selected documents from different specialties, generated at three Serbian medical centers. They included discharge summaries (50), clinical notes (50) and medical expertise (100), with a total word count of 143,378. The discharge summaries and clinical notes are unstructured free text typed by the physicians at the conclusion of a hospital stay or series of treatments, including observations about the patient's medical history, his/her current physical state, the therapy administered, laboratory test results, the diagnostic findings, recommendations on discharge and other information about the patient state. Medical expertise documents were oversampled because of their richness in the PHI items. Main characteristics of medical narratives were confirmed in our corpus: fragmented and incomplete utterances and lack of punctuation marks and formatting. Moreover, as these documents are usually written in a great hurry there is also an unusual number of spelling, orthographic and typographic errors, much larger than in, for instance, newspaper texts from the Web. For the moment, we have taken these documents as they are and we are not attempting to correct them. In some particular situations we are able to guess the intended meaning, as will be explained in the next section. 3.2 The NER system The primary resources for natural language processing of Serbian consist of lexical resources and local grammars developed using the finite-state methodology as described in [27, 28]. For development and application of these resources the Unitex corpus processing system is used [29]. Among general resources used for NER task are the morphological e-dictionaries, covering both general lexica and proper names, as well as simple words and compounds, including not only entries collected from traditional sources, but also entries extracted from the processed texts [30]. 
Besides e-dictionaries, dictionary graphs in the form of finite-state transducers (FSTs) are used for the recognition and morphosyntactic tagging of open classes of simple words and compounds generally not found in dictionaries. Due to the high level of complexity and ambiguity of named entities, additional resources for NER were developed [31]. The Serbian NER system is organized as a cascade of FSTs - CasSys [32] - integrated in the Unitex corpus processor. Each FST in a cascade modifies a piece of text by replacing it with a lexical tag that can be used in subsequent FSTs. For instance, in the sequence Dom zdravlja "Milutin Ivković" 'Health Center "Milutin Ivković"', first the full name 'Milutin Ivković' is recognized and tagged {Milutin Ivković,.NE+persName+full:ms1v}, and then a subsequent transducer in the cascade uses this information to appropriately recognize and tag the full organization name, which can also be subsequently used (see Figure 1 and Example (1)).
Figure 1: A path in a cascade graph that uses already recognized NEs to recognize organization names.
(1) {Dom zdravlja "\{Milutin Ivković\,\.NE\+persName\+full\:ms1v\}",.NE+org:1sq:4sq}
The Serbian NER system recognizes a full range of traditional named entity types:
• Amount expressions - count, percentage, measurement and currency expressions;
• Time expressions - absolute and relative dates and times of day (fixed and periods), durations and sets of recurring times;
• Personal names - full names, parts of names (first name only, last name only), roles and functions of persons;
• Geopolitical names - names of states, settlements, regions, hydronyms and oronyms;
• Urban names - at this moment only city areas and addresses are recognized.
For the purpose of PHI de-identification not all of these NEs are of interest. For instance, amount expressions should not be de-identified, and roles or functions need not be de-identified. However, we chose not to exclude them from the recognition for two reasons: first, if they are recognized correctly that may prevent some false recognition, and second, even if they are not of interest for this specific task they may help in the recognition of some NEs that are of interest. For instance, in Example (2) a name is erroneously typed (both the first and the last name are incorrect), but due to the correct recognition of the person's function the name is also recognized.
(2) prof. dr sci Drangan Jorvanovic, specijalista za stomatološku protetiku i ortopediju
'Prof. PhD Drangan Jorvanovic, a specialist for Prosthetic Dentistry and Orthodontics'
The finite-state transducers used in the NER cascade, besides general and specific e-dictionaries as explained before, use local grammars that model various triggers and NE contexts, such as:
• The use of upper-case letters - for personal names, geopolitical names, organizations, etc.;
• The sentence boundaries - to resolve ambiguous cases where there is not enough other context;
• Trigger words - for instance, reka 'river', grad 'city' and similar can be used to recognize geopolitical names that are otherwise ambiguous;
• Other types of context - for instance, a punctuation mark following a country name that coincides with a relational adjective2 signals that it is more likely a country name than an adjective;
• Other NEs - for instance, an ambiguous city name can be confirmed if it occurs in a list of already recognized NEs representing cities.
Also, a five digit number that precedes the name of a city (already recognized) is tagged as a postal code (as used in Serbia).
• Grammatical information - this information is used to impose the obligatory agreement in case (sometimes also gender and number) between the parts of a NE. For instance, in "…istakao je gradonačelnik Londona Boris Džonson" '…stressed the Mayor of London Boris Johnson', Londona could be falsely added to the person's name (because London is also a surname) if grammatical information were not taken into consideration (Londona is in the genitive case, while Boris and Džonson are in the nominative case). This is enabled by the grammatical information that is part of NE lexical tags (see Example (1)).
2 In Serbian many country names coincide with relational adjectives of feminine gender: Norveška 'Norway' and norveška 'Norwegian'.
3.3 The PHI de-identification
We used our training corpus to create and adapt patterns that capture the characteristics of PHI. Through the corpus analysis we found that, out of the 18 HIPAA PHI categories, only eight appeared in our data. Since there is no annotation standard for PHI tagging, we collapsed some of the HIPAA categories into one (telephone and fax numbers, medical record numbers or any other unique identifying numbers). In order to maximize patient confidentiality, we adopted a more conservative approach, considering countries and organizations as PHI. For the purposes of this study, we defined the resulting PHI categories as follows:
• Persons (pers) - refers to all personal names; includes first, middle and/or last names of patients and their relatives, doctors, judges, witnesses, etc.;
• Dates (date) - includes all elements of dates except year and any mention of age information for patients over 89 years of age; according to HIPAA, ages over 89 should be collected under one category 90/120;
• Geographic locations (top) - includes countries, cities, parts of cities (like municipalities), postal codes;
• Organizations (org) - hospitals and other organizations (like courts);
• Numbers (num) - refers to any combination of numbers, letters and special characters representing telephone/fax numbers, medical record numbers, vehicle identifiers and serial numbers, and any other unique identifying numbers;
• Addresses (adrese) - street addresses.
The processing usually starts with a text having undergone sentence segmentation, tokenization, part-of-speech tagging and morphological analysis. After general-purpose lexical resources are used to tag the text with lemmas, grammatical categories and semantic features, the FST cascade is applied, recognizing persons, functions, organizations, locations, amounts, temporal expressions, etc. Since medical narratives have specific characteristics, the issue of date recognition arose first, and we added a small cascade of FSTs prior to the detection of sentences. For the de-identification task and the processing of medical data, we adjusted the temporal expression FSTs in order to cover only those temporal expressions that should be treated as PHI. We also developed new patterns for the identification of the different diagnostic codes present in the training documents that could be misinterpreted as identifiers and then erroneously masked.
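The authors implement these patterns as finite-state transducers in Unitex. Purely as an illustration of the same idea in a different formalism, the hedged sketch below uses regular expressions to protect diagnostic-code-like spans before masking free-standing identifying numbers. All patterns, tag names and the sample sentence are assumptions made for this example and are not part of the described system.

```python
import re

# Illustrative patterns only: an ICD-10-like diagnostic code (e.g. "S02.5")
# and a longer digit sequence treated as a potentially identifying number.
DIAGNOSTIC_CODE = re.compile(r'\b[A-Z]\d{2}(?:\.\d{1,2})?\b')
ID_NUMBER = re.compile(r'\b\d{4,}\b')

def mask_numbers(text: str) -> str:
    """Mask identifying numbers while leaving diagnostic codes untouched."""
    protected = [m.span() for m in DIAGNOSTIC_CODE.finditer(text)]

    def replace(match: re.Match) -> str:
        # Skip digits that fall inside an already recognized diagnostic code.
        if any(start <= match.start() and match.end() <= end
               for start, end in protected):
            return match.group(0)
        return '<num PHI="yes">' + 'X' * len(match.group(0)) + '</num>'

    return ID_NUMBER.sub(replace, text)

print(mask_numbers("Dg. S02.5, broj lične karte 1234567"))
```

In the real system the non-PHI decision is made by a dedicated FST applied early in the cascade, as described next.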
Being applied as first in the cascade, this FST produces lexical tags denoting non-PHI category of the diagnostic codes, bringing the precision and accuracy up to an acceptable level in order to prevent loss of clinical information in the de-identification process. Lexical tags produced by FSTs (see Example (3)), even though the most convenient for the use of subsequent FSTs in the cascade, are not useful for other applications and at the end are converted to the XML tags (Example (4)). (3) {Beogradu,.NE+top+gr:s7q} (4) Beogradu The de-identification can be performed in several ways: PHI that needs to be de-identified can be replaced by a tag denoting its corresponding category, with a surrogate text, or both. We have chosen the latter approach. Moreover, since we are dealing with the narrative texts as a result we want to obtain a narrative text as well. To that end, the surrogate text is chosen to agree in case, gender and number with the PHI it replaces (if applicable). Again, such a replacement is enabled by grammatical information associated with some NE types (personal names, organization names, locations, etc.). For instance, recognized geographic name (+top) of the city (+gr) in Example (3) will be replaced by the surrogate text with the same values of the grammatical categories (Example (5)). (5) Kamengradu At this moment, our system does not keep the internal structure of the numbers PHI category (num), and all the PHI numbers are simply replaced by placeholder characters X. Regarding the temporal information, only the month and day portion of date expressions are considered PHI. According to HIPAA, the years are excluded from this category, being important features of the clinical context. In order to preserve the existing interval in days between two events in the text or the duration of specific symptoms, all dates were replaced by a shifted date that is consistent throughout all the de-identified documents. 3.4 An example In this subsection we will give an example taken from the part of the test corpus containing medical expertise. The part of one note is given in Example (6).3 The same expertise after the de-identification and tagging is given in Example (7).4 (6) Naš broj 23/246 OPŠTINSKI SUD - Istražni sudija G-da Jovana Jovanovic- Vašim zahtevom u predmetu TR 123/01 od 23.07.2007. god. zatražili ste od Komisije lekara veštaka Medicinskog fakulteta Univerziteta u Kragujevcu sudskomedicinsko veštacenje o vrsti i težini telesnih povreda koje je dana 4.02. 2007. god. zadobio Petrovič Dragan iz Jagodine. PODACI 1. Pri pregledu na Medicinskom fakultetu u Kragujevcu, obavljenom dana 12.02.2007. god. od strane članova Komisije lekara veštaka Medicinskog fakulteta Univerziteta u Kragujevcu, Dragan, Miroslava, Petrovič navodi: roden je 14. 01. 1956. god. u Jagodini, živi u Jagodini, ul. Savska br. 7, po zanimanju pekar, broj lične karte 1234567, MUP Jagodina . Amanestički navodi operaciju kolena marta 2000. god., negira postojanje oboljenja. Dana 4,02. 2007. god. oko 12,30 h, na sportskom terenu došlo je do fizičkog obračuna izmedu Dragana i njegovog poznanika. NALAZ 1. U izveštaju Dr Petra Dragica, specijaliste za otorinolaringologiju, Zdravstvenog centra "Milutin Ivkovič", iz Jagodine, na ime Petrovič Dragana, od 4.02.07. god., navedeno je sledeče: "Povreden u tuči od strane poznatog lica. Svest nije gubio. Dg. 
Fractura dentis incisiv" 'Our number 23/246 Municipal court - Judge Mrs Jovana Jovanovič -In your request in the case TR 123/01 from 23/07/2007 you asked for a medico-legal expertise on the type and 3 This example looks exactly as the original - however, for the purpose of protecting the personal data we have manually replaced all the personal information with some "real world" data. 4 We wanted to avoid the introduction of some real people names and real location names in the de-identified texts. Instead we used names: Barni Kamenko (Barney Rubble), Vilma Kremenko (Vilma Flintstone), Kamengrad (Bedrock), Serbian names for the characters from the sitcom The Flinstones, created by Hanna-Barbera Productions, Inc. gravity of bodily injuries inflicted on Petrovič Dragan from Jagodina on 4/02/2007 from the Commission of the Faculty of Medical Sciences University of Kragujevac's medical experts. DATA 1. Upon the examination performed in Faculty of Medical Sciences of Kragujevac, conducted on 12/02/2007 by members of the Commission of the Faculty of Medical Sciences University of Kragujevac's medical experts, Dragan, Miroslava, Petrovič states: born on 14/ 01/1956 in Jagodina, lives in Jagodina, Savska Street 7, a baker by profession, ID number 1234567, MIA Jagodina. Anamnestic states the knee surgery performed on March 2000, negates the existence of a disease. On 4/02/2007 around 12:30 PM, on the sports field it came to a physical confrontation between Dragan and his acquaintance. FINDING 1. In the medical report of Zoran Dragič, MD, specialised in otorhinolaryngology, Medical Centre "Milutin Ivkovič", from Jagodina, on Petrovič Dragan's name, from 4/02/2007, the following was stated: "He was injured in a fight by an acquaintance. He didn't lose his consciousness. Dg. Fractura dentis incisiv. " (7) Naš XXXX SUD - Istražni sudija gospoda Vilma Kremenko-Vašim zahtevom u predmetu XXXX od 28.12.2007. zatražili ste od Komisije fakulteta Univerziteta sudskomedicinsko veštacenje o vrsti i težini telesnih povreda koje je dana 09.07.2007. zadobio Barni Kamenko iz Kamengrada. PODACI 1.{S} Pri pregledu na fakultetu, obavljenom dana 17.07.2007. od strane članova Komisije fakulteta Univerziteta, Barni Kamenko navodi: roden je 19.06.1956. u Kamengradu, živi u Kamengradu, ul. Kamenolomska br. 6a, po zanimanju pekar, broj lične karte XXXX, MUPKamengrad .{S} Amanestički navodi operaciju kolena avgusta 2000., negira postojanje oboljenja.{S} Dana 09.07.2007. oko 12,30 h, na sportskom terenu došlo je do fizičkog obračuna izmedu Dragana i njegovog poznanika. NALAZ 1.{S} U izveštaju Barnija Kamenka, specijaliste za otorinolaringologiju, centra, iz Kamengrada, na ime Vilma Kremenko, od 09.07.07., navedeno je sledeče:"....{S} Povreden u tuči od strane poznatog lica.{S} Svest nije gubio.{S} Dg.{S} Fractura dentis incisiv." This example demonstrates our de-identification approach. Each detected PHI was enclosed in XML tags indicating its corresponding category, with the PHI attribute value set to "yes". Note that all dates were shifted into the future by the same amount. Information specific to the hospitals and other organizations was replaced by a generalized data with the same organizational hierarchy. 
For instance, a sequence of hierarchical organization names Komisija lekara veštaka Medicinskog fakulteta Univerziteta u Kragujevcu 'the Commission of medical experts of the Faculty of Medical Sciences of the University of Kragujevac' is replaced by Komisija fakulteta Univerziteta 'a Commission of a faculty of a University'. Some personal data remained: the occurrence of the first name of the patient. Also, the replacement text was not always correct: the male patient's name was once replaced by a female name because the original occurrence was ambiguous and could be interpreted both as a masculine name Petrovič Dragan (in the genitive case) and as a feminine name Petrovič Dragana (in the nominative case). Our system randomly chose the feminine name. These occurrences are bolded and underlined in Example (7).
4 Evaluation results
The previously described system for automatic de-identification has been evaluated on a set of 100 randomly selected documents (total word count of 35,822), consisting of discharge summaries (60), clinical notes (27) and medical expertise (13). These texts were not used in the system development and presented completely unseen material containing many occurrences of PHI. Details about the PHI distribution within the test corpus can be found in Table 1.
PHI / Document type   Clinical reports   Discharge summaries   Medical expertise   Total
pers                  52                 254                   407                 713
top                   32                 219                   109                 360
org                   62                 164                   242                 468
num                   20                 61                    90                  171
date                  65                 133                   267                 465
adrese                0                  64                    10                  74
Total                 231                895                   1125                2251
Table 1: The PHI distribution considering document type.
The performance has been evaluated with respect to recognition, bracketing and replacement of PHI. For that reason, a new attribute 'check' has been added to each XML tag. The possible values of this attribute were the following: OK - PHI was correctly recognized, the full extent was correctly determined, and the replacement was correctly assigned; UOK - either UOK1 (PHI type was correctly recognized, but the full extent was not correctly determined, so some part of the PHI was revealed) or UOK2 (PHI type was not correctly determined, but the full extent was correctly determined and the PHI was successfully masked); NOK - an utterance tagged falsely as PHI and de-identified; MISS - PHI was not recognized; MISS/E - PHI was not recognized because of incorrect input. In some cases, when it was not easy to decide which value of the 'check' attribute is the most appropriate (e.g. a personal name used as the name of an organization), we always treated the assigned tag as correct - for example, a personal name tag was accepted even though the utterance belonged to the organization category. We report the results of the evaluation using the traditional performance measures: precision (positive predictive value), recall (sensitivity) and F-measure (the harmonic mean of recall and precision). These measures are calculated at the phrase level, considering the entire PHI annotation as the unit of evaluation. The harmonic mean of recall and precision is calculated in two ways, using strict and relaxed criteria. With the strict criteria we consider as true positives only fully correctly recognized and de-identified PHI, and as false negatives all PHI that were not recognized and de-identified, regardless of the reason (including incorrect input). With the relaxed criteria we consider as true positives all correctly recognized and de-identified PHI, including partial recognition and false type attribution, and as false negatives all PHI that were not recognized and de-identified if the input was correct (see Table 2).
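As a minimal sketch (not the authors' evaluation code), the snippet below computes the measures defined in Table 2, together with the slot error rate used later in the evaluation, from per-category annotation counts; the function name is an assumption, while the count values passed in are the totals reported in Table 3.

```python
def evaluate(ok, uok1, uok2, miss, miss_e, nok):
    uok = uok1 + uok2
    # Strict criteria: only fully correct PHI count as true positives.
    p_strict = ok / (ok + nok + uok)
    r_strict = ok / (ok + miss + miss_e)
    # Relaxed criteria: partially recognized or mistyped PHI also count.
    p_relaxed = (ok + uok) / (ok + nok + uok)
    r_relaxed = (ok + uok) / (ok + uok + miss)
    f1 = 2 * p_strict * r_strict / (p_strict + r_strict)
    f2 = 2 * p_relaxed * r_relaxed / (p_relaxed + r_relaxed)
    # Slot error rate: substitutions + deletions + insertions over all PHI.
    ser = (uok + miss + miss_e + nok) / (ok + uok + miss + miss_e)
    return p_strict, r_strict, f1, p_relaxed, r_relaxed, f2, ser

print([round(v, 3) for v in evaluate(ok=2055, uok1=18, uok2=47,
                                     miss=95, miss_e=36, nok=58)])
```

Run on the totals from Table 3, this reproduces the overall figures reported below: roughly 0.94 for strict precision, recall and F1, about 0.97/0.96 under the relaxed criteria, and a slot error rate of about 11.3%.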
The overall evaluation of the system is presented in Table 3 and Table 4.
        Strict criteria             Relaxed criteria
TP      OK                          OK + UOK
FP      NOK + UOK                   NOK
FN      MISS + MISS/E               MISS
P       OK / (OK + NOK + UOK)       (OK + UOK) / (OK + NOK + UOK)
R       OK / (OK + MISS + MISS/E)   (OK + UOK) / (OK + UOK + MISS)
Table 2: Calculation using the strict and relaxed criteria: TP (true positive), FP (false positive), FN (false negative), P (precision), R (recall).
PHI     OK    UOK1  UOK2  MISS  MISS/E  NOK
pers    634   12    47    15    5       30
top     337   0     0     14    9       5
org     434   0     0     28    6       6
num     132   2     0     36    1       8
date    455   4     0     1     5       7
adrese  63    0     0     1     10      2
Total   2055  18    47    95    36      58
Table 3: Evaluation data.
PHI     Precision (p1)  Recall (r1)  F1-measure  Precision (p2)  Recall (r2)  F2-measure
pers    0.88            0.97         0.92        0.96            0.98         0.97
top     0.99            0.94         0.96        0.99            0.96         0.97
org     0.99            0.93         0.96        0.99            0.94         0.96
num     0.93            0.78         0.85        0.94            0.79         0.86
date    0.98            0.99         0.98        0.98            1.00         0.99
adrese  0.97            0.85         0.91        0.97            0.98         0.98
Total   0.94            0.94         0.94        0.97            0.96         0.97
Table 4: Performance measures for PHI de-identification by applying the strict criteria (1) and the relaxed criteria (2).
Besides the traditional F-measure, the evaluation was also performed using the Slot Error Rate (SER) [33]. As a simple error measure, the SER weights the different types of error equally and directly, enabling the comparison of all systems against a fixed base. The SER is equal to the sum of the three types of errors - substitutions (UOK1, UOK2), deletions (MISS, MISS/E), and insertions (NOK) - divided by the total number of PHI in the reference corpus:
SER = (NOK + MISS + MISS/E + UOK1 + UOK2) / (OK + UOK1 + UOK2 + MISS + MISS/E)   (1)
In measuring the accuracy of the de-identification system, the extent of PHI (UOK1), missed PHI (MISS, MISS/E) as well as entities falsely tagged as PHI (NOK) should be taken into consideration as equally weighted separate error slots. In that way, unlike the more relaxed F-measure, the SER of 11.3% stresses the importance of the errors that lead to the revealing of PHI.
5 Discussion
Clinical records are considered de-identified when, after the removal of PHI, the remaining information cannot be used alone or in combination with other information to identify an individual. Nevertheless, even though PHI is removed, there is a concern that de-identified medical documents could potentially be re-identified, i.e. that it is difficult but still possible to re-establish the link between the individual and his/her data [24]. In the context of de-identification each PHI category is treated differently. There are obvious identifiers (e.g. name, telephone number, home address) as well as quasi-identifiers that can play an important role in indirect re-identification, such as dates, locations, race and gender [34]. In some cases, more than one identifying variable is needed to identify an individual uniquely. For example, sex and year of birth combined with the disease name (it might be some rare disease) could be used for indirect re-identification. However, some PHI categories, such as ages over 89, geographical locations, hospitals and other organizations, are most frequently ignored by the existing de-identification systems [5]. Even though, according to the most frequently referenced regulation, HIPAA, states and organizations are not considered PHI, we adopted a more conservative approach, considering them as variables that could be used for re-identification. We found that our NER system could be modified to work on medical narratives for de-identification purposes.
However, certain modifications were necessary in order to preserve relevant clinical information. Previous evaluation results showed that the Serbian NER system gives priority to precision over recall [30], and the recall rate had to be improved for the de-identification task. An error analysis shows that every correctly recognized PHI was correctly de-identified. The main source of errors was missed PHI, resulting in information disclosure. The most frequently missed PHI were numbers and organizations not covered by our pattern rules and dictionaries, while fewer than 6% of errors resulted in the revealing of the most sensitive category, i.e. person names. Another source of errors that could cause PHI exposure was wrongly determined PHI extent (4.72% of total errors). Fewer than 20% of errors were examples tagged with an incorrect PHI category, which may only reduce the readability of the resulting de-identified text without exposing PHI. Since one of the main goals is to preserve medically relevant information, it is important to pay special attention to false positives, which represented 22.83% of total errors. For pers, a majority of false positives were disease and procedure names. Our automatic de-identification system achieved very competitive precision and recall rates, showing an overall F1-measure of 0.94 (Table 4). High performance was achieved for most PHI types, except for numbers. The highest precision of 0.99 was reached for geographic locations and organizations, followed by dates, addresses and numbers. When partially recognized and wrongly tagged personal names are treated as true positives, the precision of their de-identification is better. With respect to recall, the most important measure for de-identification, dates have the highest rate. Besides dates, almost all PHI categories showed high sensitivity, ranging from 0.93 to 0.99. The lowest recall rates, for numbers (0.78) and addresses (0.85), suggest that the rules for the corresponding categories have to be improved. In terms of recall, especially for dates and personal names, we may say that our de-identification is sufficient to guarantee high patient privacy, with competitive precision and preserved document usefulness for subsequent applications.
6 Conclusion
In this paper, we presented an automatic text de-identification system for medical narrative texts, based on the adaptation of the existing rule-based NER system for Serbian. We have also produced the first version of a de-identified medical corpus that could be useful to the research community interested both in analysing different medical phenomena and in producing a machine-learning automatic de-identification system for Serbian. Even though the evaluation of the presented system was conducted on a relatively small set of documents, we collected a heterogeneous corpus, consisting of different document types belonging to various medical specialties and institutions. The results of this preliminary evaluation are very promising, indicating that our adapted NER system can achieve high performance on the de-identification task. However, there is still much to be done. In the future, we plan to focus on improving our strategies, such as completing the existing and adding new patterns covering broader formats of PHI (email addresses, URLs, IP address numbers) and the disambiguation of clinical eponyms and abbreviations.
Finally, we intent to measure the impact of the de-identification through the subsequent natural language processing task of medical concepts' recognition. References [1] Health Insurance Portability and Accountability Act. P.L. 104-191, 42 USC. 1996. [2] Douglass, M., Clifford, G. D., Reisner, A., Moody, G. B., Mark, R. G. 2004. Computer-assisted de-identification of free text in the MIMIC II database. Computers in Cardiology, 31:341-344. [3] Neamatullah, I., Douglass, M. M., Lehman, L.-W. H., Reisner, A., Villarroel, M., Long, W. J., Szolovits, P., Moody, G. B., Mark, R. G. & Clifford, G. D. 2008. Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making, 8:32. [4] Deleger, L., Molnar, K., Savova, G., Xia, F., Lingren, T., Li, Q., Marsolo, K., Jegga, A., Kaiser, M., Stoutenborough, L. & Solti, I. 2013. Large- scale evaluation of automated clinical note de-identification and its impact on information extraction. Journal of the American Medical Informatics Association, 20:84-94. [5] Meystre, S. M., Friedlin, F. J., South, B. R., Shen, S. & Samore, M. H. 2010. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology, 10:70. [6] Nadeau, D. & Sekine, S. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30:3-26. [7] Wellner, B., Huyck, M., Mardis, S., Aberdeen, J., Morgan, A., Peshkin, L., Yeh, A., Hitzeman, J. & Hirschman, L. 2007. Rapidly retargetable approaches to de-identification in medical records. Journal of the American Medical Informatics Association, 14:564-573. [8] Gardner, J. & Xiong, L. 2008. HIDE: An integrated system for health information DE-identification. In: Proceedings of the 21st IEEE International Symposium on Computer-Based Medical Systems. 254-259. [9] Uzuner, O., Sibanda, T. C., Luo, Y. & Szovits, P. 2008. A de-identifier for medical discharge summaries. Artificial Intelligence in Medicine, 42:13-35. [10] Gardner, J., Xiong, L., Wang, F., Post, A., Saltz, J. & Grandison, T, 2010. An evaluation of feature sets and sampling techniques for de-identification of medical records. In: Veinot T, (ed.), Proceedings of the 1st ACM International Health Informatics Symposium. New York:ACM. 183-190. [11] Sweeney, L. 1996. Replacing personally-identifying information in medical records, the Scrub system. Proc AMIA Annu Fall Symp, 333-337. [12] Taira, R. K., Bui, A. T. A. & Kangarloo, H. 2002. Identification of patient name references within medical documents using semantic selectional restrictions. Proc AMIA Annu Symp, 757-761. [13] Ruch, P., Baud, R. H., Rassinoux, A. M., Bouillon, P. & Robert, G. 2000. Medical document anonymization with a semantic lexicon. Proc AMIA Symp, 729-733. [14] Ferrandez, O., South, B. R., Shen, S., Friedlin, F. J., Samore, M. H. & Meystre, S. M. 2013. BoB, a best-of-breed automated text de-identification system for VHA clinical documents. Journal of the American Medical Informatics Association, 20:77-83. [15] Gupta, D., M. Saul, and J. Gilbertson. 2004. Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. American Journal of Clinical Pathology 121 (2):176-186. [16] Aramaki, E., Imai, T., Miyo, K., Ohe, K. Automatic deidentification by using sentence features and label consistency. In: Workshop on challenges in natural language I2b2 processing for clinical data. Washington, DC; 2006. [17] Beckwith, B. A., R. 
Mahaadevan, U. J. Balis, and F. Kuo. 2006. Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Medical Informatics and Decision Making 6. [18] Guo, Y., Gaizauskas, R., Roberts, I., Demetriou, G., Hepple, M. 2006. Identifying personal health information using support vector machines. In: Workshop on challenges in natural language I2b2 processing for clinical data. Washington, DC; 2006. [19] Friedlin, F. Jeff, and Clement J. McDonald. 2008. A software tool for removing patient identifying information from clinical documents. Journal of the American Medical Informatics Association 15 (5):601-610. [20] Berman, J. J. 2003. Concept-match medical data scrubbing - How pathology text can be used in research. Archives of Pathology & Laboratory Medicine, 127:680-686. [21] Morrison, F. P., Lai, A. M. & Hripcsak, G. 2009. Repurposing the Clinical Record: Can an Existing Natural Language Processing System De-identify Clinical Notes? Journal of the American Medical Informatics Association, 16:37-39. [22] Meystre, S. M., Ferrandez, O., Friedlin, F. J., South, B. R., Shen, S. & Samore, M. H. 2014. Text de-identification for privacy protection: A study of its impact on clinical text information content. Journal of Biomedical Informatics. doi: 10.1016/j.jbi.2014.01.011. [23] Aberdeen, J., Bayer, S., Yeniterzi, R., Wellner, B., Clark, C., Hanauer, D., Malin, B. & Hirschman, L. 2010. The MITRE Identification Scrubber Toolkit: Design, training, and assessment. International Journal of Medical Informatics, 79:849-859. [24] Kushida, C. A., Nichols, D. A., Jadrnicek, R., Miller, R., Walsh, J. K., & Griffin, K. 2012. Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies. Medical care, 50(Suppl 1):S82-S101. [25] Uzuner, O., Luo, Y. & Szolovits, P. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14:550-563. [26] Benton, A., Hill, S., Ungar, L., Chung, A., Leonard, C., Freeman, C. & Holmes, J. H. 2011. A system for de-identifying medical message board text. BMC Bioinformatics, 12(Suppl 3):S2. [27] Courtois, B., Silberztein, M. 1990. Dictionnaires electroniques du frangais. Larousse, Paris. [28] Gross, M. 1989. The use of finite automata in the lexical representation of natural language. Lecture Notes in Computer Science, 377:34-50. [29] Paumier, S. 2011. Unitex 3.0 User manual. http:// http://www-igm.univ-mlv.fr/~unitex/UnitexManual3.0.pdf. [30] Krstev, C. Processing of Serbian - Automata, Texts and Electronic dictionaries. Faculty of Philology, University of Belgrade, Belgrade, 2008. [31] Krstev, C., Obradovic, I., Utvic, M. & Vitas, D. 2014. A system for named entity recognition based on local grammars. Journal of Logic and Computation, 24:473-489. [32] Maurel, D., Friburger, N., Antoine, J. Y., Eshkol-Taravella, I. & Nouvel, D. 2011. Transducer cascades surrounding the recognition of named entities. Cascades de transducteurs autour de la reconnaissance des entites nommees, 52:69-96. [33] Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. 1999. Performance measures for information extraction. In: Proceedings of DARPA broadcast news workshop. 249-252. [34] El Emam, K., Dankar, F. K., Issa, R., Jonker, E., Amyot, D., Coco, E., Corriveau, J.P., Walker, M., Chowdhury, S., Vaillancourt, R., Roffey, T., & Bottomley, J. 2009. A globally optimal k-anonymity method for the de-identification of health data. 
Journal of the American Medical Informatics Association, 16:670-82.
Probabilistic 2D Point Interpolation and Extrapolation via Data Modeling
Dariusz Jacek Jakóbczak
Department of Electronics and Computer Science, Technical University of Koszalin, Sniadeckich 2, 75-453 Koszalin, Poland
E-mail: dariusz.jakobczak@tu.koszalin.pl
Keywords: 2D data interpolation, Hurwitz-Radon matrices, MHR method, probabilistic modeling, curve extrapolation
Received: June 25, 2014
Mathematics and computer science are interested in methods of 2D curve interpolation and extrapolation using a set of key points (knots). The proposed method of Hurwitz-Radon Matrices (MHR) is such a method. This novel method is based on the family of Hurwitz-Radon (HR) matrices, which possess columns composed of orthogonal vectors. A two-dimensional curve is interpolated via different functions used as probability distribution functions: polynomial, sinus, cosine, tangent, cotangent, logarithm, exponent, arcsin, arccos, arctan, arcctg or power function, and also their inverse functions. It is shown how to build the orthogonal matrix operator and how to use it in the process of curve reconstruction.
Povzetek: Opisana je nova metoda 2D interpolacije in ekstrapolacije krivulj.
1 Introduction
Curve interpolation and extrapolation [1] represent one of the most important problems in mathematics: how to model a curve [2] via a discrete set of two-dimensional points [3]? The matter of curve representation and parameterization is also still open in mathematics and computer science [4]. The author approaches the problem of curve modeling by characteristic points. The proposed method relies on the functional modeling of curve points situated between the basic set of nodes or outside the nodes. The functions used in the calculations represent the whole family of elementary functions together with their inverse functions: polynomials and trigonometric, cyclometric, logarithmic, exponential and power functions. These functions are treated as probability distribution functions in the range [0,1]. Present-day methods apply mainly polynomial functions, for example Bernstein polynomials in Bezier curves, splines and NURBS [5]. Numerical methods for data interpolation or extrapolation are based on polynomial or trigonometric functions, for example the Lagrange, Newton, Aitken and Hermite methods. These methods have some weaknesses [6] and are not sufficient for curve interpolation and extrapolation in situations when the curve cannot be built from polynomials or trigonometric functions. The proposed 2D curve interpolation and extrapolation is functional modeling via any elementary functions, and it helps us to fit the curve during the computations. The main contribution of the paper is the presentation of a method that connects such problems as interpolation, extrapolation, modeling, numerical methods and probabilistic methods. This is a new approach to these problems. The differences from the previous papers of the author are connected with calculations without matrices (N = 1), new probabilistic distribution functions and a novel look at shape modeling and curve reconstruction. The method of Hurwitz-Radon Matrices (MHR) requires minimal assumptions: the only information about a curve is a set of at least two nodes. The proposed method of Hurwitz-Radon Matrices (MHR) is applied in curve modeling via different coefficients: polynomial, sinusoidal, cosinusoidal, tangent, cotangent, logarithmic, exponential, arcsin, arccos, arctan, arcctg or power.
The function for MHR calculations is chosen individually for each interpolation and it represents a probability distribution function of the parameter α ∈ [0,1] for every point situated between two interpolation knots. The MHR method uses two-dimensional vectors (x,y) for curve modeling - knots (xi,yi) ∈ R2:
1. The MHR version with no matrices (N = 1) needs 2 knots or more;
2. At least five knots (x1,y1), (x2,y2), (x3,y3), (x4,y4) and (x5,y5) are needed if the MHR method is implemented with matrices of dimension N = 2;
3. For more precise modeling, knots ought to be settled at key points of the curve, for example at local minima or maxima, with at least one node between two successive local extrema.
Condition 2 is connected with an important feature of the MHR method: the MHR version with matrices of dimension N = 2 (MHR-2) requires at least five nodes, the MHR version with matrices of dimension N = 4 (MHR-4) needs at least nine nodes and the MHR version with matrices of dimension N = 8 (MHR-8) requires at least 17 nodes. Condition 3 means, for example, the highest point of the curve in a particular orientation, convexity changes or curvature extrema. So this paper wants to answer the question: how to interpolate and extrapolate the curve by a set of knots?
Figure 1: Knots of the curve before modelling.
Coefficients for curve modeling are computed using probability distribution functions: polynomials, power functions, sinus, cosine, tangent, cotangent, logarithm, exponent or arcsin, arccos, arctan, arcctg.
2 Probabilistic 2D interpolation and extrapolation
The method of Hurwitz-Radon Matrices (MHR) computes points between two successive nodes of the curve: the calculated points are interpolated and parameterized for a real number α ∈ [0,1] in the range of two successive nodes. Data extrapolation is possible for α < 0 or α > 1. MHR calculations deal with square matrices of dimension N = 1, 2, 4 or 8. Matrices Ai, i = 1, 2, ..., m, satisfying
Aj·Ak + Ak·Aj = 0,  Aj^2 = -I  for j ≠ k; j, k = 1, 2, ..., m
are called a family of Hurwitz-Radon matrices. They were discussed by Adolf Hurwitz and Johann Radon separately in 1923. A family of Hurwitz-Radon (HR) matrices [7] is skew-symmetric: Ai^T = -Ai and Ai^(-1) = -Ai. Only for dimensions N = 1, 2, 4 or 8 does the family of HR matrices consist of N - 1 matrices. For N = 1 there are no matrices, only calculations with real numbers. For N = 2 there is one HR matrix:
A1 = [ 0  1
      -1  0 ].
For N = 4 there are three HR matrices with integer entries:
A1 = [ 0  1  0  0      A2 = [ 0  0  1  0      A3 = [ 0  0  0  1
      -1  0  0  0             0  0  0  1             0  0 -1  0
       0  0  0 -1            -1  0  0  0             0  1  0  0
       0  0  1  0 ],          0 -1  0  0 ],         -1  0  0  0 ].
For N = 8 we have seven HR matrices with elements 0, ±1. So far HR matrices have found applications in Space-Time Block Coding (STBC) [8] and orthogonal design [9], in signal processing [10] and in Hamiltonian Neural Nets [11].
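The three dimension-4 matrices listed above can be checked directly against the defining conditions. The short numpy sketch below does exactly that; it is an illustration only (the matrices are as reconstructed above, the code is not part of the paper).

```python
import numpy as np

# The three Hurwitz-Radon matrices of dimension N = 4 listed above.
A1 = np.array([[0, 1, 0, 0], [-1, 0, 0, 0], [0, 0, 0, -1], [0, 0, 1, 0]])
A2 = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [-1, 0, 0, 0], [0, -1, 0, 0]])
A3 = np.array([[0, 0, 0, 1], [0, 0, -1, 0], [0, 1, 0, 0], [-1, 0, 0, 0]])
I = np.eye(4, dtype=int)

family = [A1, A2, A3]
for j, Aj in enumerate(family):
    # Skew-symmetry and Aj^2 = -I.
    assert (Aj.T == -Aj).all() and (Aj @ Aj == -I).all()
    for Ak in family[j + 1:]:
        # Anticommutation: Aj Ak + Ak Aj = 0 for j != k.
        assert (Aj @ Ak + Ak @ Aj == 0).all()
print("All Hurwitz-Radon conditions hold for N = 4.")
```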
How coordinates of knots are applied for interpolation and extrapolation? If knots are represented by the following points {(xi^yi), i = 1, 2, n} then HR matrices combined with the identity matrix I^n are used to build the orthogonal Hurwitz - Radon Operator (OHR). M8( P1, P2... P8) = - where 7l - y. 72 7l U0 u1 u2 U3 U4 u5 U6 u7 - u1 u0 U3 - u2 U5 -u4 - u7 U6 - u2 - u3 U0 u1 U6 u7 - u4 - u5 1 - U3 U2 - U1 U0 u7 -u6 u5 - u4 8 X, ^ - U4 - U5 - U6 - U7 U0 u1 U2 U3 i=1 - u5 U4 - u7 U6 - U1 U0 - U3 u2 - u6 u7 U4 - U5 - u2 U3 U0 - u1 - u7 - u6 U5 U4 - u3 -u2 U1 u0 . (3) 73 y4 75 y6 y7 7S x1 - y4 73 - y6 75 7s - 77 x2 (4) 7l - 72 - TZ - 7s y5 76 x3 72 7l - 7s 77 - y6 75 x4 77 Ts 7l 72 - y3 74 x5 Ts - 77 72 71 y4 - 73 x6 - 75 y6 73 - y4 y1 72 x7 - y6 - 75 y4 y3 - y2 7, xs - Ti 74 - 74 - yi - 75 76 - 76 - 75 and u = (u0, U1,_, U7)T (4). OHR operators Mn (0)-(3) satisfy the condition of interpolation Mn^x = y (5) for x = (x1,x2_,xn)t e R^, x ^ 0, y = (7l,y2^,yN)T e and N = 1, 2, 4 or 8. 2.1 Distribution functions in MHR interpolation and extrapolation Points settled between the nodes are computed [13] using MHR method [14]. Each real number c e [a;b] is calculated by a convex combination u u u u 0 -u u u -u 0 u u 0 -y J8 c = a • a + +(1 - a) ■ b b-c for a = b-a =[0,1]. (6) The weighted average OHR operator M of dimension N= 1,2, 4 or 8 is build: M = r- A+{\-r)-B. (7) The OHR matrix A is constructed (l)-(3) by every second knot pi=(xi =a,yi), .. .and A = The OHR matrix B is computed (l)-(3) by knots P2={x2=b,y2\ P4=(^4,74), ... and P2Ap(^2Af,72Af): B= MiIp2, P4,..., P2N)-Vector of first coordinates C is defined for Ci= a-X2i.-i+ {l-o.)-X2i , 2 = 1, 2,..., A^ (8) and C= [ci, C2,..., ca?]^. The formula to calculate second coordinates y(cj) is similar to the interpolation formula (5): Y{C) = M-C (9) where Y{C) = LK^i), Xci),---, Xc«)]^. So interpolated value X^i) from (9) depends on two, four, eight or sixteen (2N) successive nodes. For example A^l results in computations without matrices: connected with applied functions F(a). These functions (12)-(41) represent the probability distribution functions for random variable a e [0,1] and real number s> 0: 1. power function with s>0. (12) For s = 1: basic version of MHR method when y = a. 2. sinus C= Ci = (l-a)-^2 , y y Xq) = 7 - /i + (1 -«)(1- 7)/2 + T(\-oc)—x^ + a(l- 7)—ATj (10) Formula (10) shows a clear calculation for interpolation of any point between two successive nodes (^1,71) and (^2,72). Key question is dealing with coefficient y in (7). Basic MHR version means y= a and then (10): Xq) = a' • /1 + (1 - af + «(1 - «)(— /2 (11) Formula (11) represents the simplest way of MHR calculations (A^ = 1, y = a) and it differs from linear interpolation y{c) = a■ MHR is not a hnear interpolation. Each interpolation requires specific distribution of parameter a (7) and y depends on parameter a e [0,1]: y = F(a), F:[0,1]^[0,1], F(0) = 0, F(l) = 1 and F is strictly monotonic. Coefficient y is calculated using different functions (polynomials, power functions, sinus, cosine, tangent, cotangent, logarithm, exponent, arcsin, arccos, arctan or arcctg, also inverse functions) and choice of function is connected with initial requirements and curve specifications. Different values of coefficient y are Y = sin{af ■ k/2) , s>0 (13) or y = sinXa ■ %I2), s > 0. (14) Fors = 1: j= sin{a-%l2). (15) 3. cosine y= l-cos(a'• 71/2), s>0 (16) or y= l-cos^(a-71/2), s>0. (17) Fors = 1: y= l-cos(a-71/2). (18) 4. tangent y = tan{a' -71/4), s > 0 (19) or y = • 71/4), s>0. 
(20) Fors = = 1: y = • 71/4). (21) 5. logarithm y = log2{a' + 1), s > 0 (22) or y = log2\a + 1), s > 0. (23) Fors = = 1: j = log2{a+\). (24) 6. exponent Y = s>0 and a>0 and a^ 1. (25) a-1 Fors = = land a = 2: y = 2"-l. (26) 7. arc sine J = 21%- arcsinia!) , s > 0 (27) or y = (2/71- arcsin a)", s > 0. (28) Fors = = 1: y = 2/71- arcsin(a). (29) 8. arc cosine y = 1-2/71- arccosia!) , s > 0 (30) or y = 1-(2/71- arccos a)", s > 0. (31) Fors = = 1: y = 1-2/71- arccos(a). (32) 9. arc tangent y = 4/71- arctania) , s > 0 (33) or y = (4/71- arctan a)', s> 0. (34) Fors = = 1: y = 4/71- arctan(a). (35) 10. cotangent y= ctgi%/2 - a' ■-k/4) , s>0 (36) or y= rt/(71/2 - a - 71/4), s>0. (37) Fors = = 1: j= ctg{%l2-a-%IA). (38) 11. arc cotangent y = 2 - 4/71- arcctgia') , s > 0 or y = (2 - 4/71- arcctg af, s> 0. For s =1: y = 2 - 4/71- arcctg(d). (39) (40) (41) f{a) = F\a) , F{a) = \ f{t)df 0 /(a) = a • e" (43) and For example (43): Y = F{a) = {a- l)e" +1 • What is very important: two curves may have the same set of nodes but different N ox j results in different interpolations (Fig.6-13). Here are some apphcations of MHR method with basic version (y = a): MHR-2 is MHR version with matrices of dimension N = 2 and MHR-4 means MHR version with matrices of dimension N=4. Functions used in y calculations (12)-(41) are strictly monotonic for random variable a e [0,1] as y = F(a) is probability distribution function. Also inverse function F ^a) is appropriate for y calculations. Choice of function and value s depends on curve specifications and individual requirements. Interpolating of coordinates for curve points using (6)-(9) is called by author the method of Hurwitz - Radon Matrices (MHR) [15]. So here are five steps of MHR interpolation: Step 1: Choice of knots at key points. Step 2: Fixing the dimension of matrices N= 1, 2, 4 or 8: 1 is the most universal for calculations (it needs only two nodes to compute unknown points between them) and it has the lowest computational costs (10); MHR with N = 2 uses four successive nodes to compute unknown coordinate; MHR version for A/^ = 4 applies eight successive nodes to get unknown point and MHR with N = ^ needs sixteen successive nodes to calculate unknown coordinate (it has the biggest computational costs). Step 3: Choice of distribution y = F(a): basic distribution for y = a. Step 4: Determining values of a: a= 0.1, 0.2...0.9 (nine points) orO.Ol, 0.02. ..0.99 (99 points) or others. Step 5: The computations (9). These five steps can be treated as the algorithm of MHR method of curve modeling and interpolation (6)-(9). Considering nowadays used probability distribution functions for random variable ae[0,l] - one distribution is dealing with the range [0,1]: beta distribution. Probability density function f for random variable a e [0,1] is: f{a) = c-a'-{\-ay ,s>Q,r>Q (42) When r = 0 probability density function (42) represents f{a) = c-a' and then probability distribution function F is like (12), for example f(a) = and j=a?. If s and r are positive integer numbers then y is the polynomial, for example f(a) = 6a(l-a) and y = 3a^-2a?. So beta distribution gives us coefficient y in (7) as polynomial because of interdependence between probability density f and distribution ^functions: Figure 2: Function l{x) = with 396 interpolated points using basic MHR-2 with 5 nodes. Figure 3: Function J(x) = ^+ln(7-^) with 396 interpolated points using basic MHR-2 with 5 nodes. Figure 4: Function J(x) = ^+2^-1 with 792 interpolated points using basic MHR-4 with 9 nodes. 
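To make the five-step MHR procedure and formulas (6)-(11) above concrete, a minimal sketch of the no-matrix (N = 1) variant is given below (Python; helper names are ours, and M1(p) is read as y/x so that M1·x = y). The distribution γ = F(α) is pluggable, so any of (12)-(41) can be substituted; extrapolation corresponds to α < 0 or α > 1.

    import math

    # Sketch of the N = 1 (no-matrix) MHR step between two knots, following (6)-(11):
    # c = alpha*x1 + (1-alpha)*x2, M = gamma*y1/x1 + (1-gamma)*y2/x2, y(c) = M*c,
    # with gamma = F(alpha) taken from the distribution list (12)-(41).

    def mhr_point(p1, p2, alpha, F=lambda a: a):
        (x1, y1), (x2, y2) = p1, p2
        gamma = F(alpha)
        c = alpha * x1 + (1.0 - alpha) * x2              # convex combination (6), (8)
        m = gamma * y1 / x1 + (1.0 - gamma) * y2 / x2    # weighted average OHR operator (7)
        return c, m * c                                  # interpolation (9)-(10)

    # a few of the distributions as examples: power (12), sine (15), tangent (19); s > 0
    power = lambda s: (lambda a: a ** s)
    sine  = lambda a: math.sin(a * math.pi / 2.0)
    tang  = lambda a: math.tan(a * math.pi / 4.0)

    if __name__ == "__main__":
        p1, p2 = (1.0, 2.0), (2.0, 1.0)                  # two knots of y = 2/x
        for a in (0.25, 0.5, 0.75):
            print(mhr_point(p1, p2, a), mhr_point(p1, p2, a, sine))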
Figure 5: Function /(x) = 3-2^ with 396 interpolated points using basic MHR-2 with 5 nodes. Figures 2-5 show interpolation of continues functions connected with determined formula. So these functions are interpolated and modeled. Without knowledge about the formula, curve interpolation has to implement the coefficients y (12)-(43), but MHR is not hmited only to these coefficients. Each strictly monotonic function Fbetween points (0;0) and (1;1) can be used in MHR interpolation. MHR 2D data extrapolation is possible for a < 0 or a>l. 3 Implementations of 2D probabilistic interpolation Curve knots (0.1; 10), (0.2;5), (0.4;2.5), (1;1) and (2;5) from Fig.l are used in some examples of MHR interpolation with different y. Points of the curve are calculated for 1 and y = a (11) in example 1 and with matrices of dimension N = 2 in examples 2 - 8 for a = 0.1, 0.2,^,0.9. Example 1 Curve interpolation for 1 and j= a._ Figure 6: Modeling without matrices (N = 1) for nine reconstructed points between nodes. Example 2 Sinusoidal interpolation with j = sinja ■ nil)._ ******** Figure 7: Sinusoidal modeling with nine reconstructed curve points between nodes. Example 3 Tangent interpolation for y = tanja ■ 7t/4)._ Figure 8: Tangent curve modeling with nine interpolated points between nodes. Example 4 Tangent interpolation with j = tanja" ■ 7t/4) and 5 = 1.5. 1 -V: Figure 9: Tangent modeling with nine recovered points between nodes. Example 5 Tangent curve interpolation for y = tan(as ■ n/4) and s ^ 1.797. Figure 10: Tangent modeling with nine reconstructed points between nodes. Example 6 Sinusoidal interpolation with y = sin(as ■ n/2) and s 2.759. 12 Figure 11: Sinusoidal modeling with nine interpolated curve points between nodes. Example 7 Power function modeling for y = g^ and 5=2.1205. Figure 12: Power function curve modeling with nine recovered points between nodes. Example 8 Logarithmic curve modeling with y = log2(as + 1) and s = 2.533. Figure 13: Logarithmic modeling with nine reconstructed points between nodes. These eight examples demonstrate possibilities of curve interpolation for key nodes. Reconstructed values and interpolated points, calculated by MHR method, are applied in the process of curve modeling. Every curve can be interpolated by some distribution function as parameter y. This parameter is treated as probability distribution function for each curve. 4 MHR 2D modeling versus polynomial interpolation 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0. The smallest difference is calculated for s = 1.5: |wl-3.e94|+|w3- 1.632 \+\w2-23m |+|w4- 1.32? | = 0.266 (44) Calculations for average OHR operator M(7) are done: ■ 2.40J 2.062 ■ Pl' ■ 3.694 ' M = M- 4.1 Example 4.1 -2.002 2.405 1.032 Let us consider a graph of function f(x) = 2/x in range [0.4,1.6]. There are given five interpolation nodes for x = 0.4, 0.7, 1.0, 1.3, 1.6. The curve / = 2/x reconstructed by basic MHR method (12) with N = 2 for s = 1 looks not precisely: Figure 14: The curve y = 2/x reconstructed by MHR method for y = a (7) and five nodes together with 36 computed points. Lagrange interpolation polynomial is not to be accepted: M = 1375 0.7S2 -0.7S2 1.37J M- p2- ■ 2.302 ■ 1.329 Figure 15: Polynomial interpolation of function y = 1/x is out of acceptation. For better reconstruction of the curve, appropriate parameter s in MHR method (12) is calculated. 
Choice of parameter s is connected with comparison of accurate values Wj for function j{x) = 2/x in control points p,, situated in the middle between interpolation nodes (a=0.5), and values in control points p, computed by MHR method. Control points are settled in the middle between interpolation nodes, because interpolation error of MHR method is the biggest. Choice of coefficient s is done by criterion: difference between precise values Wi and values reconstructed by MHR method is the smallest. Control points pi in this example are established for = 0.55, 0.85, 1.15, 1.45. Four values of the curve are compared for parameter s = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, Computed values appear in (44). Other results: a) ^0.1 |wl-i.Si; | + |w3- 1.339 | + |w2-3.209 |+| w4- l.ffll | = 3.4M b) ^0.2: |wl-J.;87 | + |w3- 1.90S | + |w2-3.111 |+| w4- 1.J72 | = 3.008 c) ^0.3: I wl - i.373 l + l w3- 1.375 | + | w2- 3.02 | + | w4- 1.544 | = 2.704 d) ^0.4: I wl - 5.174 l + l w3- l.S4e | + |w2-2.93J |+| w4- 1.519 | = 2.366 e) ^0.5: I W1-4.9SS | + |w3- 1.S13 l + l w2-2.356 |+U4- 1.495 j =2.05 f) ^0.6: I wl - 4.S15 j + l w3- 1.794 1 +I w2 - 2.732 | + | w4- 1.473 | = 1.756 g) ^0.7: I wl - 4.653 j + j w3- 1.771 |+| w2- 2.712 j + j w4- 1.452 | = 1.4S h) ^0.8: I wl - 4.503 j + l w3- 1.749 |+| w2- 2.64S |+| w4- 1.433 j = 1.225 i) ^0.9: I wl - 4.362 j + j w3- 1.72S |+| w2 - 2.538 |+| w4- 1.415 j = 1.003 j) ^l: |wl-4.23 | + |w3 - 1.709 j + j w2 - 2.532 |+| w4- 1.398 | = 0.322 k) ^1.1: I wl - 4.103 I +1 w3 - 1.692 I + I w2 - 2.479 I + I w4 - 1.332 I = 0.648 1) ^1.2: I wl - 3.994 j + l w3- 1.675 | + | w2- 2.43 |+| w4- 1.368 | = 0.51 m) ^1.3: |wl- 3.837 | + |w3- 1.66 |+|w2- 2.385 |+|w4- 1.354| = 0.337 n) ^1.4: I wl- 3.787 | + |w3 - 1.645 |+|w2- 2.342 | + |w4- 1.341 j =0.294 0) ^1.6: I wl - 3.603 l+j w3- 1.619 j+j w2 - 2.265 | + | w4- 1313 | =0.298 p) ^1.7: I wl - 3.527 j+j w3- 1.60S j + j w2- 2.231 | + | w4- 1.308 j = 0.434 q) ^1.8: I wl - 3.451 j+j w3 - 1.597 |+| w2- 2.199 | + | w4- 1.298 j = 0.563 r) ^1.9: I wl - 3.381 l+l w3 - 1.537 | + | w2- 2.169 j + j w4- 1.289 j = 0.682 s) ^2.0: I wl-3.315 |+|w3- 1.577 | +| w2 - 2.14 |+j w4- 1.281 j =0.795 One can see the changes of interpolation error. Reconstruction of the curve y=2/x by MHR method for N = 2 and y = a1.5 looks as follows: Figure 16: The curve y = 2/x modeled via MHR method (12) for s = 1.5 and five nodes together with 36 reconstructed points. Figure 16 represents the curve y = 2/x more precisely then classical interpolation and Figure 14. If we would like to have better parameter s (with two digits after coma), calculations are done: a) ^1.49: 1^1 - 3.703 | + |w3- 1.033 |+|w2 - 230ö | + |w4- 1.33 | = 0.203 b) ^1.51: I wl - 3.e8J l + l w3- 1.Ö31 l + l w2 - 2.299 | + | w4- 1.328 | = 0.2Ö2 c) ^1.52: I wl-3.677 | + |w3- 1.629 | + | w2-2.29; | + | w4- 1.327 | = 0.261 d) ^1.53: I wl - 3.66s l + l w3- 1.628 |+| w2- 2.291 |+| w4- 1.32Ö | = 0.258 e) ^1.54: I vrl - 3.659 l+l w3- 1.627 | + | w2- 2.2S7 |+| w4- 1.325 | = 0.255 f) ^1.55: |wl-3.65 | + |w3 - 1.626 |+| w2 - 2.284 | +| w4- 1.324 | =0.251 g) ^1.56: I wl- 3.642 l+l w3- 1.624 l+l w2-2.2S | + |w4- 1.323 | = 0.25 h) s=1.57: I wl- 3.633 l + l w3 - 1.623 | + |w2- 2.276 | + | w4- 1.321 | = 0.255 i) ^1.58: I wl- 3.625 | + |w3- 1.622 |+|w2 - 2.273 | + |w4- 1.32 | = 0.268 j) ^1.59: I wl-3.616 l + l w3- 1.621 |+|w2 - 2.269 | + | w4- 1.319 | = 0.283 Model of the curve for s = 1.56 in (12) is more accurate then with s = 1.5. Convexity of reconstructed curve is very important factor in MHR method with parameter s. 
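The selection of s above is a plain grid search over control points placed midway between the interpolation nodes; the sketch below (Python, illustrative names) shows that selection loop. For brevity it uses the no-matrix N = 1 step with γ = α^s rather than MHR-2, so the absolute error values will not reproduce (44), but the selection logic is the same.

    # Sketch of the parameter-s selection of Section 4.1: control points sit midway between
    # nodes (alpha = 0.5) and the s minimising the summed absolute error is kept.
    # Uses the N = 1 step with gamma = alpha**s, not MHR-2, so errors differ from (44).

    def mhr_point(p1, p2, alpha, s):
        (x1, y1), (x2, y2) = p1, p2
        gamma = alpha ** s                               # distribution (12)
        c = alpha * x1 + (1.0 - alpha) * x2
        return c, (gamma * y1 / x1 + (1.0 - gamma) * y2 / x2) * c

    def select_s(f, nodes, candidates):
        knots = [(x, f(x)) for x in nodes]
        best = None
        for s in candidates:
            err = 0.0
            for p1, p2 in zip(knots, knots[1:]):         # one control point per node pair
                c, y_hat = mhr_point(p1, p2, 0.5, s)
                err += abs(f(c) - y_hat)
            if best is None or err < best[1]:
                best = (s, err)
        return best

    if __name__ == "__main__":
        f = lambda x: 2.0 / x
        print(select_s(f, [0.4, 0.7, 1.0, 1.3, 1.6],
                       [round(0.1 * k, 1) for k in range(1, 21)]))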
Appropriate choice of parameter s is connected with regulation and controlling of convexity: model of the curve (Fig. 16) preserves monotonicity and convexity. 4.2 Example 4.2 This example considers a graph of function f(x) = 1/(1 + 5x') in the range [-1,1]: Figure 17: A graph of function f( x) = 1/(1 + 5x2) in range [-1,1]. There are given five interpolation nodes for x = -1.0, -0.5, 0, 0.5, 1.0. It is an example of function with useless Lagrange interpolation polynomial: Runge phenomenon and unpleasant two minima. Figure 18: Lagrange interpolation polynomial differs extremely from a graph of function f(x) = 1/(1 + 5^^). Model of the curve y = 1/(1 + 5x^ ) in basic MHR method (12) for 2 and s = 1: Figure 19: The curve y = 1/(1 + 5a^) modeled via MHR method for y = a (7) and five nodes together with 36 reconstructed points. Reconstructed curve (Fig. 19) preserves monotonicity and symmetry for s = 1. Comparing precise values Wj with values computed by MHR method in control points p, fixed for = -0.75, -0.25, 0.25, 0.75, identical the best results appear for s = 1 (12): I wl - 0.299 1+ I w3 - 0.6SS | + | w2 - 0.6SS | + | w4 - 0.299 | = 0.221 (45) and for s = 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.7, 1.8, 1.9. But only for s = 1 reconstructed curve is symmetric- look at values in (45). So if a case s = 1 appears among the best results, then model ought to be built for s = 1. And if a case s = 1 does not appear among the best results, then model ought to be built for s near by 1. Here is an example of reconstructed curve for s=1.5 (without feature of symmetry): -ar-* ♦ ♦ rltn ♦ ♦ . ^ n f ♦ ^ iJ^U / „ V D,2 -1-1-e— % 1 1 -0.5 0.5 Figure 20: The curve y = 1/(1 + 5x^) modeled via MHR method for s = 1.5 (12) and five nodes together with 36 reconstructed points. The best model of the curve y = 1/(1 + 5x^), built by MHR method for s = 1 (Fig. 19), preserves monotonicity and symmetry. Convexity of the function (Fig. 17) can make some troubles. 5 Future research directions Future works with MHR method are connected with object recognition after geometrical transformations of curve (translations, rotations, scaling)- only nodes are transformed and new curve (for example contour of the object) for new nodes is reconstructed. Also estimation of object area in the plane, using nodes of object contour, will be possible by MHR interpolation. Object area is significant feature for object recognition. Future works are dealing with smoothing the curve, parameterization of whole curve and possibility to apply MHR method to three-dimensional curves. Also case of equidistance nodes must be studied with all details. Another future research direction is to apply MHR method in artificial intelligence and computer vision, for example identification of a specific person's face or fingerprint, optical character recognition (OCR), image restoration, content-based image retrieval and pose estimation. Future works are connected with object recognition for any set of contour points. There are several specialized tasks based on recognition to consider and it is important to use the shape of whole contour for identification and detection of persons, vehicles or other objects. Other applications of MHR method will be directed to computer graphics, modeling and image processing. 
6 Conclusion
The method of Hurwitz-Radon Matrices (MHR) enables interpolation of two-dimensional curves using different coefficients γ: polynomial, sinusoidal, cosinusoidal, tangent, cotangent, logarithmic, exponential, arcsin, arccos, arctan, arcctg or power function [16], and also inverse functions. The function for the γ calculations is chosen individually for each curve modeling task and is treated as a probability distribution function: γ depends on the initial requirements and the curve specifications. The MHR method leads to curve interpolation via a discrete set of fixed knots, so MHR makes possible the combination of two important problems: interpolation and modeling. The main features of the MHR method are:
a) the smaller the distance between knots, the better;
b) calculations for coordinate x close to zero and near an extremum require more attention;
c) MHR interpolation of a function is more precise than linear interpolation;
d) a minimum of two interpolation knots is needed for calculations without matrices (N = 1), but MHR is not a linear interpolation;
e) interpolation of L points has a computational cost of order O(L);
f) MHR is a well-conditioned method (orthogonal matrices) [17];
g) coefficient γ is crucial in the process of probabilistic curve interpolation and is computed individually for a single curve;
h) MHR 2D data extrapolation is possible for α < 0 or α > 1.
Future work will address the choice and features of coefficient γ, the implementation of MHR in object recognition [18], shape geometry, and contour modeling and parameterization [19].
References
[1] Collins II, G.W.: Fundamental Numerical Methods and Data Analysis. Case Western Reserve University (2003)
[2] Chapra, S.C.: Applied Numerical Methods. McGraw-Hill (2012)
[3] Ralston, A., Rabinowitz, P.: A First Course in Numerical Analysis - Second Edition. Dover Publications, New York (2001)
[4] Zhang, D., Lu, G.: Review of Shape Representation and Description Techniques. Pattern Recognition 1(37), 1-19 (2004)
[5] Schumaker, L.L.: Spline Functions: Basic Theory. Cambridge Mathematical Library (2007)
[6] Dahlquist, G., Bjoerck, A.: Numerical Methods. Prentice Hall, New York (1974)
[7] Eckmann, B.: Topology, Algebra, Analysis - Relations and Missing Links. Notices of the American Mathematical Society 5(46), 520-527 (1999)
[8] Citko, W., Jakobczak, D., Sienko, W.: On Hurwitz-Radon Matrices Based Signal Processing. Workshop Signal Processing at Poznan University of Technology (2005)
[9] Tarokh, V., Jafarkhani, H., Calderbank, R.: Space-Time Block Codes from Orthogonal Designs. IEEE Transactions on Information Theory 5(45), 1456-1467 (1999)
[10] Sienko, W., Citko, W., Wilamowski, B.: Hamiltonian Neural Nets as a Universal Signal Processor. 28th Annual Conference of the IEEE Industrial Electronics Society IECON (2002)
[11] Sienko, W., Citko, W.: Hamiltonian Neural Net Based Signal Processing. The International Conference on Signal and Electronic System ICSES (2002)
[12] Jakobczak, D.: 2D and 3D Image Modeling Using Hurwitz-Radon Matrices. Polish Journal of Environmental Studies 4A(16), 104-107 (2007)
[13] Jakobczak, D.: Shape Representation and Shape Coefficients via Method of Hurwitz-Radon Matrices. Lecture Notes in Computer Science 6374 (Computer Vision and Graphics: Proc. ICCVG 2010, Part I), Springer-Verlag, Berlin Heidelberg, 411-419 (2010)
[14] Jakobczak, D.: Curve Interpolation Using Hurwitz-Radon Matrices. Polish Journal of Environmental Studies 3B(18), 126-130 (2009)
[15] Jakobczak, D.: Application of Hurwitz-Radon Matrices in Shape Representation.
In: Banaszak, Z., Swic, A. (eds.) Applied Computer Science: Modelling of Production Processes 1(6), pp. 63-74. Lublin University of Technology Press, Lublin (2010)
[16] Jakobczak, D.: Object Modeling Using Method of Hurwitz-Radon Matrices of Rank k. In: Wolski, W., Borawski, M. (eds.) Computer Graphics: Selected Issues, pp. 79-90. University of Szczecin Press, Szczecin (2010)
[17] Jakobczak, D.: Implementation of Hurwitz-Radon Matrices in Shape Representation. In: Choras, R.S. (ed.) Advances in Intelligent and Soft Computing 84, Image Processing and Communications: Challenges 2, pp. 39-50. Springer-Verlag, Berlin Heidelberg (2010)
[18] Jakobczak, D.: Object Recognition via Contour Points Reconstruction Using Hurwitz-Radon Matrices. In: Jozefczyk, J., Orski, D. (eds.) Knowledge-Based Intelligent System Advancements: Systemic and Cybernetic Approaches, pp. 87-107. IGI Global, Hershey PA, USA (2011)
[19] Jakobczak, D.: Curve Parameterization and Curvature via Method of Hurwitz-Radon Matrices. Image Processing & Communications - An International Journal 1-2(16), 49-56 (2011)
Swarm Intelligence and its Application in Abnormal Data Detection
Bo Liu, Mei Cai and Jiazong Yu
College of Information Science and Technology, Jinan University, Guangzhou, China
E-mail: ddxllb@163.com
Keywords: swarm intelligence, abnormal data, detection, ACO, PSO, BCO
Received: June 18, 2014
This study addresses swarm intelligence-based approaches to data quality detection. First, three typical swarm intelligence models and their applications in abnormity detection are introduced: Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO) and Bee Colony Optimization (BCO). Then, three approaches based on ACO, PSO and BCO are presented for the detection of attribute outliers in datasets. These approaches use different search strategies on the data items; however, they choose the same fitness function (i.e. the O-measure) to evaluate the solutions, and they make use of swarms of the fittest agents and randomly moving agents to obtain superior solutions by changing the searching paths or positions of the agents. The three algorithms are described and explained; they are efficient by heuristic principles.
Povzetek: Opisane so tri metode z roji za analizo kvalitet podatkov: na osnovi mravelj, delcev in čebel.
1 Introduction
Swarm Intelligence (SI) is a branch of Artificial Intelligence that is used to model the collective behaviour of social swarms (groups of agents) in nature, such as ant colonies, honey bees and bird flocks [1]. Each agent can not only interact with its local environment and other agents, but also respond to environmental change immediately, so that it can always find its objective. Swarm-based algorithms have emerged as a family of nature-inspired, population-based algorithms that are capable of producing low-cost, fast and robust solutions to several complex problems [2][3]. A number of new developments in the theory and applications of bio-inspired computation are introduced in [4][5][6]. For instance, Cui et al. [7] proposed the artificial plant optimization algorithm with detailed case studies, and Yang et al. [8] edited a special issue on meta-heuristics and swarm intelligence in engineering and industry. Data quality mining (DQR) is a new and promising application of data mining techniques for the detection, quantification and correction of data quality deficiencies in very large databases.
Typical data mining techniques, such as clustering, classification, association analysis, have been used to discover abnormal data in databases[9][10]. Recently swarm-based DQR approaches have also been developed. For example, Alam et al.[11] proposed SI-based clustering approaches for outlier detection, and Jindal et al.[12] surveyed of SI in intrusion detection. The motivation of this study is dedicated to the application of SI in data quality mining, which explores new approaches for abnormal data discovery, avoiding searching the whole data space, such that improves efficiency and accuracy in abnormal data detection. The paper introduces typical SI models, and investigate their applications in abnormal data or behaviour detection. It discusses the inadequacy of current SI-based solutions in abnormal data detection, and proposes new approaches utilizing three SI models for detection of abnormal attributes. The rest of this paper is organized as follows: in Section 2, typical SI models are introduced. We give a survey of SI-based methods for outlier detection in Section 3. New approaches based on SI for abnormal attribute detection are presented in Section 4, and the paper concludes with our final remarks in Section 5. 2 Typical SI models SI was first introduced by G. Beni and J. Wang in the global optimization framework as a set of algorithms for controlling robotic swarm [13], which has become one of popular natural computing[14] areas. The most popular models of SI include: Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO) and Bee Colony Optimization (BCO). The main ideas of the three models are discussed below. There is no single 'Ant Model', rather there is a family of models, each inspired by a different aspect of ant behaviour[15]. These models include those inspired by ant-foraging behaviour, brood-sorting behaviour, cemetery formation behaviour and cooperative transport. One typical behaviour is ant foraging. Each ant deposits a chemical substance (called a pheromone) on the ground while walking[16], and the pheromone encourages the following ants to stay close to the previous path. Meanwhile the pheromone evaporates over time, allowing search exploration. Dorigo and Maniezzo illustrated the complex behaviour of ant colonies through a number of experiments in [17], which showed that the ants following the shorter path will make more visits to the source than those following the longer path. ACO is inspired by observation of the behaviour of real ants, which was proposed by Dorigo and colleagues[18] as a method for solving combinatorial optimization problems. PSO was introduced by J. Kennedy et al.[19]. This is a population-based optimization technique inspired by the social behaviour of bird flocking and fish schooling. It was observed that large numbers of birds often change direction suddenly, and scatter and regroup. The synchrony of flocking behaviour is thought to be a function of the birds' efforts to maintain an optimum distance between themselves and their neighbours. PSO researchers have implemented such natural processes to solve optimization problems in which each single solution, called a particle, joins other individuals to make up a swarm (population) for exploring within the search space. Each particle has a fitness value calculated by a fitness function, and a velocity of moving towards the optimum[20]. The PSO process starts by creating random particles and initializing the best positions for each particle and for the whole population. 
It then iterates the computation of the positions and velocities of each particle movement and the update of the local best position for each particle and the global best position for the entire swarm. PSO has been continually modified trying to improve its convergence properties, and the PSO variants have been used to solve a wide range of optimization and inverse problems [21]. BCO was proposed by D. Karabago as a new member of the family of SI algorithms[22]. In nature, bees reach a nectar source by following a nest-mate who has already discovered a patch of flowers. When a forager bee (recruiter) decides to attract more bee mates to a newly-discovered good food source, it returns to the hive and starts performing what is known as the waggle dance to communicate spatial and profitability information about the discovered food source, and recruit more honey bees (dancer followers) to exploit it [1]. BCO is based on a swarm of artificial bees cooperating together to solve a problem. First, a bee, named InitBee, settles to find a solution with good features. From this first solution a set of other solutions called SearchArea is determined by using a certain strategy. Then, every bee will consider a solution from SearchArea as its starting point in the search, and communicates the best visited solution to all its neighbours. However, if after a period of time the swarm observes that the solution is not improved, it introduces a criterion of diversification, preventing it from being trapped in a local optimum [23]. 3 Survey of Sl-based methods for outlier detection An abnormity is an object that does not conform to the normal behaviour of a dataset. Because abnormal data are very much in the minority in a dataset, they are also called outliers or noises. Although abnormal data or outliers may not be errors, detecting them is the most valid way to eliminate errors. There are two kinds of outliers: class outliers and attribute outliers[24]. Class outliers are records or tuples belonging to rare categories or labelled with wrong classes, and attribute outliers are attribute values deviating from the normal distribution or error attribute values in dataset records. In other words, they are at two levels: the record level and the attribute value level. Many studies have used traditional data mining or machine learning techniques to discover outliers, which need training on classifying rules, data patterns, or clustering instances of the dataset. There are also several approaches based on SI without training for detecting outliers, which are simple and heuristic. We focus on SI-based solutions in the rest of this study. In [12], an ACO-based approach of clustering was proposed as a data preprocessing procedure, during which outliers can be detected. In this process, the continuity of ants for similar data points and dissimilar data points is collected into the next nodes. Each ant of the data points compares their property values according to the initial data point set, checks the importance of the data points and iteratively updates the values of the pheromones laid by other ants. Lastly, the data point selection matrix for obtaining final clustering results using another algorithm is generated; at the same time, those unselected data points are outliers. Without setting the number of clusters and initial centre points, the ACO-based clustering method can obtain the high clustering correctness rate and outlier filtering rate. 
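As an illustration of the PSO update loop described in Section 2 (velocity and position updates driven by each particle's personal best and the swarm's global best), here is a minimal global-best PSO sketch; the inertia and acceleration constants and the toy objective are arbitrary choices for the example, not values prescribed by the systems surveyed here.

    import random

    # Minimal global-best PSO sketch: each particle keeps a velocity, its personal best and
    # the swarm's global best, following the loop described in Section 2.  The coefficients
    # below are common illustrative defaults, not values taken from the cited papers.

    def pso(fitness, dim=2, swarm_size=20, iters=100, w=0.7, c1=1.5, c2=1.5, bound=5.0):
        pos = [[random.uniform(-bound, bound) for _ in range(dim)] for _ in range(swarm_size)]
        vel = [[0.0] * dim for _ in range(swarm_size)]
        pbest = [p[:] for p in pos]
        pbest_val = [fitness(p) for p in pos]
        g = min(range(swarm_size), key=lambda i: pbest_val[i])
        gbest, gbest_val = pbest[g][:], pbest_val[g]
        for _ in range(iters):
            for i in range(swarm_size):
                for d in range(dim):
                    r1, r2 = random.random(), random.random()
                    vel[i][d] = (w * vel[i][d]
                                 + c1 * r1 * (pbest[i][d] - pos[i][d])
                                 + c2 * r2 * (gbest[d] - pos[i][d]))
                    pos[i][d] += vel[i][d]
                val = fitness(pos[i])
                if val < pbest_val[i]:                   # update personal best
                    pbest[i], pbest_val[i] = pos[i][:], val
                    if val < gbest_val:                  # update global best
                        gbest, gbest_val = pos[i][:], val
        return gbest, gbest_val

    if __name__ == "__main__":
        sphere = lambda x: sum(v * v for v in x)         # toy minimisation problem
        print(pso(sphere))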
Alam et al.[11] proposed a SI-based clustering technique called Hierarchical PSO Based Clustering (HPSO-clustering) for outlier detection. HPSO-clustering uses a partitional approach to generate a hierarchy of clusters. The initial generation consists of the entire swarm. The swarm is then evolved towards a single cluster by merging two clusters of the swarm in each successive generation. For each particle of the swarm, updating of the position is continued until the particle comes to its most suitable position inside the cluster and becomes the true representative of the centroid. The cluster-based approach helps to find outliers based on their distance to the centroids. Less dense data falling at considerable distances from the centroid of the nearest cluster are considered as potential outliers. Intrusion detection in computer networks is one of the common areas of application of outlier detection. There are some intrusion detection approaches based on ACO, PSO and BCO, and most of them categorize the detected behaviour into normal or abnormal, thus, in a sense, the intrusion detection problem is reduced to a classification or clustering problem[25]. Several examples are given below. Tsai et al. [26] described an intrusive analysis model based on the design of honeypots and an ant colony; the ACO is applied to trace the trail of attack and analyse the habits and interests of aggressors. Soroush et al.[27] presented one of the pioneering studies, in which ACO is used for intrusion detection, unlike previous approaches in which it was used for intrusion response. Ramos and Abraham were two of the first researchers who attempted to introduce the LF algorithm[28] into the intrusion detection realm[29]. Their model, called ANTIDS, was based on a number of ant-like agents that pick up and drop items with a certain probability to form clusters, distinctively, the cluster with abnormal attributes is typically much smaller in size. Most of the PSO-based intrusion detection systems (IDS) are hybrid anomaly detection systems, Kolias et al. [25] classified them into: (a) hybrid PSO-Neural Network Systems, (b) hybrid PSO-Support Vector Machine Systems, (c) hybrid PSO-K-means Systems and another category containing intrusion detection systems that employ PSO for the extraction of classification rules. In addition, Osama et al.[30] proposed an approach using BCO as a search strategy for subset generation, and using Support Vector Machine (SVM) as the classifier based on the selected feature subset, in which the feature selection is one of the prominent factors influencing the quality of intrusion detection systems. In this approach, first, an initial scout bee population is randomly generated, then the fitness of all the bees in the population is measured iteratively. The bees that have the highest fitness are selected as elite bees and the sites visited by them are chosen for neighbourhood search. At the end, SVM is trained based on the best feature subset. As a whole, SI-based approaches have several advantages over other techniques. More specifically, SI techniques aim to solve complex problems by employing multiple but simple agents without a need for any form of supervision. Every agent collaborates with others toward finding the optimal solution via direct or indirect communication (interactions). These agents are used for discovering classification rules, clusters in anomaly detection[25]. 
However, to our knowledge, existing SI-based approaches have focused on discovering class-level abnormities; no SI-based approaches for discovering attribute-level abnormities in a dataset have been studied. In reality, attribute abnormities occur more frequently, and correcting abnormal attributes rather than eliminating the tuple outliers has the advantage of retaining information. Koh et al. [31] made the following observations: class outliers are often the result of one or more attribute outliers, and even when attribute outliers do not affect class memberships, they may still interfere with data analysis; in addition, many real-world datasets do not contain class attributes or distinct clusters, and it is meaningful to identify attribute outliers, which may be source errors. Thus we study new approaches based on SI for attribute outlier detection in the following.
4 New ideas on SI approaches for attribute outlier detection
Some studies [9][24] have used attribute correlation analysis to detect attribute outliers in a dataset, but the time complexity of these algorithms is very high. A distribution-based approach [32] eliminates attribute values that do not fit the distribution models of the dataset, but the accuracy largely depends on the best-fit distribution models used [31]. In this section, we discuss SI approaches for attribute outlier detection. The main motivation for using SI in the discovery of outliers is that SI methods perform a global search and cope better with attribute interaction, so the time complexity is reduced.
ID   Name    Gender   City        Province    Country
1    L-199   F        Guangzhou   Guangdong   China
2    L-200   M        Guangzhou   Guangdong   China
3    L-201   M        Nanning     Guangxi     China
4    L-202   F        Guangzhou   Guandon     China
5    L-203   M        Nanchang    Jiangxi     China
Table 1: An example of a dataset R.
4.1 Outlier measurement
Rarity is not an indicator of attribute outliers; observations should instead be drawn from the correlation behaviour of attributes. Three measures, namely the O-measure, the P-measure and the Q-measure, are provided in [31]. We use the O-measure in our approach. The related definitions, following [31], are as follows.
Definition 1 (Support) Let R be a relation with m attributes A1, A2, ..., Am. Let S be a projection on R over attributes Au, ..., Av, i.e. S = πAu,...,Av(R). The support of a tuple s in S, denoted by sup(s), is the count of the tuples in R that have the same values for attributes Au, ..., Av as the item set s. For example, in Table 1, let S = πCity,Province(R); then sup(Guangzhou, Guangdong) = 2 and sup(Nanning, Guangxi) = 1.
Definition 2 (Neighbourhood) Let tuple s = <au, ..., av>. Suppose Av is the target attribute, the extent of deviation of which we are interested in determining. The neighbourhood of Av w.r.t. s is defined as N(Av, s) = <au, ..., av-1>, i.e. the tuple s without the value of the target attribute. The support of N(Av, s) is the count of tuples in R that agree with s on all the attributes of S except Av. For example, in Table 1, let S = πCity,Province(R) and consider the tuple s = <Guangzhou, Guangdong>; then sup(N(Province, s)) = 3 and sup(N(City, s)) = 2.
Definition 3 (O-measure) The O-measure of the target attribute Av w.r.t. s is defined as
O-measure(Av, s) = ( Σ_{j=u, j≠v} sup(N(Aj, s)) ) / sup(N(Av, s))    (1)
For example, let s = <Guangzhou, Guandon, China> be a tuple of S = πCity,Province,Country(R). The support of N(Province, s) is 3, while sup(N(Country, s)) and sup(N(City, s)) are both 1. The O-measure of the Province attribute w.r.t. s is therefore (1 + 1)/3 ≈ 0.67, i.e. O-measure(Province, s) ≈ 0.67. For comparison, we also compute the O-measure of the Province attribute in tuple t = <Guangzhou, Guangdong, China>.
We have O-measure(Province, t) = (2 + 2)/3 = 1.33. The lower the O-measure score, the more likely it is that attribute Av is an attribute outlier in a tuple. In Table 1, O-measure(Province, s) is relatively lower than O-measure(Province, t), so Guangdon is more likely an to be attribute outlier. Consider a relation R with m attributes and n tuples. In the worst case, scanning all data subspaces (or projected relations) in R requires 0(i3x2™) searches, where 2™ is the total number of projected relations on R. Therefore, computing the 0-measure scores for each attribute w.r.t every projected relation requires 0(2"'xnxin) time complexity [31]. Obviously, the approach of searching all data subspaces of a relation for detecting outhers is highly inefficient. Thus, we make use of a swarm of agents to search heuristically and randomly. Although different kinds of agents use different search strategies, if they use the same fitness function to evaluate the solutions, they tend to obtain a similar solution set. After stopping searching, the tuples having lower 0-measure values (below a threshold) in the solution set include attribute outliers. This does not mean that they are all errors. 4.2 Overview of ACO for detecting attribute outlier In a relational dataset, each tuple or its projection over some attributes can be represented as a path in a dataset graph (DG). A DG for a dataset with the attribute number m is defined as G = (V, E), where V is a set of all data items, i.e. V = {A = v| Ais an attribute name, and v is a value in the domain of A}, and Vi is a subset of V, which only includes data items related to the attribute Aj, U™j I/ = K; E is a set of directed edges between two data items in each tuple, i.e. E = {| xG Vi, y G Vj, i^j}. For example, a dataset is shown in Table I, and a DG for its projection S=nprovmce, city, comtiy(R) is given in Figure I, where V = Vi^Vz^ V3 = {Vii, V12, V13} ^ {V21, V22, V23, V24} ^ {V31 } = {Guangzhou, Naning, Nanchang} ^ {Guangdong, Guangxi, Guangdon, Jiangxi} ^ {China}} T,(f = 0) = 1 Edge _ number (2) Figure I: A DG of the dataset projection in Table I. For a group of ants, the main steps in detecting attribute outliers by the constructed DG are shown as Algorithm I. Some explanations of Algorithm I are as follows: •In step 2), the pheromone of each edge e^ {l1 (2) {n'f(Xj) +b'>\- if c = i (3) j=i....j (4) Where training sample is mapping into high dimensional space by function ^ and regularization parameters c. Minimizing (i/2)(h'/h'means that we have to minimize2/II ^n, the margins between two group data. The penalty f is to reduce the number of training >1 error The core concept of support vector machine is to seek a balance between the regularization term(i/2)(H'yH' and the training errors. We get k decision functions after solving formula (4). {WY(l>{x)+b' ■■■ (5) {v^f(t)(x) + b'' After that, we have the largest value of decision functions as the predictive class of x. 3.1.7 Linear support vector machines SVM uses a nonlinear mapping that converts low-dimensional feature space into a high-dimensional space to get better discriminative. However, in other applications, the input feature itself is a high-dimensional space, if more is mapped to more high latitudes, it may not be able to get better performance. Their own space can be directly used as identification. The linear support vector machine SVM [38] is suitable for this scenario. For multi-classification. Crammer et al. [39] propose this method to solve the problem. 
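Returning briefly to the O-measure of Definition 3: the short sketch below (Python; the rows are transcribed from Table 1) computes the scores for the two tuples of the running example and reproduces the values of about 0.67 and 1.33 discussed above.

    # Sketch of the O-measure (Definition 3) on the Table 1 projection (City, Province, Country).
    # sup(N(A_v, s)) counts rows agreeing with s on all attributes except the target A_v.

    ROWS = [("Guangzhou", "Guangdong", "China"),
            ("Guangzhou", "Guangdong", "China"),
            ("Nanning",   "Guangxi",   "China"),
            ("Guangzhou", "Guandon",   "China"),   # row 4 carries the attribute outlier
            ("Nanchang",  "Jiangxi",   "China")]

    def neighbourhood_support(rows, s, target):
        keep = [i for i in range(len(s)) if i != target]
        return sum(all(r[i] == s[i] for i in keep) for r in rows)

    def o_measure(rows, s, target):
        others = sum(neighbourhood_support(rows, s, j)
                     for j in range(len(s)) if j != target)
        return others / neighbourhood_support(rows, s, target)

    if __name__ == "__main__":
        s = ("Guangzhou", "Guandon", "China")      # tuple s of the running example
        t = ("Guangzhou", "Guangdong", "China")    # tuple t used for comparison
        print(round(o_measure(ROWS, s, 1), 2))     # ~0.67 -> Province value likely an outlier
        print(round(o_measure(ROWS, t, 1), 2))     # ~1.33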
We define the original question as follow Supposing the training data set, where, A; G;t:cÄ°is feature vector, g r={i,2,...,i} is the type of instance, 2=1,2,...,;. The multi-class problem can be formulated as the following primal problem. min -T \\wj\ + Cy^ ,«>2^'»" (6) where, c> o is the regularization parameter, is the weight vector associated with class = , and(T^^ = i if c, = m , (T^ ^ = 0 if c^^ m . Note that in (7), the constant m= c, corresponds to the non-negativity constraint: > o . The decision function is arg max (7) 3.1.8 K nearest neighbors K nearest neighbours algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbour 3.1.9 Multi-class logistic classifier (maximum entropy classifier) In some cases, multi-class logistic regression well fits features. It has the formula [25] exp(wix) p(y= c\ x,W) = - (8) Where, p(y= C \ X, W) is the predictive probability. 7 is the class type of totally C . W^ is the weight of class c, and approximated by maximum posterior probability. With this, the log-likelihood has the form N =z 1=1 C=1 r C A r c This model can be (9) Where, = /<7, = c 1 »0 optimized by L-BFGS [41]. 3.1.10 Multilayer perceptron A multilayer perception (MLP) is a feed forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. A MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function. MLP[57] has evolved over the years as a veiy powerful technique for solving a wide variety of problems. Much progress has been made in improving performance and in understanding how these neural networks operate. However, the need for additional improvements in training these networks still exists since the training process is veiy chaotic in nature. 3.1.11 Naive Bayesian classifier A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model" Assuming sample z , belongs to type C, . The class-conditional density is f(X\c,)P(,q)_ f(X\q)P(,q) p{X) f{X\ Cj)P(Cj) (10) Where, /(| c) is the maximum likelihood. Input features is -¥ , C is class type. A simpler alternative to generative and discriminative learning is to dispense with probabilities altogether, and to learn a function, called a discriminant function, that directly maps inputs to outputs. The decision function of naive Bayesian classifier is c = arg max P(C = c, )n P( Y, = y, I C = c, ) (11) 3.1.12 Conditional random fields classifier CRFs (Conditional random fields) are a class of statistical modelhng method often applied in pattern recognition and machine learning, where they are used for structured prediction. 
Whereas an ordinaiy classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account; e.g., the hnear chain CRF popular in natural language processing predicts sequences of labels for sequences of input samples. We define a CRF on observations z and random variables y as follows: Let G = {v,E) be a graph such that , so that y is indexed by the vertices of g . Then (x,Y)is a conditional random field when the random variables Yu, conditioned on X, obey the Markov property with respect to the graph: p( Y, I X, Y„ w v) = p(Y, I X, Y,, w- v) (12) Where f means that w and v are neighbors in G. What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets z and r , the observed and output variables, respectively; the conditional distribution p(y | x) is then modeled. For classification problem, we compute the maximum conditional probabihstic distribution. 3.2 Comparison of algorithms With reference to [45] for the multi-classification, we made a comparison of 12 algorithms as shown in table 2. We analyze and summarize in Table 2 as follow: 1. Algorithms type Aside from Naive Bayes, others are discriminant model. Bayesian algorithm by learning the joint distribution P(C,X) , then obtains the conditional probability P(X\ C) . The classilying prediction is achieved by maximizing likelihood approximation. However the discriminant method directly makes prediction by the discriminant function or conditional probability. 2. Algorithms trait Decision trees, random forest, extreme random tree, multi-class adaboost upgrade and stochastic gradient boosting all belong to model with adaptive basis functions that can be grouped into common additive model[25], as shown in Equation 13. f(x) = a+ ) + ...+ f„(x„) (13) Where f, is the sub-model obtained through training sample. f(x) is the superposition of sub-models. Decision tree is the basic sub-model for tree-like algorithms. Upon whether the use of all samples for training, these kinds of algorithms can be divided into random forests and extremely random tree. Random forests is to build submodel through random bagging sampling. However, extremely random tree obtains sub-model using all training samples, but randomly selecting the splitting threshold. Multi-class adaboost and stochastic gradient boosting is a linear combination of sub-models (weak classifiers). The difference lies in their learning algorithms. Multilayer Perceptron, linear and non-hnear support vector machines can be classified as kernel methods [46]. The unified formula has the form f(x)=w<^(x)+b (14) where wis real weight , be Ris bias. (.y) function is the type of the classifiers, for MPL (i>(.)=(,(-),,(-),...,j,(-)) , the i-th hidden is defined as ,(x) = h(v,x+d,) . h is the mapping function , generally a hyperbohc function or shape function B is chosen by MPL. For linear support vector machine (.y) is linear function, rather than polynomial function, Gaussian kernel function, and etc. for non-hnear support vector machine. K nearest neighbor model is constructed according to the division of distance relationship of the training feature space, being classified by a majority voting. Multi-class logistic regression (maximum entropy) is the probabilistic choice model with constraints that uncertain contents are treated with equal probability of using entropy maximization to represent. 
Naive Bayes classifier is based on the conditional independence assumption of training samples, learning parameters with the joint probability distribution though Bayes' theorem, then classilying a given sample, by maximizing the posterior probability to get corresponding class. 3. Learning policy, loss and algorithms Decision trees, random forests, extremely randomized trees belong to the maximum likelihood approximation of learning strategies, with the loss of log-likelihood function. Algorithms Algorithms type Algorithms characteristic Learning policy Loss of learning Learning algorithms Classification strategy Decision tree Discriminant Classification tree Regularized maximum likelihood estimation Logarithmic likelihood loss Feature selection, generation, prune IF-THEN policy based on tree spitting Random tree Discriminant Classification tree(based on bagging) Regularized maximum likelihood estimation Logarithmic likelihood loss Building multi-decision tree based on bagging of subsampling Sub-tree voting Extremely random tree Discriminant Classification tree Regularized maximum likelihood estimation Logarithmic likelihood loss Building multi-decision tree Sub-tree voting Multi-class adaboost Discriminant Linear combination of weak classifier(based on decision tree) Addition minimization loss Exponent loss Forward additive algorithm Linear combination of weighted maximum weak classifiers Stochastic gradient boosting Discriminant Linear combination of weak classifier(based on decision tree) Addition minimization loss Exponent loss Stochastic gradient descent algorithm Linear combination of weighted maximum weak classifiers Non-linear Support vector machine (based on libsvm) Discriminant Super-plane separation, kernel trick Minimizing the loss of regular hinge, soft margin maximization Hinge loss Sequential minimal optimization algorithm (SMO) Maximum class of test samples Linear SVM (based on liblinear) Discriminant Super-plane separation Minimizing the loss of regular hinge, soft margin maximization Hinge loss Sequential dual method Maximum weighted test sample K-nearest Discriminant Distance of feature space ^^^^ ^^^ Multiple voting, empirical loss minimization Multi-logistic (Maximum entropy) Discriminant Conditional probabilistic distribution, Log-linear model Regularized maximum likelihood estimation Logistic loss L-BFGS Maximum likelihood estimation Multilayer perceptron Discriminant Super-plane separation Minimization of error separation distance point to the hyperplane Error separation distance point to the hyper-plane Random gradient decrease Maximum weighted test sample Naive Bayesian classifier Generative Joint distribution of feature and class, conditional independent assumption Maximum likelihood estimation. Maximum posterior probability Logarithmic likelihood loss Probabilistic computation Maximum posterior probability Conditional Random Fields Discriminant Conditional probabilistic distribution under observing sequence. Log-linear model Maximum likelihood estimation. Regularized maximum likelihood estimation Logarithmic likelihood loss Random gradient decrease, quasinewton methods Maximum likelihood estimation Table 2: Summary of twelve multi-class methods. The decision tree's optimal strategy is to leam some features through a recursive process and split the tree in accordance with the feature of the training samples. In order to have a better generalization ability, the decision tree model has to be pruning for removal of over-fitting. 
Random forests is based on random sampling based on the approach (bagging) to form more stars forest trees. Extremely randomized trees is randomly selecting the splitting value to build decision trees forests. Multilayer Perceptron, linear and non-hnear support vector machines are to separate hyper-plane. The difference is that multilayer perceptron is to minimize the error hyper-plane, however linear and non-linear support vector machine is a minimal loss of hinge page. Perceptron learning algorithm is stochastic gradient descent, linear support vector machine is a sequential dual method, and non-linear support vector machine is a sequential minimal optimization method. K-nearest neighbor is based on distance of feature space. Multi-class logistic classifier (maximum entropy) learning strategies can be seen as either maximum likehhood estimation, or a logical loss minimization. Loss of function is a linear logarithmic function. The model learning is the maximum likelihood estimation or regularized maximum likelihood estimation under certain training conditions. Optimization algorithm is L-BGFS. Naive Bayesian probability is a typical generative model. The maximum likelihood estimation and Bayesian posterior probabilities is calculated based on the joint probability and conditional independence assumption. 4. Classification strategy Decision tree uses the IF-THEN rules for classification decisions based on the value of the model learned. Random forests and extremely random tree is based on the voting of eveiy single decision tree classification, then taking the highest vote as the final results. Multilayer Perceptron, multi-class logistic regression (maximum entropy), linear and non-hnear support vector machine have the same form of classification decisions class of x=argmax((v^) (j>(x)+h'y (15) The difference lies in the choice of is defined as a formal granule of the context D = (U,A,R), where I , called the intension of formal granule, is an abstract description of common features or properties shared by objects in the extension, which is expressed as / = {;;,4,p. (p(I), called the extension of formal granule, is the maximal set of objects that have all items i (i e I), which is expressed as (p(I) = {oeU\\/ie I,(o,i)e R} . Definition 2.6 Intersection operation of two formal granules is denoted by ®, which is described as follows: There are two formal granules =< /„, ^(4) > and Gp =< Iß, (p{Ip) > , respectively; then we have: G =< /,(p(I) >= G„ ® G„ =< 4 u /„,cpil^)r.(p{L)>. 3 A novel mining model Firstly, we present some definitions, properties, theorems, and corollaries from the Galois connection and granular computing. And propose a novel model for mining frequent closed itemsets based on granule computing. 3.1 Basic concepts Definition 3.1 Itemset support, for a formal context D = {U, A, R), the support of the itemset I is expressed !&support{I)^(p{I) 0 M . Definition 3.2 Frequent itemsets, the itemset I is said to be frequent if the support of / in ß is at least the given minsupport. The set FI of frequent itemsets in D is defined as F/ = {/ c ^ | support{I) > minsupport}. Property 3.1 All subsets of a frequent itemset are frequent; all supersets of an infrequent itemset are infrequent. (Intuitive in [2]) Definition 3.3 Frequent closed itemsets, the closed itemset C is said to be frequent if the support of C in ß is at least the given minsupport. The set FCI of frequent closed itemsets in D is defined as follows: FCI = {Cc A\ C = h(C) A support(C) > minsupport} . 
Property 3.2 Frequent closed itemsets FCI is the subset of frequent itemset FI, namely FCI c FI. Definition 3.4 The smallest frequent closed itemsets, the frequent itemset I is said to be the smallest frequent closed itemset if V/° c I, support{I) < support{I°). The set FC^.^ of the smallest frequent closed itemsets in D is FC^.^ = {/ e F/1 V/° c / A support{I) < support{I°)}. Theorem 3.1 For a formal context D={U,A,R) , if / be a frequent closed itemset, and there is the smallest frequent closed itemset r{(p{I) = (piP)), i.e. y I e FCI ^ BP e FC^,^ a (p(I) = (p{P). Proof LetD I S k , there are two cases as follows: (1) IfV /1 c / (□ /1 i ^-1), and have support{I) < support(I^) ^ V c /j c / a support(I) < support(I°) . Since Ie FCI^ FI, we have Ie FC^,^. Let / ' = /, and we have P e FC^.^ a (p{I) = (p{P). (2) If 3 /; c / (□ /1 i ^-1), and have support{I) = supportil^) =i> (pil) = (pil^). (i) IfV 4 c /1 c / (□ 4 S k-2), and support{I^) < support{I^) ^ V/° c I^ c A support{I^) < support{I°). Since I, c I e FCI c FI, we have I, e . Let P = I,, and we have P e FC^.^ a cpil) = (p{I^) = (p{P). (ii) Otherwise 3 4 c /j c / (□ 4 i ^ - 2), and have support{I^) = support{I^) (p{I^) = (p(I^)... Go on doing until the k"' step, and 3 support(I) = support{I^) A 4 6 FC^.^. Let P= and we have /' e Fc^, A cpii)=(pa)=...=(pii,)=(Pin. Based on definition 2.4 and theorem 3.1, we have: Corollary 3.1 Let/be the smallest frequent closed itemset, i.e. I e FC^.^. And the frequent closed itemset corresponding to I is h(I) = 0)((p(I)). Corollary 3.2 For a formal context D={U,A,R), the set FCI of frequent closed itemsets in D is expressed Theorem 3.2 Let ^ Iß ^ A, where support{I„) = support(Ip ) . Then we have A(4) = and V/c A, Proof. ■: Iß ^ A A support{I^) = support{Iß) □ is said to be the smallest frequent closed granule G^^^ if the intension I of G is the smallest frequent closed itemset. The setFG^^^ of the smallest frequent closed granules is defined as: FG^,„={G=\IeFC^J 3.2 Frequent closed itemsets mining In this section, we propose a novel model for mining frequent closed itemsets based on granule computing. Based on the previous introductions, the following is a formal statement of this model. For a formal context 0= (f/, A,R), discovering all frequent closed itemsets in D can be divided into two steps as follows: (1)According to the minimal support given by user, mining the smallest frequent closed granules set in D. (Details in the steps from (1) to (18) from Section 4.2) (2)Based on the smallest frequent closed granules set, discovering all frequent closed itemsets in D. (Details in the steps from (19) to (21) from Section 4.2) Here the first step is based on definition 3.5, theorem 2.1, and theorem 3.2; the second step refers to Definition 2.4, Proposition 2.1, and Theorem 3.1(Corollary 3.1). From the theory, they provide the demonstration for the novel mining model. 4 The efficient mining algorithm In this section, we use an efficient mining algorithm to describe the novel model, which is denoted by EMFCI. 4.1 Generator function Here, we propose a function for generating the intension of the smallest frequent closed granules. Definition 4.1 Set vector operation □ for two sets is defined as follows: = = —9jbe two sets, and then the set vector operation is expressed as P'^ □ Q □ (0 {?,} {%} ... {PJ. 
The operation □ is the main idea of the generator function: let P and Q be two sets; the generator function is expressed as f(P, Q) = P^T □ Q. The application of f(P, Q) is described in Section 4.2. For example, for a formal context D = (U, A, R), let A be the general itemset {a, b, c}; then we use the set vector operation to generate P(A) (∀p ∈ P(A), p ≠ ∅) as follows:
(1) P(A) = ∅;
(2) I_a = {a} ⇒ P(A) = P(A) ∪ (I_a □ P(A)) = ({a}) □ (∅) = {{a}};
(3) I_b = {b} ⇒ P(A) = P(A) ∪ (I_b □ P(A)) = {{a}} ∪ (({b}) □ (∅ {a})) = {{a}, {b}, {ab}};
(4) I_c = {c} ⇒ P(A) = P(A) ∪ (I_c □ P(A)) = {{a}, {b}, {ab}} ∪ (({c}) □ (∅ {a} {b} {ab})) = {{a}, {b}, {ab}, {c}, {ac}, {bc}, {abc}}.

For a formal context D = (U, A, R), if A is a general itemset, namely a set of Boolean attributes, P(A) is the ordinary power set, with |P(A)| = 2^|A| − 1. But if A is a set of quantitative attributes, P(A) is called the extended power set of A, and |P(A)| is expressed as |P(A)| = ∏_{a∈A} (|V_a| + 1) − 1, where V_a is the preprocessed discrete range of attribute a ∈ A.

4.2 An algorithm for mining frequent closed itemsets

In this section, we describe the efficient algorithm based on the novel model of Section 3 via the following pseudo code.

Algorithm: EMFCI
Input: a formal context D = (U, A, R), the minimal support minsupport.
Output: frequent closed itemsets FCI.
(1) Read D;
(2) Construct FG = {FG_a | a ∈ A ∧ ∀G = <I, φ(I)> ∈ FG_a, I ⊆ A ∧ |I| = 1 ∧ support(I) ≥ minsupport};
(3) F = {I_a ⊆ V_a | ∀v ∈ I_a, G = <{v}, φ({v})> ∈ FG_a ∧ FG_a ∈ FG ∧ a ∈ A};
(4) // V_a is the range of attribute a ∈ A.
(5) For (∀a ∈ F) do begin
(6)   S_a = a □ FC_min; // generate the candidates
(7)   For (∀s ∈ S_a) do begin
(8)     If (∀t_i ∈ N_FI, t_i ⊄ s) then
(9)       Construct G = <s, φ(s)>;
(10)      If (support(s) ≥ minsupport) then
(11)        If (∀t ⊂ s, φ(s) ⊂ φ(t)) then
(12)          Write G = <s, φ(s)> to FG_min;
(13)          Write s to FC_min
(14)        else
(15)          Write s to N_FCmin
(16)      else
(17)        Write s to N_FI
(18) End
(19) For (∀G = <I, φ(I)> ∈ FG_min) do begin
(20)   Write h(I) = ω(φ(I)) to FCI;
(21) End
(22) Answer FCI;

Steps (1) to (18) of the algorithm extract the set of the smallest frequent closed granules, and steps (19) to (21) generate all frequent closed itemsets.
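To make the control flow above concrete, here is a condensed Python sketch of the same idea. It is not the authors' C# implementation: it reuses the extension/closure helpers, the toy context and the set_vector function from the earlier sketches, and it mirrors steps (1)–(21) rather than reproducing every bookkeeping set (N_FCmin is not kept, since candidates are only ever generated from FC_min).

```python
def emfci(context, minsupport):
    all_items = sorted(set().union(*context.values()))
    n = len(context)

    # Steps (2)-(3): frequent 1-items (the granules FG and the set F).
    frequent_items = [i for i in all_items
                      if len(extension({i}, context)) / n >= minsupport]

    fc_min = []      # intensions of the smallest frequent closed granules (FC_min)
    infrequent = []  # N_FI: itemsets already found infrequent, used for pruning

    # Steps (5)-(18): extend the candidates one frequent item at a time.
    for item in frequent_items:
        for cand in set_vector([item], fc_min):           # step (6): S_a = a [] FC_min
            if any(bad <= cand for bad in infrequent):    # step (8): Property 3.1 pruning
                continue
            ext = extension(cand, context)                # step (9)
            if len(ext) / n < minsupport:                 # steps (10), (17)
                infrequent.append(cand)
                continue
            # Step (11): cand is "smallest" only if dropping any single item
            # strictly enlarges the extension (Definition 3.4).
            if all(ext < extension(cand - {x}, context) for x in cand):
                fc_min.append(cand)                       # steps (12)-(13)

    # Steps (19)-(21): the closures of FC_min are the frequent closed itemsets.
    fci = []
    for i in fc_min:
        c = closure(i, context)
        if c not in fci:             # FCI is a set; drop duplicate closures
            fci.append(c)
    return fci

print(emfci(context, minsupport=0.4))
# [{'a'}, {'b'}, {'c'}, {'a','c'}, {'e'}, {'a','e'}, {'b','e'}]  (cf. Table 1 below)
```

Running the sketch on the toy context defined earlier reproduces the result of the worked example in Section 4.3 below.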
4.3 Example and analysis

Here we first provide an example for the algorithm and then analyse the pruning strategies used in the algorithm.

No. Operation
1. FG = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{e},{3,4,5}>} (pruning {d} by Property 3.1 and Definition 3.3)
2. F = {{a},{b},{c},{e}}
3. a = {a} ⇒ S_a = {{a}}; FG_min = {<{a},{1,3,5}>}; FC_min = {{a}}
4. a = {b} ⇒ S_b = {{b},{ab}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>}; FC_min = {{a},{b}} (pruning {ab} by Property 3.1 and Definition 3.3)
5. a = {c} ⇒ S_c = {{c},{ac},{bc}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{ac},{1,5}>}; FC_min = {{a},{b},{c},{ac}} (pruning {bc} by Property 3.1 and Definition 3.3)
6. a = {e} ⇒ S_e = {{e},{ae},{be},{ce},{ace}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{ac},{1,5}>, <{e},{3,4,5}>, <{ae},{3,5}>, <{be},{3,4}>}; FC_min = {{a},{b},{c},{ac},{e},{ae},{be}} (pruning {ce}, {ace} by Property 3.1 and Definition 3.3). Note: the search course is ended, having discovered all the smallest frequent closed granules FG_min.
7. h({a}) = u1 ∩ u3 ∩ u5 = {a}; h({b}) = u2 ∩ u3 ∩ u4 = {b}; h({c}) = u1 ∩ u2 ∩ u5 = {c}; h({ac}) = u1 ∩ u5 = {ac}; h({e}) = u3 ∩ u4 ∩ u5 = {e}; h({ae}) = u3 ∩ u5 = {ae}; h({be}) = u3 ∩ u4 = {be}. Note: based on the set FG_min of the smallest frequent closed granules, all frequent closed itemsets are obtained.
8. Answer FCI = {{a},{b},{c},{ac},{e},{ae},{be}}

Table 1: Frequent closed itemsets mining for minsupport = 40%.

For a formal context D = (U, A, R), where A = {a, b, c, d, e}, U = {u1, u2, u3, u4, u5}, u1 = {acd}, u2 = {bc}, u3 = {abe}, u4 = {be}, u5 = {ace}, and minsupport = 40%, the course of discovering frequent closed itemsets is described in Table 1.

For mining frequent closed itemsets, the algorithm adopts the following pruning strategies: Property 3.1, Definitions 3.3 and 3.4, and Theorem 3.3. They help the algorithm efficiently reduce the search space for mining frequent closed itemsets.

5 Performance and scalability study

In this section, we design the following experiments on different datasets. Firstly, we compare the performance of the algorithm EMFCI with A-close and CLOSET on six different datasets. Secondly, we report the relationships between some parameters of the datasets and the performance of the algorithm EMFCI for mining frequent closed itemsets. Finally, to address the bottleneck of the algorithm EMFCI, we improve it into the algorithm IEMFCI and report its performance on an extended high-dimension dataset to show the scalability of the approach.

There are two original datasets. The first is the Food Mart 2000 retail dataset, which comes with SQL Server 2000 and contains 164,558 records from 1998. Taking the items purchased by the same customer at the same time as one basket, and, because the supports of the bottom-level items are small, generalizing the bottom-level items to the product department, we obtain 34,015 transactions with time stamps. It is a dataset with Boolean attributes. The second comes from Web log data, a real dataset that records the browsing behaviour of students, where the attribute set consists of login time, duration, network flow, ID type, and sex. This dataset, with discrete quantitative attributes, has 296,031 transactions. We then generalize attributes and replicate some attributes or transactions to create the extended datasets described in Table 2, where each dataset can be defined as a formal mining context D = (U, A, R).

All the experiments are performed on an Intel(R) Core(TM)2 Duo CPU (T6570 @ 2.10 GHz) PC with 1.99 GB of main memory, running Microsoft Windows XP Professional. All the programs are written in C# with Microsoft Visual Studio 2008. The algorithms A-close and CLOSET are implemented as described in [3] and [6].
Name Descriptions □ P( A) p U0 Dataset 1 The first original dataset 222 -1 ; 34015 Dataset 2 Replicating dataset 1 three attributes 225 -1 ; 34015 Dataset 3 Replicating dataset 1 four times 222 -1 ; 5*34015 Dataset 4 The second original dataset 5*4*4*14*3-1; 296031 Dataset 5 Replicating dataset 1 one attribute 5*4*4*14*3*5-1; 296031 Dataset 6 Replicating dataset 4 one time 5*4*4*14*3-12*296031 ' Dataset 7 For the Food Mart 2000, we regard the same customer at the same time as a basket and generalize the bottom items to the product subcategory 2J02 -1; 34015 Table 2: The datasets used in the experiments. 5.1 The experiments of performance comparison In this section, for discovering frequent closed itemsets on these different datasets, we compare the algorithm EMFCI with the algorithm A-close and CLOSET from the following two aspects, namely, one is comparing the performances among them as the minimal support is added; the other is comparing them as the number of frequent closed itemsets is added. 1. Testing on the original datasets For the two original datasets, we firstly compare the algorithm EMFCI with the A-close and CLOSET based on the varying minimal support and the number of frequent closed itemsets. These experimental results are described as figure 1, 2, 3, and 4, respectively. Figure 1: Performance comparison with the support on dataset 1. Figure 2: Performance comparison with the number of frequent closed itemsets on dataset 1. Figure 3: Performance comparison with the support on dataset 4. Figure 5: Performance comparison with the support on dataset 2. Figure 6: Performance comparison with the number of frequent closed itemsets on dataset 2. Figure 7: Performance comparison with the support on dataset 3. Figure 8: Performance comparison with the number of frequent closed itemsets on dataset 3. Figure 4: Performance comparison with the number of frequent closed itemsets on dataset 4. Based on the comparison results from figure 1, 2, 3, and 4, we know that the performances of the algorithm EMFCI are better than the A-close and CLOSET. Obviously, the algorithm CLOSET is also superior to the A-close. Hence, we don't compare the EMFCI with the A-close in the following experiments. 2. Testing on the extended datasets We further report the performances of the algorithm EMFCI on the extended datasets. Based on the different minimal support and the number of frequent closed itemsets, we compare the EMFCI with the CLOSET, the experimental results are described as figure 5 to 12. Figure 9: Performance comparison with the support on dataset 5. Figure 10: Performance comparison with the number of frequent closed itemsets on dataset 5. Figure 11: Performance comparison with the support on dataset 6. Figure 12: Performance comparison with the number of frequent closed itemsets on dataset 6. Based on the comparison results from figure 5 to 12, we know that the performances of the algorithm EMFCI are also better than the CLOSET on the datasets with the Boolean or quantitative attributes. 5.2 The relationships between these parameters and performances In this part, we mainly discuss the relationships between the performances and the following parameters: UU [, is the number of objects in the formal mining context D = (U, A, R), in other word, it is the number of transactions in the mining database. 
□ P(I) [, is the number of nonempty power sets for attribute values, called the search space of the algorithm, where I is the smallest frequent closed itemsets from the attribute set A , P(I) is defined as the power set of I. (Refer to section 4.1) Here, the representation of the performances has two kinds of parameters as follows: t(x): is the runtime of algorithm x, which is from input to output for mining frequent closed itemsets. p, is defined as the improved ratio of the runtime between the algorithm EMFCI and CLOSET, which is denoted by the following equation: p = 1 - t(EMF^CI) / t(CLOSET). 1. The relationships between the performances and the search space (1)Reporting the relationships on the extended dataset of the first original dataset For the first original dataset, namely, dataset 1, we test the trend of the performances as the search space is increasing on dataset 2, which is the extended dataset with replicating three attributes of the first dataset. As the search space is varying, the trend of the runtime for the algorithm EMFCI is expressed as figure 13, the trend of the improved ratio between the algorithm EMFCI and CLOSET is expressed as figure 14. Figure 13: The trend of the runtime on dataset 2. Figure 14: The trend of the improved ratio on dataset 2. Based on figure 13, we know that the runtime is added as the search space is increasing. Based on figure 14, we find that the improved ratio is reduced as the search space is increasing. (2)Reporting the relationships on the extended dataset of the second original dataset For the second original dataset, namely, dataset 4, we extend an attribute to get dataset 5, and test the trend of the performances on the dataset. The experimental results are expressed as figure 15 and 16, respectively. Figure 16: The trend of the improved ratio on dataset 5. According to figure 15 and 16, we get the similar comparisons results as above. Hence, we can draw the following conclusions: The runtime of the algorithm EMFCI is added as the search space is increasing; on the contrary, the improved ratio is reduced. Namely, if the search space is increasing, the performances of the algorithm EMFCI will become worse and worse. In other word, the algorithm is not suitable for mining the dataset with too many smallest frequent closed itemsets. 2. The relationships among the performances, the search space and the number of objects (1)Reporting the relationships on the first original dataset and its extended dataset For the first original dataset (dataset 1), and its extended dataset, dataset 3 with replicating its objects four times, we test the trend of the performances as the search space is increasing on the two datasets. As the search space is varying, the trend of the runtime for the algorithm EMFCI is expressed as figure 17, the trend of the improved ratio between the algorithm EMFCI and CLOSET is expressed as figure 18. Figure 17: The trend of the runtime on dataset 1 and 3. Figure 15: The trend of the runtime on dataset 5. Figure 18: The trend of the improved ratio on dataset 1 and 3. Based on figure17, we know that the runtime of the algorithm is added as the search space or the number of objects is increasing. Based on figure18, we find that the improved ratio of the algorithm is reduced as the search space is increasing, but it become relatively stable as the number of objects is increasing. 
(2) Reporting the relationships on the second original dataset and its extended dataset
For the second original dataset, namely dataset 4, we replicate its objects one time to get dataset 6, and test the trend of the performance on datasets 4 and 6. The experimental results are shown in figures 19 and 20, respectively.

Figure 19: The trend of the runtime on datasets 4 and 6.
Figure 20: The trend of the improved ratio on datasets 4 and 6.

According to figures 19 and 20, we draw the same conclusions: the runtime of the algorithm EMFCI grows as the search space or the number of objects increases, and the improved ratio of the algorithm decreases as the search space grows, but becomes relatively stable as the number of objects increases. In other words, the performance of the algorithm EMFCI remains relatively stable as the number of objects increases; hence, it is suitable for mining dynamic transaction datasets.

According to all these experimental results, we can draw the following conclusions:
(1) The performance of the algorithm EMFCI is better than that of the traditional typical algorithms for mining frequent closed itemsets on datasets with Boolean or quantitative attributes.
(2) The runtime of the algorithm EMFCI grows with the search space. If the search space is too large, its performance becomes worse and worse. This is the bottleneck of the algorithm.
(3) The runtime of EMFCI also grows as the number of objects increases.
(4) Compared with the algorithm CLOSET, the improved ratio decreases as the search space grows, but it becomes relatively stable as the number of objects increases. In other words, the performance of EMFCI remains relatively stable as the number of objects increases, so it is suitable for mining dynamic transaction datasets.

5.3 A further discussion for solving the bottleneck of the algorithm

Based on the conclusions in Section 5.2, for the formal mining context D = (U, A, R), if the search space |P(I)| is overlarge, where I (I ⊆ A) is the smallest frequent closed itemset and P(I) is defined as the power set of I, the performance of EMFCI becomes worse and worse. In this section, we adopt a partitioning method to avoid this bottleneck; in other words, the overlarge search space is divided into several smaller search spaces. The theoretical basis can be described as follows. Let I = {a^1, a^2, ..., a^n} ⊆ A; then we have |P(I)| = ∏_{i=1}^{n} (|V_{a^i}| + 1) − 1, namely

|P(I)| + 1 = ∏_{i=1}^{n} (|V_{a^i}| + 1) = (|V_{a^1}| + 1) · (|V_{a^2}| + 1) · ... · (|V_{a^n}| + 1),

with m_1 + m_2 + ... + m_k = n. Obviously, we also have

|P(I)| + 1 = (|P(I_1)| + 1) · (|P(I_2)| + 1) · ... · (|P(I_k)| + 1),

where I_1 = {a^1, ..., a^{m_1}}, I_2 = {a^{m_1+1}, a^{m_1+2}, ..., a^{m_1+m_2}}, ..., I_k = {a^{m_1+...+m_{k−1}+1}, ..., a^n}.

In this paper we let |P(I_i)| ≤ λ = 2^n. If λ is too big, the method has the same bottleneck; if λ is too small, the cost of partitioning the search space is expensive. For these two cases, the performance is shown in figure 23. The partitioning method is used in the algorithm EMFCI, which is then called the improved EMFCI, i.e. IEMFCI.

5.3.1 Example
For the example in Section 4.3, we use the algorithm IEMFCI to discover frequent closed itemsets; the course is described as follows, where λ = 4. (Note: λ = 4 is used in the example, λ = 2^n is used in the following experiments.)
Step 1. FG = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{e},{3,4,5}>}.
Step 2. F = {{a},{b},{c},{e}}, |P(F)| = 15 > λ = 4.
Step 3.
Partitioning the search space, get two search spaces/^ = {{a},{/?}},={{c},{e}}, where|| P(f;)||<4 . Step4. For the first search space F^ = {{a},{b}} , have ©a={a}^5,={{a}} FGi, = {< {a},{l,3,5} >}, FCi, = {{a}} ; ®a={b}^S^={{b},{ab}} FG\„ = {< {a},{l, 3,5} >, < {b},{2,3,4} >} , FC\„={{a},{b}}. For the second search space ={{c},{e}}, have ©a={c}^5,={{c}} FGl„ ={<{c},{l,2,5}>}, FC\„ ={{c}}; (Da={e}^5^={{e},{ce}} FGln = {< {C},{1,2,5} >, < {e},{3,4,5} >} , FCi,={{c},{e}}. Step5. F = {FCl.^,FClJ , repeating the step2, where || P(F) ||= 15 > 4 , but || F\\=2 , the partitioning operation must be ended; otherwise, the algorithm need to continue to partition the search space. = {< {a},{1,3,5} >, < {bU2,3,4} >} , {e} (0 {a} {Ä}) = Items Time complexity Space complexity A-close 0(11 0(||C||/||4||) CLOSET 0(11 0(11 C||) lEMFCI 0{{L/k+iy\\c\\) 0(\\C\\/k-\\A\\) Figure 21: Performance comparison with the lower support on dataset 7. {{c},{ac},{bc},{e},{ae},{be}}; = {< {a}, {1,3,5} >, < {b},{2,3,4} >} <{c},{l,2,5}>,<{ac},{l,5}>,<{e},{3,4,5}>, <{ae},{3,5} >,<{Z?e},{3,4}>} FC^,„={{aUbUcUac},{eUaeUbe}} The rest of steps are the same as the example in section 4.3. The algorithm lEMFCl reduces the checking of itemset{ ace}, but adds the task of partitioning. As the number of tiansactions is lesser, the example does not show its advantage, please see the experiments in section 5.3.3. Here, the example only describes the execution course of lEMFCl. 5.3.2 Comparisons of the time and space complexity For D=(U,A,R) , let C be a set of frequent closed itemsets, and let L be the average length of frequent closed itemsets, ^ > 2 is a parameter with partitioning the search space. The comparisons are expressed as table 3. Figure 22: Performance comparison with the higher support on dataset 7. Then, for the improved algorithm lEMFCl, we adopt different parameters X to test its trend of performance, where X = l\ /1 = 2" and /1 = 2^^. The comparison result is expressed as figure 23, where lEMFCl (Ä= p(2,n)) is the improved algorithm lEMFCl when the parameter of partitioning the search space is A = p(2, n) = 2". Table 3: Comparisons of the time and space complexity. 5.3.3 Test on the high dimension datasets In this section, to show the scalability of the algorithm EMFCl, firstiy, we compare the improved algorithm lEMFCl with EMFCl, A-close and CLOSET on the high dimension dataset (dataset 7 as table 1), which is an extended dataset based on the first original dataset. The comparison results are expressed as figure 21 and 22, where the parameter p{2, in) = 2" on the abscissa shows the search space □ P(I) tof the given support. Figure 23: The trend of performance with the different parameter on dataset 7. Based on these comparisons, we draw the following conclusions: Firstly, the improved algorithm lEMFCl is better tiian tiie algorithms EMFCl, A-close and CLOSET. Secondly, the improved algorithm lEMFCl gets rid of the bottieneck in the algorithms EMFCl, especially, when the search space □ P(I) [is overlarge, the advantage of lEMFCl is very distinct. Finally, for the improved algorithm lEMFCl, the parameter of partitioning the search space is not too big, but it is not too small. 6 Conclusion In this paper, for the shortcomings of typical algorithms for mining frequent closed itemsets, we propose an efficient algorithm for mining frequent closed itemsets, which is based on Galois connection and granular computing. 
We present the notion of smallest frequent closed granule to reduce the costed 1/0 for discovering frequent closed itemsets. And we propose a connection function for generating the smallest frequent closed itemsets in the enlarged frequent 1-item manner to reduce the costed CPU and the occupied main memory. But the number of the smallest frequent closed itemsets is too many, the performances of the algorithm become worse and worse, so we further discuss how to solve the bottleneck, namely, propose its improved algorithm on high dimension dataset. The algorithm is also suitable for mining dynamic transaction datasets. Acknowledgement The authors would like to thank the anonymous reviewers for the constructive comment. This work was a project supported by Chongqing Cutting-edge and Applied Foundation Research Program (Grant No. cstc2014jcyjA40035). And it was also supported by Scientific and Technological Research Program of Chongqing Three Gorges University (Grant No.13ZD20). References [1] R. Agrawal, T. Imielinski, and A. Swami (1993). Mining association rules between sets of items in large databases. ^n Proceedings of the 1993 ACM SIGMOD Int'l Conference on Management of Dat^a, Washington DC, USA, pp. 207-216. [2] R. Agrawal and R. Srikant (1994). Fast algorithms for mining association rules. In Proceedings of the 20^ Int'l Conference on Very large Data Bases, Santiago, Chile, pp. 487-499. [3] N. Pasquier, Y. Bastide and R. Taouil et al. (1999). Discovering frequent closed itemsets for association rules. In Proceedings of the 7t^h Int'l Conference on Database Theory, Jerusalem, Israel, January, pp. 398-416. [4] Mohammed J. Zaki, Ching-Jui Hsiao (1999). Charm: An efficient algorithm for closed association rule mining. Technical Report 99-10, Computer Science, Rensselaer Polytechnic Institute. [5] J. Han, J. Pei, and Y. Yin (2000). Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD Int'l Conference on Management of Data, New York, USA, pp. 1-12. [6] J. Pei, J. Han, and R. Mao (2000). CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In Proceedings of the 2000 ACM SIGMOD Workshop on Research Issues in Dat^a Mining and Knowledge Discovery. Dallas, Texas, USA, pp. 2130. [7] D. Burdick, M. Calimlim, and J. Gehrke (2001). MAFIA: A maximal frequent item set algorithm for transactional databases. In Proceedings of the 17th Int'l Conference on Dat^a Engineering. Heidelberg, pp. 443-452. [8] J. Y. Wang, J. Han, and J. Pei (2003). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings of the 9th ACM SIGKDD Int'l Conference on Knowledge Discovery cin^d Dat^a Mining, Washington, DC, pp. 236 - 245. [9] C. Lucchese, S. Orlando, and R. Perego (2006). Fast and memory efficient mining of frequent closed itemsets. IEEE Trans on Knowledge and Dada Engineering, vol. 18, no. 1, pp. 21- 36. [10] R. Singh, T. Johnsten, and V. Raghavan et al (2010). Efficient Algorithm for Discovering Potential Interesting Patterns with Closed Itemsets. Proceedings of the 2010 IEEE Int'l Conference on Granular Computing, pp. 414 - 419. [11] F. Nori, M. Deypir, and M. Hadi et al. (2011). A new sliding window based algorithm for frequent closed itemset mining over data streams. In Proceedings of the 1st Int^'l Conference on Computer and Knowledge Engineering, IEEE Press, pp. 249253. [12] Guang-Peng Chen, Yu-Bin Yang, and Yao Zhang (2012). MapReduce-Based Balanced Mining for Closed Frequent Itemset. 
^n Proceedings of the 2012 IEEE 19th Int'l Conference on Web Services, IEEE Press, pp. 652 - 653. [13] M. Sreedevi, Reddy L.S.S. (2013). Mining regular closed patterns in transactional databases. In Proceedings of the 2013 7th Int^'l Conference on Intelligent Systems and Control, IEEE Press, pp. 380 - 383. [14] Yu-quan Z. and Yu-qing S. (2007). Research on an Algorithm for Mining Frequent Closed Itemsets. Journal of Computer Research and Development, vol. 44, no. 7, pp. 1177-1183. [15] Shengwei L., Lingsheng L., and Chong H. (2009). Mining closed frequent itemset based on FP-Tree. In Proceedings of the IEEE Int'l Conference on Granular Computing, IEEE Press, pp. 354 - 357. [16] Wachiramethin J., Werapun J. (2009). BPA: A Bitmap-Prefix-tree Array data structure for frequent closed pattern mining. ^n Proceedings of the 2009 Int'l Conference on Machine Learning and Cybernetics, IEEE Press, vol .1, pp. 154 - 160. [17] Z. Pawlak (1982). Rough sets. Journal of computing and information science in Engineering, no.11, pp. 341-356. [18] R. Wille (1982). Restructuring lattice theory: an approach based on hierarchies of concepts. In: I. Rival (Ed.), Ordered Sets, Reidel, Dordrecht-Boston, pp. 445-470. [19] B. Ganter, R. Wille (1999). Formal Concept Analysis, Mathematic Foundations. Springer, Berlin. [20] J. Poelmans, D. I. Ignatov, and S. O. Kuznetsov et al (2013). Formal concept analysis in knowledge processing: A survey on applications. Expert Systems with Applications, vol.40, no. 16, pp. 6538-6560. [21] M. W. Shao, Y. Leung (2014). Relations between granular reduct and dominance reduct in formal contexts. Knowledge-Based Systems, vol.65, pp. 111. [22] Wei-Zhi W., Yee Leung, Ju-Sheng M. (2009). Granular Computing and Knowledge Reduction in Formal Contexts. IEEE Transactions on Knowledge and Data Engineering, vol.21, no.10, pp. 1461-1474. [23] R. Belohlavek, B. D. Baets, J. Konecny (2014). Granularity of attributes in formal concept analysis. Information Sciences, vol.260, pp.149-170. [24] Hobbs J. R. (1985). Granularity. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, San Francisco, USA, pp. 432-435. [25] Giunchglia F., Walsh T. (1992). A theory of abstraction. Artificial Intelligence, vol. 57, no. 2-3. pp. 323-389. [26] Yao Y.Y. (2004). A partition model of granular computing. Lecture Notes in Computer Science Transactions on Rough Sets, vol. 3100, pp.232-253. [27] T. R. Qiu, X. Q. Chen, and Q. Liu et al. (2010). Granular Computing Approach to Finding Association Rules in Relational Database. International Journal of intelligent sy^stems, no. 25, pp. 165-179. [28] G. Fang, Y. Wu (2013). Frequent Spatiotemporal Association Patterns Mining Based on Granular Computing. Informatica, vol.37, no.4, pp.443-453. [29] Pawlak Z. (1998). Granularity of knowledge, indiscernibility and rough sets. In Proceedings of IEEE Int Conf on Fuzzy Systems, IEEE Press, Anchorage, AK, pp.106-110. [30] Zhang L., Zhang B. (2003). The quotient space theory of problem solving. Lecture Notes in Computer Science, vol. 2639, pp. 11-15. 
An Approach for Context-based Reasoning in Ambient Intelligence
Hristijan Gjoreski
Department of Intelligent Systems, Jožef Stefan Institute
Jamova 39, Ljubljana, Slovenia
Thesis Summary
Keywords: machine learning, context, reasoning, artificial intelligence, ambient intelligence, sensors
Received: February 2, 2015
This paper presents a summary of the doctoral dissertation of the author, which addresses the task of context-based reasoning in ambient intelligence.
Povzetek: Prispevek predstavlja povzetek doktorske disertacije avtorja, ki obravnava kontekstno sklepanje v ambientalni inteligenci.

1 Introduction

Ambient intelligence (AmI) is a scientific field that refers to environments consisting of smart devices (sensors and actuators) that can sense and respond to the presence of people [1]. The availability of small, wearable, low-cost, power-efficient sensors, combined with advanced signal processing and information extraction, is driving a revolution in the AmI domain. This revolution has enabled novel approaches and technologies for accurate measurements in the areas of healthcare, enhanced sports and fitness training, and life-style monitoring.

Early AmI systems included a single type of sensor, which made it possible to develop the first proof-of-concept applications. As the field has matured, these systems have gained additional sensors, resulting in the development of advanced and more accurate multi-sensor techniques and applications. However, combining multiple sources of information from multiple sensors is a challenging task. The first issue is that each sensor has its own technical configuration (for example, the data sampling rate) and requires different data-processing techniques, first to align the different sensor data and later to extract useful information. The second issue is that, even when the multi-source data is aligned, it can be challenging to find an intelligent way to combine the multi-source information in order to reason about the user or the environment. While several approaches for combining multiple sources of information and knowledge have been developed (such as Kalman filters, ensemble learning, and co-training), these approaches have not been specialized for AmI tasks.

The doctoral dissertation [2] addresses the problem of combining multiple sources of information extracted from sensor data by proposing a novel context-based approach called CoReAmI (Context-based Reasoning in Ambient Intelligence). In particular, CoReAmI creates a multi-view perspective in which each source of information is used as a context separately.

2 The CoReAmI approach

The CoReAmI approach is shown in Figure 1. At the top are the sensors {s1, ..., sm}, which provide the raw data. The multi-sensor data is usually represented by multivariate time series with mixed sampling rates, which are the input to CoReAmI. CoReAmI consists of three phases: (A) context extraction, (B) context modeling and (C) context aggregation.

Figure 1: The CoReAmI approach.

In the first phase (A), the raw sensor data are acquired and multiple contexts {c1, ..., cn} are extracted using different types of techniques: data-preprocessing techniques, data synchronization, data segmentation, etc. In CoReAmI, a context represents information about the user which is extracted from the sensor data, e.g., the user's activity extracted from wearable accelerometer data. This phase is similar to the feature-extraction phase in machine learning (ML); the contexts in CoReAmI are features that represent context information. Therefore, each context has values (v_c), which can be numerical or categorical (e.g., "sitting" for the "activity" context).

The second phase (B) contains the main logic of the CoReAmI approach. In this phase the context-based modeling of the problem (activity, fall, energy expenditure, etc.) is performed using the contexts defined in the previous phase. First, context-based partitioning of the dataset is performed, i.e., the dataset is partitioned according to each context and its values. Then, for each context value, a reasoning model is constructed from its reasoning data, i.e., the subset of the whole dataset that has that particular context value. For example, the reasoning data for the "sitting" model is constructed using the data instances that contain the value "sitting" for the activity context. In this way, the approach considers multiple views on the data, using each of the features as a context.

In the final phase (C), for a given test data instance, the decisions obtained from each context individually are aggregated and the final decision is provided. Different aggregation techniques can be used in this phase, e.g., majority voting, plurality voting, averaging, choosing the median, and similar.

CoReAmI is a general approach for context-based reasoning and can be adapted to a range of tasks in AmI by adapting each of the phases to the particular task.
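The three phases can be condensed into a small, purely illustrative sketch; this is my own simplification, not the dissertation's implementation. Every feature is treated as a context, the training set is partitioned by each (context, value) pair, a per-partition model reasons from the remaining features, and the per-context decisions are merged by majority voting. The base model here is a deliberately simple 1-nearest-neighbour rule over numeric features, chosen only to keep the sketch self-contained; in CoReAmI any reasoning model and aggregation rule can take its place. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_coreami(rows, labels, contexts):
    """rows: list of dicts {feature name: numeric value}; contexts: features used as contexts."""
    partitions = defaultdict(list)
    for x, y in zip(rows, labels):
        for c in contexts:
            partitions[(c, x[c])].append((x, y))   # phase B: context-based partitioning
    return partitions

def _nn_label(x, data, skip):
    """Label of the training instance nearest to x, ignoring the partitioning context."""
    feats = [f for f in x if f != skip]
    def dist(other):
        return sum((x[f] - other[f]) ** 2 for f in feats)
    return min(data, key=lambda xy: dist(xy[0]))[1]

def predict_coreami(x, partitions, contexts):
    """Phase C: one decision per context (one view), merged by majority vote."""
    votes = [_nn_label(x, partitions[(c, x[c])], skip=c)
             for c in contexts if (c, x[c]) in partitions]
    return Counter(votes).most_common(1)[0][0] if votes else None

# Tiny illustration with made-up data: two contexts ("activity", "location"),
# classifying whether a situation is a fall (1) or not (0).
rows = [{"activity": 0, "location": 1, "accel": 0.1},
        {"activity": 0, "location": 2, "accel": 0.2},
        {"activity": 1, "location": 1, "accel": 2.5},
        {"activity": 1, "location": 2, "accel": 2.7}]
labels = [0, 0, 1, 1]
model = train_coreami(rows, labels, contexts=["activity", "location"])
print(predict_coreami({"activity": 1, "location": 1, "accel": 2.6},
                      model, ["activity", "location"]))   # 1
```

The aggregation step is where the multi-view perspective pays off: each context contributes an independent decision, so a context whose partition happens to be uninformative for a given instance is simply outvoted by the others.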
3 Case studies

The feasibility of the CoReAmI approach was shown in three AmI domains that have emerged as essential building blocks of AmI: activity recognition, energy-expenditure estimation, and fall detection.

The first problem domain, Activity Recognition (AR), can generally be defined as the process of recognizing activities through the analysis of sensor data. In recent years AR has gained a lot of research attention, because it provides one of the basic pieces of information about a person monitored by an AmI system. In our study, we examined the state-of-the-art approaches in AR and observed that it is almost impossible to distinguish standing from sitting using a single accelerometer placed on the torso. However, by adapting and applying the CoReAmI approach, we managed to significantly improve the recognition of these two activities, achieving 86% accuracy, which is 24 percentage points better than a conventional ML approach [4].

Human Energy-Expenditure (EE) estimation is the process of calculating the amount of energy expended while performing everyday activities. It directly reflects the level of physical activity, which makes it important for sports training, weight control, management of metabolic disorders (e.g., diabetes), and other health goals. We adapted and applied the CoReAmI approach to estimate human energy expenditure using multiple sensor data (accelerometer, heart rate, breath rate, etc.). CoReAmI significantly improved the estimation performance compared to conventional ML approaches and to approaches based on a single context (such as the activity of the user). Additionally, CoReAmI provided better energy-expenditure estimates than the BodyMedia device, a state-of-the-art commercial device for energy-expenditure estimation [5].

The third problem domain to which we applied the CoReAmI approach is fall detection (FD). FD is an important application in AmI because falls are among the most critical health problems for the elderly [3].
We adapted and applied the CoReAmI approach to detect human falls. CoReAmI significantly improved the detection performance compared to conventional approaches, such as: threshold-based approaches and approaches based only on ML [6] . 4 Conclusion This paper summarized the dissertation [2] and presented the main idea and findings of the same. The proposed CoReAmI approach was adapted and tested on three problem domains. The results show that CoReAmI significantly outperforms the competing approaches in each of the domains. This is mainly due to the fact that, by extracting multiple sources of information and combining them by using each source of information as a context, a multi-view perspective is created, which leads to better performance than with conventional approaches. References [1] Ducatel K, Bogdanowicz M, Scapolo F, Leijten J, Burgelman J. Scenarios for ambient intelligence in 2010. Technical report. Retrieved September, 2014 from: http://cordis.europa.eu/ist/istagreports.htm [2] Gjoreski H. Context-based Reasoning in Ambient Intelligence, PhD Thesis, IPS Jožef Stefan, Ljubljana, Slovenia, January, 2015. [3] Friedman SM, Munoz B, West SK, Rubin GS, Fried LP. Falls and Fear of Falling: Which Comes First? A Longitudinal Prediction Model Suggests Strategies for Primary and Secondary Prevention. Journal of the American Geriatrics Society, 2002; 1329-1335. [4] Gjoreski H, Kozina S, Luštrek M, Gams M. Using multiple contexts to distinguish standing from sitting with a single accelerometer. European Conference on Artificial Intelligence (ECAI), 2014. [5] Gjoreski H, Kaluža B, Gams M, Milic R, Luštrek M. Ensembles of multiple sensors for human energy expenditure estimation. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing (UbiComp), 2013; 359-362. [6] Gjoreski H, Gams M, Luštrek M. Context-based fall detection and activity recognition using inertial and location sensors. Journal of Ambient Intelligence and Smart Environments (JAISE), 2014, 6(4); 419-433. JOŽEF STEFAN INSTITUTE Jožef Stefan (1835-1893) was one of the most prominent physicists of the 19th century. Born to Slovene parents, he obtained his Ph.D. at Vienna University, where he was later Director of the Physics Institute, Vice-President of the Vienna Academy of Sciences and a member of several scientific institutions in Europe. Stefan explored many areas in hydrodynamics, optics, acoustics, electricity, magnetism and the kinetic theory of gases. Among other things, he originated the law that the total radiation from a black body is proportional to the 4th power of its absolute temperature, known as the Stefan-Boltzmann law. The Jožef Stefan Institute (JSI) is the leading independent scientific research institution in Slovenia, covering a broad spectrum of fundamental and applied research in the fields of physics, chemistry and biochemistry, electronics and information science, nuclear science technology, energy research and environmental science. The Jožef Stefan Institute (JSI) is a research organisation for pure and applied research in the natural sciences and technology. Both are closely interconnected in research departments composed of different task teams. Emphasis in basic research is given to the development and education of young scientists, while applied research and development serve for the transfer of advanced knowledge, contributing to the development of the national economy and society in general. 
At present the Institute, with a total of about 900 staff, has 700 researchers, about 250 of whom are postgraduates, around 500 of whom have doctorates (Ph.D.), and around 200 of whom have permanent professorships or temporary teaching assignments at the Universities. In view of its activities and status, the JSI plays the role of a national institute, complementing the role of the universities and bridging the gap between basic science and applications. Research at the JSI includes the following major fields: physics; chemistry; electronics, informatics and computer sciences; biochemistry; ecology; reactor technology; applied mathematics. Most of the activities are more or less closely connected to information sciences, in particular computer sciences, artificial intelligence, language and speech technologies, computer-aided design, computer architectures, biocybernetics and robotics, computer automation and control, professional electronics, digital communications and networks, and applied mathematics. ranean Europe, offering excellent productive capabilities and solid business opportunities, with strong international connections. Ljubljana is connected to important centers such as Prague, Budapest, Vienna, Zagreb, Milan, Rome, Monaco, Nice, Bern and Munich, all within a radius of 600 km. From the Jožef Stefan Institute, the Technology park "Ljubljana" has been proposed as part of the national strategy for technological development to foster synergies between research and industry, to promote joint ventures between university bodies, research institutes and innovative industry, to act as an incubator for high-tech initiatives and to accelerate the development cycle of innovative products. Part of the Institute was reorganized into several hightech units supported by and connected within the Technology park at the Jožef Stefan Institute, established as the beginning of a regional Technology park "Ljubljana". The project was developed at a particularly historical moment, characterized by the process of state reorganisation, privatisation and private initiative. The national Technology Park is a shareholding company hosting an independent venture-capital institution. The promoters and operational entities of the project are the Republic of Slovenia, Ministry of Higher Education, Science and Technology and the Jožef Stefan Institute. The framework of the operation also includes the University of Ljubljana, the National Institute of Chemistry, the Institute for Electronics and Vacuum Technology and the Institute for Materials and Construction Research among others. In addition, the project is supported by the Ministry of the Economy, the National Chamber of Economy and the City of Ljubljana. Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia Tel.:+386 1 4773 900, Fax.:+386 1 251 93 85 WWW: http://www.ijs.si E-mail: matjaz.gams@ijs.si Public relations: Polona Strnad The Institute is located in Ljubljana, the capital of the independent state of Slovenia (or S9nia). The capital today is considered a crossroad between East, West and Mediter- INFORMATICA AN INTERNATIONAL JOURNAL OF COMPUTING AND INFORMATICS INVITATION, COOPERATION Submissions and Refereeing Please submit a manuscript to: http://www.informatica.si/Editors/ PaperUpload.asp. At least two referees outside the author's country will examine it, and they are invited to make as many remarks as possible from typing errors to global philosophical disagreements. The chosen editor will send the author the obtained reviews. 
If the paper is accepted, the editor will also send an email to the managing editor. The executive board will inform the author that the paper has been accepted, and the author will send the paper to the managing editor. The paper will be published within one year of receipt of email with the text in Informat-ica MS Word format or Informatica I^TeX format and figures in .eps format. Style and examples of papers can be obtained from http://www.informatica.si. Opinions, news, calls for conferences, calls for papers, etc. should be sent directly to the managing editor. QUESTIONNAIRE Send Informatica free of charge Yes, we subscribe Please, complete the order form and send it to Dr. Drago Torkar, Informatica, Institut Jožef Stefan, Jamova 39, 1000 Ljubljana, Slovenia. E-mail: drago.torkar@ijs.si Since 1977, Informatica has been a major Slovenian scientific journal of computing and informatics, including telecommunications, automation and other related areas. In its 16th year (more than twentyone years ago) it became truly international, although it still remains connected to Central Europe. The basic aim of In-formatica is to impose intellectual values (science, engineering) in a distributed organisation. Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations. Editing and refereeing are distributed. Each editor can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the Refereeing Board. Informatica is free of charge for major scientific, educational and governmental institutions. Others should subscribe (see the last page of Informatica). ORDER FORM - INFORMATICA Name: ............................... Title and Profession (optional): ......... Home Address and Telephone (optional): Office Address and Telephone (optional): E-mail Address (optional): ............. Signature and Date: ................... Informatica WWW: http://www.informatica.si/ Referees from 2008 on: A. Abraham, S. Abraham, R. Accornero, A. Adhikari, R. Ahmad, G. Alvarez, N. Anciaux, R. Arora, I. Awan, J. Azimi, C. Badica, Z. Balogh, S. Banerjee, G. Barbier, A. Baruzzo, B. Batagelj, T. Beaubouef, N. Beaulieu, M. ter Beek, P. Bellavista, K. Bilal, S. Bishop, J. Bodlaj, M. Bohanec, D. Bolme, Z. Bonikowski, B. Boškovic, M. Botta, P. Brazdil, J. Brest, J. Brichau, A. Brodnik, D. Brown, I. Bruha, M. Bruynooghe, W. Buntine, D.D. Burdescu, J. Buys, X. Cai, Y. Cai, J.C. Cano, T. Cao, J.-V. Capella-Hernändez, N. Carver, M. Cavazza, R. Ceylan, A. Chebotko, I. Chekalov, J. Chen, L.-M. Cheng, G. Chiola, Y.-C. Chiou, I. Chorbev, S.R. Choudhary, S.S.M. Chow, K.R. Chowdhury, V. Christlein, W. Chu, L. Chung, M. Ciglaric, J.-N. Colin, V. Cortellessa, J. Cui, P. Cui, Z. Cui, D. 
Cutting, A. Cuzzocrea, V. Cvjetkovic, J. Cypryjanski, L. CCehovin, D. CCerepnalkoski, I. CCosic, G. Daniele, G. Danoy, M. Dash, S. Datt, A. Datta, M.-Y. Day, F. Debili, C.J. Debono, J. Dedic, P. Degano, A. Dekdouk, H. Demirel, B. Demoen, S. Dendamrongvit, T. Deng, A. Derezinska, J. Dezert, G. Dias, I. Dimitrovski, S. Dobrišek, Q. Dou, J. Doumen, E. Dovgan, B. Dragovich, D. Drajic, O. Drbohlav, M. Drole, J. Dujmovic, O. Ebers, J. Eder, S. Elaluf-Calderwood, E. Engström, U. riza Erturk, A. Farago, C. Fei, L. Feng, Y.X. Feng, B. Filipic, I. Fister, I. Fister Jr., D. Fišer, A. Flores, V.A. Fomichov, S. Forli, A. Freitas, J. Fridrich, S. Friedman, C. Fu, X. Fu, T. Fujimoto, G. Fung, S. Gabrielli, D. Galindo, A. Gambarara, M. Gams, M. Ganzha, J. Garbajosa, R. Gennari, G. Georgeson, N. Gligoric, S. Goel, G.H. Gonnet, D.S. Goodsell, S. Gordillo, J. Gore, M. Grcar, M. Grgurovic, D. Grosse, Z.-H. Guan, D. Gubiani, M. Guid, C. Guo, B. Gupta, M. Gusev, M. Hahsler, Z. Haiping, A. Hameed, C. Hamzagebi, Q.-L. Han, H. Hanping, T. Härder, J.N. Hatzopoulos, S. Hazelhurst, K. Hempstalk, J.M.G. Hidalgo, J. Hodgson, M. Holbl, M.P. Hong, G. Howells, M. Hu, J. Hyvärinen, D. Ienco, B. Ionescu, R. Irfan, N. Jaisankar, D. Jakobovic, K. Jassem, I. Jawhar, Y. Jia, T. Jin, I. Jureta, D. Juricic, S. K, S. Kalajdziski, Y. Kalantidis, B. Kaluža, D. Kanellopoulos, R. Kapoor, D. Karapetyan, A. Kassler, D.S. Katz, A. Kaveh, S.U. Khan, M. Khattak, V. Khomenko, E.S. Khorasani, I. Kitanovski, D. Kocev, J. Kocijan, J. Kollär, A. Kontostathis, P. Korošec, A. Koschmider, D. Košir, J. Kovac, A. Krajnc, M. Krevs, J. Krogstie, P. Krsek, M. Kubat, M. Kukar, A. Kulis, A.P.S. Kumar, H. Kwasnicka, W.K. Lai, C.-S. Laih, K.-Y. Lam, N. Landwehr, J. Lanir, A. Lavrov, M. Layouni, G. Leban, A. Lee, Y.-C. Lee, U. Legat, A. Leonardis, G. Li, G.-Z. Li, J. Li, X. Li, X. Li, Y. Li, Y. Li, S. Lian, L. Liao, C. Lim, J.-C. Lin, H. Liu, J. Liu, P. Liu, X. Liu, X. Liu, F. Logist, S. Loskovska, H. Lu, Z. Lu, X. Luo, M. Luštrek, I.V. Lyustig, S.A. Madani, M. Mahoney, S.U.R. Malik, Y. Marinakis, D. Marincic, J. Marques-Silva, A. Martin, D. Marwede, M. Matijaševic, T. Matsui, L. McMillan, A. McPherson, A. McPherson, Z. Meng, M.C. Mihaescu, V. Milea, N. Min-Allah, E. Minisci, V. Mišic, A.-H. Mogos, P. Mohapatra, D.D. Monica, A. Montanari, A. Moroni, J. Mosegaard, M. Moškon, L. de M. Mourelle, H. Moustafa, M. Možina, M. Mrak, Y. Mu, J. Mula, D. Nagamalai, M. Di Natale, A. Navarra, P. Navrat, N. Nedjah, R. Nejabati, W. Ng, Z. Ni, E.S. Nielsen, O. Nouali, F. Novak, B. Novikov, P. Nurmi, D. Obrul, B. Oliboni, X. Pan, M. Pancur, W. Pang, G. Papa, M. Paprzycki, M. Paralic, B.-K. Park, P. Patel, T.B. Pedersen, Z. Peng, R.G. Pensa, J. Perš, D. Petcu, B. Petelin, M. Petkovšek, D. Pevec, M. Piculin, R. Piltaver, E. Pirogova, V. Podpecan, M. Polo, V. Pomponiu, E. Popescu, D. Poshyvanyk, B. Potocnik, R.J. Povinelli, S.R.M. Prasanna, K. Pripužic, G. Puppis, H. Qian, Y. Qian, L. Qiao, C. Qin, J. Que, J.-J. Quisquater, C. Rafe, S. Rahimi, V. Rajkovic, D. Rakovic, J. Ramaekers, J. Ramon, R. Ravnik, Y. Reddy, W. Reimche, H. Rezankova, D. Rispoli, B. Ristevski, B. Robic, J.A. Rodriguez-Aguilar, P. Rohatgi, W. Rossak, I. Rožanc, J. Rupnik, S.B. Sadkhan, K. Saeed, M. Saeki, K.S.M. Sahari, C. Sakharwade, E. Sakkopoulos, P. Sala, M.H. Samadzadeh, J.S. Sandhu, P. Scaglioso, V. Schau, W. Schempp, J. Seberry, A. Senanayake, M. Senobari, T.C. Seong, S. Shamala, c. shi, Z. Shi, L. Shiguo, N. Shilov, Z.-E.H. Slimane, F. Smith, H. Sneed, P. Sokolowski, T. Song, A. Soppera, A. 
Sorniotti, M. Stajdohar, L. Stanescu, D. Strnad, X. Sun, L. Šajn, R. Šenkerfk, M.R. Šikonja, J. Šilc, I. Škrjanc, T. Štajner, B. Šter, V. Štruc, H. Takizawa, C. Talcott, N. Tomasev, D. Torkar, S. Torrente, M. Trampuš, C. Tranoris, K. Trojacanec, M. Tschierschke, F. De Turck, J. Twycross, N. Tziritas, W. Vanhoof, P. Vateekul, L.A. Vese, A. Visconti, B. Vlaovic, V. Vojisavljevic, M. Vozalis, P. Vracar, V. Vranic, C.-H. Wang, H. Wang, H. Wang, H. Wang, S. Wang, X.-F. Wang, X. Wang, Y. Wang, A. Wasilewska, S. Wenzel, V. Wickramasinghe, J. Wong, S. Wrobel, K. Wrona, B. Wu, L. Xiang, Y. Xiang, D. Xiao, F. Xie, L. Xie, Z. Xing, H. Yang, X. Yang, N.Y. Yen, C. Yong-Sheng, J.J. You, G. Yu, X. Zabulis, A. Zainal, A. Zamuda, M. Zand, Z. Zhang, Z. Zhao, D. Zheng, J. Zheng, X. Zheng, Z.-H. Zhou, F. Zhuang, A. Zimmermann, M.J. Zuo, B. Zupan, M. Zuqiang, B. Žalik, J. Žižka, Informatica An International Journal of Computing and Informatics Web edition of Informatica may be accessed at: http://www.informatica.si. Subscription Information Informatica (ISSN 0350-5596) is published four times a year in Spring, Summer, Autumn, and Winter (4 issues per year) by the Slovene Society Informatika, Litostrojska cesta 54, 1000 Ljubljana, Slovenia. The subscription rate for 2015 (Volume 39) is - 60 EUR for institutions, - 30 EUR for individuals, and - 15 EUR for students Claims for missing issues will be honored free of charge within six months after the publication date of the issue. Typesetting: Borut Žnidar. Printing: ABO grafika d.o.o., Ob železnici 16, 1000 Ljubljana. Orders may be placed by email (drago.torkar@ijs.si), telephone (+386 1 477 3900) or fax (+386 1 251 93 85). The payment should be made to our bank account no.: 02083-0013014662 at NLB d.d., 1520 Ljubljana, Trg republike 2, Slovenija, IBAN no.: SI56020830013014662, SWIFT Code: LJBASI2X. Informatica is published by Slovene Society Informatika (president Niko Schlamberger) in cooperation with the following societies (and contact persons): Robotics Society of Slovenia (Jadran Lenarcic) Slovene Society for Pattern Recognition (Janez Perš) Slovenian Artificial Intelligence Society (Dunja Mladenic) Cognitive Science Society (Urban Kordeš) Slovenian Society of Mathematicians, Physicists and Astronomers (Andrej Likar) Automatic Control Society of Slovenia (Sašo Blažic) Slovenian Association of Technical and Natural Sciences / Engineering Academy of Slovenia (Vojteh Leskovšek) ACM Slovenia (Andrej Brodnik) Informatica is financially supported by the Slovenian research agency from the Call for co-financing of scientific periodical publications. 
Informatica is surveyed by: ACM Digital Library, Citeseer, COBISS, Compendex, Computer & Information Systems Abstracts, Computer Database, Computer Science Index, Current Mathematical Publications, DBLP Computer Science Bibliography, Directory of Open Access Journals, InfoTrac OneFile, Inspec, Linguistic and Language Behaviour Abstracts, Mathematical Reviews, MatSciNet, MatSci on SilverPlatter, Scopus, Zentralblatt Math

Volume 39 Number 1 March 2015 ISSN 0350-5596
Informatica An International Journal of Computing and Informatics

1 Discriminating Between Closely Related Languages on Twitter (N. Ljubešić, D. Kranjčić)
9 Call Routing Based on a Combination of the Construction-Integration Model and Latent Semantic Analysis: A Full System (G. Jorge-Botana, R. Olmos, A. Barroso)
23 An Exact Analytical Grossing-Up Algorithm for Tax-Benefit Models (M. Verbič, M. Čok, T. Turk)
35 The slWaC Corpus of the Slovene Web (T. Erjavec, N. Ljubešić, N. Logar)
43 A Rule-Based System for Automatic De-identification of Medical Narrative Texts (J. Jaćimović, C. Krstev, D. Jelovac)
53 Probabilistic 2D Point Interpolation and Extrapolation via Data Modeling (D.J. Jakóbczak)
63 Swarm Intelligence and its Application in Abnormal Data Detection (B. Liu, M. Cai, J. Yu)
71 Experimental Comparisons of Multi-class Classifiers (L. Li, Y. Wu, M. Ye)
87 An Efficient Algorithm for Mining Frequent Closed Itemsets (G. Fang, Y. Wu, M. Li, J. Chen)
99 An Approach for Context-based Reasoning in Ambient Intelligence (H. Gjoreski)

Informatica 39 (2015) Number 1, pp. 1-101