Volume 39, Number 1, March 2015, ISSN 0350-5596

Informatica
An International Journal of Computing and Informatics

Editorial Boards

Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations.

Editing and refereeing are distributed. Each editor from the Editorial Board can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the list of referees. Each paper bears the name of the editor who appointed the referees. Each editor can propose new members for the Editorial Board or referees. Editors and referees inactive for a longer period can be automatically replaced. Changes in the Editorial Board are confirmed by the Executive Editors.

The coordination necessary is made through the Executive Editors, who examine the reviews, sort the accepted articles and maintain appropriate international distribution. The Executive Board is appointed by the Society Informatika. Informatica is partially supported by the Slovenian Ministry of Higher Education, Science and Technology.

Each author is guaranteed to receive the reviews of his article.
When accepted, publication in Informatica is guaranteed in less than one year after the Executive Editors receive the corrected version of the article.

Executive Editor - Editor in Chief
Anton P. Železnikar
Volaričeva 8, Ljubljana, Slovenia
s51em@lea.hamradio.si
http://lea.hamradio.si/~s51em/

Executive Associate Editor - Managing Editor
Matjaž Gams, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 251 93 85
matjaz.gams@ijs.si
http://dis.ijs.si/mezi/matjaz.html

Executive Associate Editor - Deputy Managing Editor
Mitja Luštrek, Jožef Stefan Institute
mitja.lustrek@ijs.si

Executive Associate Editor - Technical Editor
Drago Torkar, Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Phone: +386 1 4773 900, Fax: +386 1 251 93 85
drago.torkar@ijs.si

Contact Associate Editors
Europe, Africa: Matjaž Gams
N. and S. America: Shahram Rahimi
Asia, Australia: Ling Feng
Overview papers: Maria Ganzha

Editorial Board
Juan Carlos Augusto (Argentina), Vladimir Batagelj (Slovenia), Francesco Bergadano (Italy), Marco Botta (Italy), Pavel Brazdil (Portugal), Andrej Brodnik (Slovenia), Ivan Bruha (Canada), Wray Buntine (Finland), Zhihua Cui (China), Hubert L. Dreyfus (USA), Jozo Dujmovic (USA), Johann Eder (Austria), Ling Feng (China), Vladimir A. Fomichov (Russia), Maria Ganzha (Poland), Sumit Goyal (India), Marjan Gušev (Macedonia), N. Jaisankar (India), Dariusz Jacek Jakóbczak (Poland), Dimitris Kanellopoulos (Greece), Samee Ullah Khan (USA), Hiroaki Kitano (Japan), Igor Kononenko (Slovenia), Miroslav Kubat (USA), Ante Lauc (Croatia), Jadran Lenarčič (Slovenia), Shiguo Lian (China), Suzana Loskovska (Macedonia), Ramon L. de Mantaras (Spain), Natividad Martinez Madrid (Germany), Sanda Martinčić-Ipšić (Croatia), Angelo Montanari (Italy), Pavol Návrat (Slovakia), Jerzy R. Nawrocki (Poland), Nadia Nedjah (Brazil), Franc Novak (Slovenia), Marcin Paprzycki (USA/Poland), Ivana Podnar Žarko (Croatia), Karl H.
Pribram (USA), Luc De Raedt (Belgium), Shahram Rahimi (USA), Dejan Raković (Serbia), Jean Ramaekers (Belgium), Wilhelm Rossak (Germany), Ivan Rozman (Slovenia), Sugata Sanyal (India), Walter Schempp (Germany), Johannes Schwinn (Germany), Zhongzhi Shi (China), Oliviero Stock (Italy), Robert Trappl (Austria), Terry Winograd (USA), Stefan Wrobel (Germany), Konrad Wrona (France), Xindong Wu (USA), Yudong Zhang (China), Rushan Ziatdinov (Russia & Turkey)

Discriminating Between Closely Related Languages on Twitter

Nikola Ljubešić
University of Zagreb, Faculty of Humanities and Social Sciences, Ivana Lučića 3
E-mail: nikola.ljubesic@ffzg.hr, http://nlp.ffzg.hr/

Denis Kranjčić
University of Zagreb, Faculty of Humanities and Social Sciences, Ivana Lučića 3
E-mail: dkranjcic@ffzg.hr

Keywords: microblogging, language identification, closely related languages

Received: October 25, 2014

In this paper we tackle the problem of discriminating Twitter users by the language they tweet in, taking into account the very similar South Slavic languages Bosnian, Croatian, Montenegrin and Serbian. We apply a supervised machine learning approach by annotating a subset of 500 users from an existing Twitter collection with the language the users primarily tweet in. We show that by using a simple bag-of-words model, univariate feature selection, the 320 strongest features and a standard classifier, we reach a user classification accuracy of ~98%. Annotating the whole 63,160-user Twitter collection with the best-performing classifier and visualizing it on a map via tweet geo-information, we produce a Twitter language map which clearly depicts the robustness of the classifier.

Povzetek: V prispevku raziščemo problem ločevanja uporabnikov družabnega omrežja Twitter glede na to, v katerem jeziku tvitajo, pri čemer obravnavamo zelo podobne južnoslovanske jezike: bosanščino, hrvaščino, srbščino in črnogorščino.
Uporabimo pristop nadzorovanega strojnega učenja, kjer označimo vsakega uporabnika iz že obstoječe podatkovne množice 500 uporabnikov z jezikom, v katerem največ tvita. Pokažemo, da z uporabo enostavnega modela vreče besed, univariatno izbiro značilk, 320 najbolj pomembnih značilk in standardnim klasifikatorjem dosežemo ~97 % točnost klasifikacije posameznega uporabnika. Če uporabimo najboljši razviti klasifikator za označevanje naše celotne zbirke, ki zajema 63.160 uporabnikov, in rezultat prikažemo na zemljevidu z uporabo geografskih informacij na tvitih, izdelamo Twitter zemljevid jezikov, ki jasno pokaže robustnost razvitega pristopa.

1 Introduction

The problem of language identification, which was considered a solved task for quite some time, has recently regained popularity among researchers through more complex subproblems, such as discriminating between language varieties (very similar languages and dialects), identifying languages in multi-language documents, handling code-switching (alternating between two or more languages) and identifying language in non-standard user-generated content, which often tends to be very short (such as tweets).

In this paper we address the first and the last problem, namely discriminating between very similar languages in Twitter posts, with the relaxation that we do not identify language on the tweet level, but on the user level.

The four languages we focus on here, namely Bosnian, Croatian, Montenegrin and Serbian, belong to the South Slavic group of languages and are all very similar to each other. All the languages, except Montenegrin, use the same phonemic inventory, and they are all based on the write-as-you-speak principle. Croatian is slightly different in this respect, because it does not transcribe foreign words and proper nouns, as the others do. Moreover, due to the fairly recent standardization of Montenegrin, its additional phonemes are extremely rarely represented in writing, especially in informal usage. Serbian is the only language in which both the Ekavian and Ijekavian pronunciation and writing are standardized and widely used, while all the other languages use Ijekavian variants as a standard. The languages share a great deal of the same vocabulary, and some words differ only in a single phoneme/grapheme because of phonological, morphological and etymological circumstances. There are some grammatical differences regarding phonology, morphology and syntax, but they are arguably scarce and barely influence mutual intelligibility. The distinction between the four languages is drawn on the grounds of establishing a national identity rather than on prominently different linguistic features.

2 Related work

One of the first studies incorporating similar languages in a language identification setting was that of [9] who, among others, discriminate between Spanish and Catalan with an accuracy of up to 99% by using second-order character-level Markov models. In [11] a semi-supervised model is presented to distinguish between Indonesian and Malay by using the frequency and rank of character trigrams derived from the most frequent words in each language, lists of exclusive words, and the format of numbers. [3] use a bag-of-words approach to classify Chinese texts from the mainland and Taiwan with results of up to 92% accuracy. [13] propose a log-likelihood estimation method along with Laplace smoothing to identify two varieties of Portuguese (Brazilian and European), obtaining 99.5% accuracy.
In the first attempt at discriminating between the two most distant of the four languages of interest, namely Croatian and Serbian, [6] have shown that by using a second-order character Markov chain and a list of forbidden words, the two languages can be differentiated with a very high accuracy of ~99%. As a follow-up, [12] add Bosnian to the language list, showing that most off-the-shelf tools are in no way capable of solving this problem, while their approach of identifying blacklisted words reaches an accuracy of ~97%. [5] have worked with the same three languages as a subtask of producing web corpora of these languages. They have managed to outperform the best-performing classifier from [12] by training unigram language models on the entire content of the collected web corpora, decreasing the error on the Croatian-Serbian language pair to a quarter. Recently, as part of the DSL (Discriminating between Similar Languages) 2014 shared task of discriminating between six groups of similar languages on the sentence level [14], language group A consisted of Bosnian, Croatian and Serbian, and the best result in the group was 93.6% accuracy, which is not directly comparable to the aforementioned results because classification was performed on the sentence level, and not on the document level as in previous research.

Language identification on Twitter data has become a popular problem in recent years. [1] use language identification to create language-specific Twitter collections of low-resource languages such as Nepali, Urdu, and Ukrainian. [2] use character n-gram distance with additional microblogging characteristics such as the language profile of a user, the content of an attached hyperlink, the language profile of mentioned users and the language profile of a hashtag. [7] review a wide range of off-the-shelf tools for Twitter language identification, and achieve their best results with a simple voting over three systems.
To the best of our knowledge, there have been only two attempts at discriminating between languages with a high level of similarity on Twitter data. The first attempt dealt with Croatian and Serbian [4], where word unigram language models built from Croatian and Serbian web corpora were used in an attempt to divide users from a Twitter collection between the two languages. An analysis of the annotation results showed that there is substantial Twitter activity of Bosnian and Montenegrin speakers in the collection and that the collected data cannot be described with a two-language classification schema, but rather with a four-class schema that includes the remaining two languages. The second attempt focused on Spanish varieties spoken in five different countries [8], using geo-information as a gold standard and obtaining best results with a voting meta-classifier approach that combines the results of four single classifiers.

Our work builds on top of the research presented in [4] by defining a four-language classification schema, inside which Montenegrin, a language that gained official status in 2007, is present for the first time. Additionally, this is the first focused attempt at discriminating between those languages on Twitter data.

3 Dataset

The dataset we run our experiments on consists of tweets produced by 500 randomly picked users from the Twitter collection obtained with the TweetCaT tool described in [4]. This Twitter collection currently consists of 63,160 users and 42,744,935 tweets. The collection procedure is still running, which opens the possibility of the collection becoming a monitor corpus of user-generated content in the four languages. For annotating the dataset there was only one annotator available. Annotating a portion of the dataset by multiple annotators and inspecting inter-annotator agreement is considered future work.
Having other languages in the dataset (mostly English) was tolerated as long as more than 50% of the text was written in the annotated language. Among the 500 users there were 10 users who did not fit any of the four classes and were therefore removed from the dataset. One user, tweeting in Bosnian, had most of the tweets in English, one user was tweeting in Macedonian, and 8 users were tweeting in Serbian but used the Cyrillic script. The users tweeting in Serbian in the Cyrillic script were discarded from the dataset because we wanted to focus on discriminating between the four languages based on content and not on the script used.

The result of the annotation procedure is summarized in the distribution of users according to their language, presented in Table 1.

language          instance #   percentage
Bosnian (bs)              45        9.18%
Croatian (hr)             42        8.57%
Montenegrin (me)          25        5.10%
Serbian (sr)             378       77.14%

Table 1: Distribution of users by the language they tweet in.

We can observe that Serbian makes up 77% of the dataset. There is a similar amount, around 9%, of Bosnian and Croatian data, while Montenegrin is least represented with around 5% of the data. These results are somewhat surprising because there is a much higher number of speakers of Croatian (around 5 million) than of Bosnian (around 2 million) or Montenegrin (below 1 million). Additionally, Croatia has the highest GDP of all the countries involved, and one would expect the adoption rate of such new technology to be higher, not lower, than in the remaining countries.

Because we plan to discriminate between the four languages on the user level, we are naturally interested in the amount of textual data we have at our disposal for each instance, i.e. user. Figure 1 represents the amount of data available per user, measured in the number of words. The plotted distribution has its minimum at 561 words and its maximum at 29,246 words, whereas the arithmetic mean lies at 6,607 words.
This distribution shows that we have quite a large amount of textual data available for the majority of users. We inspect the impact of the amount of data available for predicting the language in Section 4.5.

Figure 1: Distribution of dataset instances given the size in number of words.

4 Experiments

We perform data preprocessing, feature extraction and data formatting using simple Python scripts. All the machine learning experiments are carried out with scikit-learn [10]. Our evaluation metric, if not stated otherwise, is accuracy calculated via stratified 10-fold cross-validation. We extract our features only from the text of the tweets. Using geolocation and user metadata (such as name, bio and location) is considered future work.

We experiment with the following preprocessing procedures:
- no preprocessing
- filtering out mentions, hashtags and URLs (making the data more representative of user-generated content in general)
- dediacritizing the text (thereby lowering data sparsity)

and the following sets of features:
- words
- character 3-grams
- character 6-grams
- words and character 6-grams

Because no significant difference in accuracy was observed when using either different preprocessing procedures or sets of features (except for a slight drop when using character 3-grams), in the remainder of this section we present the results obtained by filtering out mentions, hashtags and URLs and using words as features. By skipping dediacritization we keep the preprocessing level to a minimum, while by using words as features we ensure easy interpretability of procedures such as feature selection. Finally, by removing textual specificities of Twitter like mentions and hashtags we ensure maximum applicability of the resulting models to other user-generated content besides tweets.
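The setup described above maps directly onto scikit-learn primitives. The following is a minimal sketch with invented toy data standing in for the per-user tweet texts (the real experiments use 490 users, four languages and k = 320 selected features; the cleaning regex and all names here are our own illustration, not the paper's actual scripts):

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Filtering out mentions, hashtags and URLs, as described above.
NOISE = re.compile(r"@\w+|#\w+|https?://\S+")

def clean(text):
    """Remove Twitter-specific tokens so the text resembles
    general user-generated content."""
    return NOISE.sub(" ", text)

# Toy stand-in for the dataset: one document = all tweets of one user.
users = (["kava i vecer @netko"] * 5 + ["glazba i obitelj #tema"] * 5
         + ["kafa i vece http://t.co/x"] * 5 + ["muzika i familija"] * 5)
labels = ["hr"] * 10 + ["sr"] * 10
docs = [clean(u) for u in users]

# Bag-of-words features, univariate feature selection and multinomial
# Naive Bayes; placing SelectKBest inside the pipeline re-selects the
# features on each training fold.  k is illustrative only.
clf = Pipeline([
    ("bow", CountVectorizer()),
    ("select", SelectKBest(f_classif, k=8)),
    ("nb", MultinomialNB()),
])

# Accuracy via stratified cross-validation (the paper uses 10 folds).
scores = cross_val_score(clf, docs, labels, cv=StratifiedKFold(n_splits=5))
print(round(scores.mean(), 2))  # → 1.0 on this trivially separable toy data
```

On real data the interesting knob is k, the number of strongest features kept; Section 4.2 of the paper explores exactly this dimension.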
4.1 Initial experiment

The aim of the initial experiment was to get a feeling for the problem at hand by experimenting with various classifiers and features. We experiment with traditional classifiers, namely multinomial Naive Bayes (MultinomialNB), K-nearest neighbors (KNeighbors), a decision tree (DecisionTree) and a linear support-vector machine (LinearSVM). We use the linear SVM because the number of features is much greater than the number of instances. For each classifier we use the default hyperparameter values, except for the linear SVM classifier, for which we tune the C hyperparameter for highest accuracy.

classifier      accuracy ± stdev
DecisionTree     0.896 ± 0.026
KNeighbors       0.772 ± 0.040
LinearSVM        0.884 ± 0.034
MultinomialNB    0.806 ± 0.029

Table 2: Accuracy with standard deviation obtained with different classifiers using all words as features.

In the results presented in Table 2 we can observe that LinearSVM and DecisionTree produce the highest accuracy. The significantly lower accuracy of the MultinomialNB classifier, which normally gives state-of-the-art results on bag-of-words models but has no inherent feature selection, leads us to hypothesize that our results could improve if we applied explicit feature selection to our data. This follows our intuition that similar languages can be discriminated through a limited number of features, i.e. words, and not through the whole lexicon, which is normally shared to a great extent among such closely related languages.

Figure 2: Classification accuracy as a function of the number of most informative features used.

4.2 Feature selection

Although there are stronger feature selection algorithms, we opt for a simple univariate feature selection algorithm which calculates a p-value for each feature with respect to the response variable through the ANOVA F-test.
Finally, it simply returns the user-specified number (or percentage) of features with the lowest p-values. We use this simple feature selection method because we assume independence of our features, i.e. tokens or character n-grams, which is a reasonable assumption for language identification.

During these experiments we calculate accuracy via 10-fold cross-validation, performing feature selection each time on the 90% of the data used for model estimation. The results of experimenting with up to 20% (cca. 42,000) of the strongest word features are shown in Figure 2. Here we can observe a series of properties of the classifiers used. First of all, LinearSVM and DecisionTree, having implicit feature selection / weighting, operate similarly across the whole range of numbers of features available, but still show better performance when using only a few hundred of the strongest features. On the other hand, MultinomialNB and KNeighbors show significantly better performance when they deal with the strongest features only. The best results are obtained with the MultinomialNB classifier at 320 features, reaching an accuracy of 97.97%. A numerical comparison of the best results obtained with the four classifiers is given in Table 3.

classifier     # of feats   acc ± stdev
DecisionTree          100   0.927 ± 0.019
KNeighbors            100   0.911 ± 0.041
LinearSVM             320   0.961 ± 0.025
MultinomialNB         320   0.980 ± 0.016

Table 3: Maximum accuracy obtained with each classifier, with the number of strongest features used.

We present more detailed results obtained with the best-performing MultinomialNB classifier, trained on 320 features, in Table 4. It contains the confusion matrix of the classification process along with precision, recall and F1 obtained on each class. We can observe that the classification process is most successful on Serbian and Croatian, while the worst results are obtained on Montenegrin, which gets confused with both Bosnian and Serbian.

     bs   hr   me   sr      P     R    F1
bs   42    0    3    0   0.95  0.93  0.94
hr    1   41    0    0   0.98  0.98  0.98
me    0    0   23    2   0.82  0.92  0.87
sr    1    1    2  374   0.99  0.99  0.99

Table 4: Confusion matrix and precision, recall and F1 per class for the best performing classifier.

4.3 Evaluation on the test set

To perform a final test of our best performing classifier we produced an independent test set consisting of 101 annotated users. The MultinomialNB classifier, trained on all 490 users available from our development set, with the 320 strongest features identified on that dataset, produces an accuracy of 99.0%, having just one Bosnian user identified as Montenegrin. This experiment emphasizes the robustness of our classifier.

4.4 Analysis of the selected features

Using words as features, and not character 6-grams that perform equally well, enables us to easily interpret our final model. In Table 5 we present a systematization of the 320 features selected on the whole development set by language and by the linguistic type of feature.

                     bs     hr    me    sr
yat reflex         40.8   42.2  44.4  11.3
phonological        6.0   29.3   9.0   2.3
lexical             6.0   48.6   9.8   2.6
orthography         7.5    7.5   2.0   0.0
toponym, cultural   5.0   19.0  26.0   0.0
sum                65.3  146.8  91.3  16.2

Table 5: Feature type distribution across languages.

The features are divided into five categories across the four languages: yat reflex, phonological differences, lexical differences, orthography, and toponym or cultural differences. Each feature contributes one point to the table: if a feature is present in more than one language, this point is divided among the languages, and if a feature belongs to more than one feature type, the point is divided among those feature types. Almost half of the features belong to the "reflex of yat" category, which is least informative because most of the Ijekavian features are equally present in Croatian, Bosnian and Montenegrin.
The exceptions are the words that are distinct both in the "reflex of yat" category and the lexical category, and a few examples of the Montenegrin-specific reflex of yat in words such as "nijesam" or "de" (which also belongs to the "phonological differences" category). The "phonological differences" category contains a lot of words present only in Croatian, such as "itko", "kava" or "vecer" ("iko", "kafa" and "vece" in the other three languages). On the other hand, words that differ in only one phoneme and are not specific to Croatian are often spread among the remaining three languages. The category of lexical differences is similar in this respect: more than 70 percent of these features are Croatian. This can be explained by the fact that lexical purism is much more pronounced in Croatian than in the other three languages, which can be observed in the names of the months and some everyday words, such as "obitelj" (family), "glazba" (music), "izbornik" (menu) etc. In place of these words, Bosnian, Montenegrin and Serbian use words of evident foreign origin: "familija" (family), "muzika" (music), "meni" (menu) etc. The category of "orthography" predominantly contains infinitive verb forms without the final "i" letter, which appear in the future tense in Croatian orthography and which are also allowed in Bosnian. Finally, there is the category containing toponyms and culturally-specific items, such as country and city acronyms, names for residents, currencies, TV stations and even some public figures.

Although the features in the table are divided according to their real distribution among the languages, their distribution in the model sometimes differs. The reason for this is a significant difference between Croatian users and their language on the one side, and the rest on the other.
Whereas Bosnian, Montenegrin and Serbian users are predominantly young people who use Twitter for chatting and sharing their everyday experiences, Croatian users are frequently news portals, shops, musicians, politicians etc. Consequently, Croatian on Twitter is marked by a much more formal register compared to the casual register of the other languages in our model.

4.5 Impact of the amount of data available for prediction

Having a test set at our disposal opened the possibility of performing one additional experiment on the impact of the amount of data available for our language predictions. In our test set the user with the least amount of textual material contains 864 words. We therefore evaluated the classifier trained on the whole development set by using only the first N words from each user in the test set, with N ranging from 10 to 850. In Figure 3 we present the obtained results, representing each language with an F1 curve and all the languages together with a micro-F1 curve. We can observe that the results peak and stabilize once we have ~470 words at our disposal for prediction. This is an interesting result, showing that the large amount of data we have available for each user is actually not necessary. On the other hand, the results show quite clearly that discriminating between these languages on the level of each tweet would, at least with the presented classifier, be impossible given that the average tweet size is 10 words. Having significantly more training data available for each language could make tweet-level classification possible since for Serbian, which covers 77% of the training data, we already obtain a decent F1 of 0.88 at 10 words.

5 Corpus annotation and visualization

To be able to distribute separate Twitter collections of the four languages, we annotated each of the 63,160 users from our Twitter collection of Bosnian, Croatian, Montenegrin and Serbian.
The annotation was performed with the MultinomialNB classifier trained on both the 490 development and the 101 testing instances, again selecting the 320 strongest features on that dataset. Once we had our collection annotated, we decided to present the result of our language discriminator on a map.

Figure 3: F1 per specific language as a function of the length of the testing instances.

Figure 4: The Twitter language map.

We presented each of the 576,786 tweets having geo-location available as a point on the map, encoding the predicted language of the author of the tweet with a corresponding color. We call the map presented in Figure 4 a "Twitter language map". The figure shows the area of the four countries in which the four languages have official status. We can observe that the tweets follow the country borders quite consistently, which is an additional argument that our classifier works properly. From the plot we can also confirm that Twitter is much more popular and widespread in Serbia than in the remaining countries. Mixing of the four languages occurs, as one would expect, mostly in big cities, primarily Belgrade, the capital of Serbia. There we can observe a significant number of Montenegrin-speaking Twitter users. To perform a sanity check regarding the correctness of these data, we manually inspected ten random users classified as Montenegrin and tweeting in the wider Belgrade area. The inspection showed that all ten users actually tweet in Montenegrin. Overall, we can observe that Croatia and Serbia have a higher number of foreign-tweeting users, which is easily explained by the well-known migrations from Bosnia to both Croatia and Serbia, and from Montenegro primarily to Serbia.
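A map of this kind can be produced by scattering the geo-tagged points, colored by the predicted language of their author. The following is a rough sketch with four invented coordinate triples standing in for the 576,786 geo-tagged tweets (the color scheme and file name are our own; the paper does not specify its plotting setup):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

# Toy stand-in for the geo-tagged tweets: (longitude, latitude,
# predicted language); real coordinates come from tweet geo-information
# and labels from the trained classifier.
points = [
    (15.98, 45.81, "hr"),  # Zagreb
    (20.46, 44.82, "sr"),  # Belgrade
    (18.41, 43.86, "bs"),  # Sarajevo
    (19.26, 42.44, "me"),  # Podgorica
]
colors = {"bs": "green", "hr": "blue", "me": "orange", "sr": "red"}

fig, ax = plt.subplots()
for lon, lat, lang in points:
    # One small dot per tweet, colored by the predicted language.
    ax.scatter(lon, lat, c=colors[lang], s=4)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Twitter language map (sketch)")
fig.savefig("language_map.png")
```

With hundreds of thousands of points, the dots aggregate into colored regions, which is what makes the agreement with country borders visible at a glance.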
6 Conclusion

In this paper we have presented a straightforward approach to discriminating between closely related languages of Twitter users by training a classifier on a dataset of 490 manually labeled users. By using the bag-of-words model, the 320 strongest features according to univariate feature selection and the multinomial Naive Bayes classifier, we obtained a very good accuracy of 97.97% on the development set and 99.0% on the test set. The best results were obtained on Croatian and Serbian, while most errors occurred when identifying Montenegrin. Analyzing the impact of the amount of data available for classification showed that classification accuracy stabilizes at ~470 words per user, which still does not enable us to use this classifier on the tweet level. Finally, we annotated the whole 63k-user-strong collection of tweets and presented the collection on a map we call the "Twitter language map". The map shows that the language used on Twitter follows the country borders quite precisely, large cities being an exception to this rule.

Future work includes adding more information to our model besides the words from tweets. The strongest candidates are the content to which users link and user meta-information such as username, location and bio. Using the geolocation information from tweets, when available, is surely a good source of information as well. Additionally, using the geolocation information as our response variable, i.e. redefining our task as predicting the location of a Twitter user, is also a very interesting line of research. This surely increases the complexity of the task, but opens the door towards identifying dialects and sociolects.

Acknowledgement

The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

References

[1] Bergsma, S., McNamee, P., Bagdouri, M., Fink, C., and Wilson, T. (2012).
Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, LSM '12, pages 65-74, Stroudsburg, PA, USA. Association for Computational Linguistics.

[2] Carter, S., Weerkamp, W., and Tsagkias, M. (2013). Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195-215.

[3] Huang, C.-R. and Lee, L.-H. (2008). Contrastive approach towards text source classification based on top-bag-of-word similarity. In PACLIC, pages 404-410. De La Salle University (DLSU), Manila, Philippines.

[4] Ljubešić, N., Fišer, D., and Erjavec, T. (2014). TweetCaT: a Tool for Building Twitter Corpora of Smaller Languages. In Calzolari, N. (Conference Chair), Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and Piperidis, S., editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

[5] Ljubešić, N. and Klubička, F. (2014). {bs,hr,sr}WaC - web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 29-35, Gothenburg, Sweden. Association for Computational Linguistics.

[6] Ljubešić, N., Mikelić, N., and Boras, D. (2007). Language identification: How to distinguish similar languages. In Lužar-Stifter, V. and Hljuz Dobrić, V., editors, Proceedings of the 29th International Conference on Information Technology Interfaces, pages 541-546, Zagreb. SRCE University Computing Centre.

[7] Lui, M. and Baldwin, T. (2014). Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 17-25. Association for Computational Linguistics.

[8] Maier, W. and Gómez-Rodríguez, C. (2014). Language variety identification in Spanish tweets.
In Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants, pages 25-35, Doha, Qatar. Association for Computational Linguistics.

[9] Padró, L. and Padró, M. (2004). Comparing methods for language identification. Procesamiento del Lenguaje Natural, 33:155-162.

[10] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

[11] Ranaivo-Malancon, B. (2006). Automatic Identification of Close Languages - Case study: Malay and Indonesian. ECTI Transactions on Computer and Information Technology, 2(2):126-134.

[12] Tiedemann, J. and Ljubešic, N. (2012). Efficient discrimination between closely related languages. In Proceedings of COLING 2012, pages 2619-2634, Mumbai, India.

[13] Zampieri, M. and Gebre, B. G. (2012). Automatic identification of language varieties: The case of Portuguese. In Proceedings of KONVENS 2012 - The 11th Conference on Natural Language Processing.

[14] Zampieri, M., Tan, L., Ljubešic, N., and Tiedemann, J. (2014). A Report on the DSL Shared Task 2014. In Proceedings of the VarDial Workshop.

Call Routing Based on a Combination of the Construction-Integration Model and Latent Semantic Analysis: A Full System

Guillermo Jorge-Botana 1,3, Ricardo Olmos 2,3 and Alejandro Barroso 3
1 Universidad Nacional de Educación a Distancia (UNED), Calle Juan del Rosal, 10, Madrid, Spain.
2 Universidad Autónoma de Madrid, Campus de Cantoblanco, 28049, Madrid, Spain.
3 Semantia Lab, c/ Bravo Murillo n° 38, 28015 Madrid, Spain.
E-mail: gdejorge@psi.uned.es

Keywords: call routing, call steering, natural language, latent semantic analysis, construction-integration framework, psycholinguistically motivated algorithms

Received: June 24, 2014

This study stems from a previous article [1] in which we found that a psycholinguistically motivated mechanism based on the Construction-Integration (C-I) model [2, 3] could be used for call classifiers in systems based on Latent Semantic Analysis (LSA). In it we showed that more robust results were obtained with this model when categorizing call transcriptions. However, this method was not tested in a context of calls in audio format, where a voice recognition application would be involved. The most direct implication of a voice recognition application is that the text to be categorized may be impoverished and is subject to noise. This impoverishment normally translates into deletions and insertions which are semantically arbitrary but phonetically similar. The aim of this study is to describe the behavior of a complete system, with calls in audio format that are transcribed by a voice recognition application using a Stochastic Language Model (SLM), and then categorized with an LSA model. This process optionally includes a mechanism based on the C-I model. In this study different parameters were analyzed to assess the automatic router's rate of correct choices. The results show that once again the model based on C-I is significantly better, but the benefits are more remarkable when the utterances are long. The paper describes the system and examines both the full results and the interactions in some scenarios. The economy of resources and flexibility of the system are also discussed.

Povzetek: Razvit je sistem za prepoznavanje govora z namenom uspešnega iskanja storitev ali entitet.
1 Introduction

Interactive Voice Response (henceforth, IVR) systems enjoy extremely widespread use today to provide customer service (Self Service). One of their drawbacks is that many of them consist of menus that limit user options to a few items, meaning that many intentions are not described within these items. This happens, for example, when a person wants something but believes that it is not represented in the menu. This constitutes a findability problem, since the category that represents what the person wants is not found. One of the alternatives that have been used is to employ spontaneous speech recognition techniques and subsequently categorize spontaneous utterances¹ into subject areas. These techniques are usually called "call routing" or "call steering" [4]. In them, rather than choosing between several items on a menu, the person simply hears a prompt such as "what would you like to do?" and responds spontaneously in natural language. What the person says is recognized with the help of an Automatic Speech Recognition (ASR) module, and once converted to text it is categorized and sent to a route where the user's needs are catered for. This is beneficial in terms of busy channels (the call will only be in one of the switchboard channels for a few seconds), and also beneficial in terms of convenience, as callers will be spared the effort associated with linking the representation of what they want to the representation of the categories expressed by the menu items, something which is not always intuitive [5, 6].

Generically speaking, the process of Call Routing described above involves two steps. The first step, voice recognition, involves phonetic models of the language in which the service is offered, as well as a Stochastic Language Model (henceforth SLM), which is a formal representation of the probabilities of a word occurring if others have occurred beforehand. These SLMs are habitually 3-gram or 4-gram models, frequently calculated using Maximum Likelihood Estimation and corrected by means of some smoothing method (for example Good-Turing). In the end, the ASR module formulates its recognition hypotheses, taking the phonetic models of the language into account, as well as the probabilities provided by the SLM. The relative importance of each element (phonetic model vs. SLM) - in other words, the way of weighting the phonetic model over the SLM or vice-versa - can often be configured in voice recognition devices. Finally, from the scores yielded by the phonetic and SLM models, the ASR module generates a number of recognition hypotheses ordered by the confidence level which it assigns to each of them (a list of possible utterances).

What the user has said is already in a text format, and now it is time to assign it to a destination - this will be the second step. To this end, classification techniques will be used to determine when a text (in this case the user request) belongs to one category or another (in this case to one destination or another). Many classification techniques have been used to this end, with various results, depending also on the text sample and parameters [7]. Some examples are probabilistic models such as Maximum Entropy, artificial neural networks, and high-dimensional spaces, the category to which Latent Semantic Analysis (from now on LSA) would belong. All of the techniques mentioned have been extensively tested, but the current challenge is to achieve a more optimal and less biased representation of the words employed in the utterances, doing so even when data are not labeled or are retrieved and reanalyzed from samples which were not classified due to their difficulty [1, 8, 9].

¹ Utterance: This is how we habitually refer to the phrase returned by the ASR module. We will also refer to the audio transcription of each call as an utterance.
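As a concrete illustration of the SLM step, the sketch below estimates bigram probabilities by plain Maximum Likelihood over a toy corpus (the sentences are invented for illustration); a production model would use 3- or 4-grams plus a smoothing method such as Good-Turing, as described above.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum Likelihood Estimation of bigram probabilities:
    P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# toy training corpus (invented)
lm = train_bigram_lm(["I want to cancel my contract",
                      "I want to top up my phone"])
# P(want | I) = 2/2 = 1.0; P(contract | my) = 1/2 = 0.5
```

An ASR engine combines such probabilities with its phonetic scores when ranking the n-best hypotheses.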
In this study we have carried out a classification task employing LSA, but, as in these last studies, we have also opted to provide a better vectorial representation of the utterances. To this end, before performing the classification, we pre-processed the utterances using a technique based on a cognitive (or psycholinguistic) model that tries to account for the involvement of prior knowledge in the construction of meaning, and which also perfectly suits the philosophy underlying LSA: the Construction-Integration model [2, 3]. In fact, this study stems from a previous article [1] in which we found that a C-I based technique could be used for call classifiers, but we wanted to test it in a case which involved a voice recognition application, in order to show the performance of the entire system. Both LSA and C-I will be described in more detail in later sections.

2 LSA and Call Routing

LSA was originally described as a method for Information Retrieval [10], although some authors went beyond the original conception and adapted it as a psychological model of the acquisition and representation of lexical knowledge. In recent years its capacity to simulate aspects of human semantics has been widely demonstrated. For example, it adequately reflects how children learn the meanings of words without the need for massive exposure [11]. It is, to summarize, a technique that originated in an applied field, but which models some aspects of human linguistic behavior; this in turn can provide benefits for the field of application. LSA is a vectorial representation of language which is built by exploiting word occurrences in different contexts using large linguistic corpora (or text samples). It can be conceived as an automatic sequence of mathematical and computational methods, including pruning, weighting and dimension reduction via Singular Value Decomposition (SVD), to represent the meaning of words and text passages [10, 11].
A key issue is that every term or document (a document being a paragraph or sentence, or, in the case of call categorization, a destination cluster or a call) in the corpus is expressed as a k-dimensional vector. Once the whole process is carried out, the cosine of the angle between two vectors is frequently used to evaluate the semantic relationship between the two terms or between the two documents corresponding to the vectors (formula 1). A close semantic relationship between two words, or between two documents, is shown by a cosine close to one, whilst two semantically unrelated words or documents have a cosine that is close to zero (orthogonality) or even slightly negative. In addition, sentences or texts that are not in the document matrix can be introduced as if they actually were included, using a technique commonly known as Folding-in, which projects the new document into the document matrix (formula 2). In the case of categorization, a vector of the user's utterance will be projected, plus one vector for each of the exemplar texts that represent the destinations, in such a way that if they are similar it will be inferred that what the user wants must be routed to this destination. A new vector d can be created by computing an utterance c (a new column vector in the occurrence matrix including all the terms that occur in it) and then multiplying it by the term matrix, usually called U, and the inverse of the diagonal matrix, usually called S; c is also computed by applying the same weights as in the creation of the original space.

cos(v1, v2) = (v1 · v2) / (||v1|| ||v2||)   (1) Similarity

d = c^T U S^(-1)   (2) Folding-in

In the field of Call Routing, LSA has been used mainly with two motivations: correction of speech recognition hypotheses and assignation of utterances to destinations (both are occasionally used in the same router).
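Formulas 1 and 2 are straightforward to express with NumPy; in the sketch below the tiny U and S matrices and the term-count vectors are invented for illustration, not taken from a trained space.

```python
import numpy as np

def cosine(v1, v2):
    # Formula 1: similarity as the cosine of the angle between vectors
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

def fold_in(c, U, S):
    # Formula 2: d = c^T U S^-1 projects a new utterance (a column of
    # term occurrences, weighted as in training) into the k-dim space
    return c @ U @ np.linalg.inv(np.diag(S))

# toy space: 4 terms, k = 2 dimensions (invented values)
U = np.array([[0.5, 0.1], [0.5, -0.1], [0.1, 0.7], [0.1, -0.7]])
S = np.array([2.0, 1.0])
d1 = fold_in(np.array([1.0, 1.0, 0.0, 0.0]), U, S)  # terms 1 and 2
d2 = fold_in(np.array([1.0, 0.0, 0.0, 0.0]), U, S)  # term 1 only
sim = cosine(d1, d2)  # high: the two pseudodocuments share a term
```

In the router, the same fold-in is applied both to the user utterance and to the destination exemplar texts before comparing them.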
Concerning the correction of speech recognition output, some studies have tested lists of common potential confusions on the part of speech recognition applications (homophones and near-homophones, {affect, effect}, {quiet, quite}), obtaining good results when such mistakes were corrected by checking their contextual coherence with indices of semantic similarity taken from an LSA model [12]. These results were better than those of a model that combines a trigram model of the parts of speech with a Bayesian classifier, leading to the conclusion that LSA is a good option in these conditions. Another study confirmed the usefulness of LSA in conjunction with a model based on n-grams [13]. This study concludes that the analysis by means of n-grams is limited to a text window that is too local (normally 3 or 4 words), and that the LSA model greatly enriches the results from the router if the probabilities of the predicted word given some history sequence are also calculated in terms of the LSA model, that is, by means of the similarity between the word and the text of the history, which is projected as a pseudodocument. This probability, based on the semantic space, will be highest for words whose meaning aligns most closely with the meaning of the history and lowest for words that do not convey any particular information about it, for example function words. It is argued that this fact is complementary to n-gram models, because it is their exact opposite: n-gram models tend to assign higher probabilities to (frequent) function words than to (rarer) content words. So the added value of this study is to integrate both kinds of probabilities in the generic LM (Language Model) equations.
Some authors extend this same procedure, improving the representation of the history sequence that acted as a pseudodocument in the original study [14]. Due to the brevity of the history for the utterances (and hence of its pseudodocument), they replace the history pseudodocument with a more representative one, extracted from its semantic vicinity. They also use the first nearest semantic neighbors to estimate unseen sequences, as is usually done when using smoothing methods. Other studies also obtained good results by interpolating the LSA model and the n-gram model [15], directly correcting speech recognition hypotheses and reassigning new probabilities to them, taking into account the coherence of the lexical context that accompanies each word [16]. Similarly, another study performed a series of experiments to correct classification errors [17]. The authors corrected speech recognition outputs using indices of syntactic and semantic coherence (and a combination of both). As for LSA, they reported that it yielded results comparable to those of WordNet [18, 19] (LSA being more economical and flexible), and that, when certain parameters were used, a combination of the two measures led to an improvement in error correction. This same philosophy has also motivated some studies in which the probabilities of the sequences of SLMs - in other words, the likelihood of a word occurring given a history h (the sequence of words that precede it) - are recalculated on the basis of the semantic similarity between the word vector and the vector of its history [20]. As for assigning utterances to destinations, a first study proposed a system in which the user response to the typical "say anything" cue is classified by an LSA module according to candidate destinations [4].
The module will compare the vector representation of what the user has said (utterance vector) with the vectors that represent each of the destinations, made up of a compilation of all the calls that belong to each of those destinations. This module also has a disambiguation mechanism in the event that the utterance vector is similar to several destinations. In this case, terms will be found that represent the difference vectors between the utterance vector and each of the destination vectors. Once found, only the terms that may form bigrams or trigrams with a term from the original request will be used. These terms are used to formulate questions for the users in order to disambiguate. One peculiarity of this study is that they used 4,497 transcriptions from a banking services call-center in the LSA training phase. The occurrence matrix to begin the LSA process consists of terms and routes (rather than terms and transcriptions), and as a result few columns are produced - 23, to be precise. This is precisely the criticism made in a later study [21]: the authors specified that they took the possible destinations rather than call transcriptions as documents, so that the LSA training was quite limited. They obtained better results in their laboratory if the documents of the matrix were composed of call transcriptions. Another study introduced a variant in the preprocessing stage [22]. When the corpus was trained, the Information Gain (IG) was calculated in order to identify the terms that actually contributed useful information to the router. The IG index is based on the variations in document entropy (the amount of information carried by a document) with and without the term analyzed. Good results were later also reported when using a training variant in which the labels flagging the transcriptions (the destinations or routes) were entered as terms [23]. This enabled the authors to bring together transcriptions that had been routed to the same places. 
Other authors have also obtained improvements by introducing an additional step between recognition and Call Routing [24]. These authors did not enter the "utterance" as collected by the ASR module (with a generated SLM) directly - rather, they corrected it by using the LSA model confidence indices. Using this method, calls that were more than eight words long (about 12 words) were routed slightly better, and this improvement is greater if the ASR module has laxer criteria (a lower confidence threshold of acceptance). As in study [24], the motivation of the present study is also to introduce an additional step between recognition and LSA-based Call Routing, in order to differentially reassign the importance of each term of each utterance. This is done by means of the Construction-Integration model [2, 3] (henceforth, C-I). C-I is a psycholinguistically motivated model that seeks to simulate working memory and real-time comprehension mechanisms, but it can be applied to categorization tasks such as Call Routing. We will return to this model in a later section.

3 Objectives

There are two main objectives of this study, a general one and a specific one:

1. To implement (and describe in detail) a real full system in which we used LSA and a mechanism based on the Construction-Integration model (C-I).

2. To study the efficiency of the system in a real speech recognition situation by analyzing the percentage of correctly classified utterances under different parameters. The percentage of correctly classified utterances thus indicates the quality of the system. The manipulated parameters were (1) the use or not of an additional step based on C-I (on top of LSA-based Call Routing), (2) the length of utterances, (3) the decrease in recognizer performance, and (4) the number of hypotheses processed from the n-best list derived from the voice recognition application.
4 Functional description of the system

One of the aims of this paper is to present a detailed study of the results obtained in a real call router in which a mechanism based on Kintsch's Construction-Integration model [2] was introduced. This model had already been partially tested in a previous study [1], but here the full process is described, including voice recognition for spontaneous speech and classification according to possible destinations. The system comprises several modules and is domain-independent, as it was trained using texts from different subject areas. For this study, the domain subject area was customer service at a Telco (credit balance, top-ups, complaints, phones, etc.).

4.1 The voice recognition phase

Prior to any semantic interpretation, we need a voice recognition stage. The process consists of several layers. The first layer is responsible for recognizing words using the phonetic model of a language. There are numerous packages that perform voice recognition using a phonetic model of a particular language, based on sequences of letters and their pronunciation. These packages usually have standard dictionaries that specify the pronunciation of very common words, as well as dictionaries for the pronunciations that the integrator itself considers to be correct. In general, for recognition of phrases to occur, we need to explicitly specify which phrases we expect to be uttered. They are specified using deterministic grammars in different formats (abnf, grxml, gsl, etc.). But if our aim is the recognition of spontaneous speech, we must generate a Stochastic Language Model (SLM), which uses a large linguistic corpus to generate a model where the probabilities of some words appearing are specified, given those that have occurred previously. To calculate these probabilities, Maximum Likelihood Estimation (from now on ML) is commonly used, corrected by a smoothing method that estimates the occurrences of words within some ranges [7].
One of these smoothing methods is Good-Turing. The package that we use to calculate probabilities in our model is SRILM [25]. It has Good-Turing as the default method (see http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html). It works as follows: by default, the unigrams that occur more than once and the n-grams that occur more than seven times are considered reliable, and standard ML is applied to calculate their probabilities. But if an n-gram occurs seven times or fewer, a correction is applied to the probability extracted from ML using the Good-Turing smoothing technique. It is also possible to estimate n-grams that do not occur in the reference corpus with the Katz method, using the BOW (back-off weight) of the history of each n-gram and the smoothed probabilities, but for simplicity's sake this is not implemented in our system. Therefore, our model only contains conditional probabilities of n-grams that appear in the training corpus, smoothed using the Good-Turing method. As was previously mentioned, all these calculations are carried out by the SRILM package, in which individual scripts are used to program SLM generation - in our case with classes that group words (days of the week, months, countries, cities, etc.). In the end, the complete model is output into a file with the .arpa extension (Advanced Research Projects Agency format), whose main use is the exchange of language models. Using this .arpa file, we generate a .grxml file where the corrected probabilities from the input file are recorded several times in the form of a tree, but this time in SRGS (Speech Recognition Grammar Specification) format, which can be read by many commercially available recognition packages. In our case, we will use the grxml of the generated SLM in a Nuance 9 recognition engine.
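The Good-Turing correction described above can be sketched as follows. This is a simplified version of the discounting idea (r* = (r + 1) · N_{r+1} / N_r), not SRILM's exact implementation, and the toy counts are invented for illustration:

```python
from collections import Counter

def good_turing_counts(ngram_counts, k=7):
    """Replace each raw count r <= k by the Good-Turing discounted count
    r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct
    n-grams seen exactly r times; counts above the reliability
    threshold k keep their plain ML value."""
    n_r = Counter(ngram_counts.values())
    adjusted = {}
    for ng, r in ngram_counts.items():
        if r <= k and n_r.get(r + 1, 0) > 0:
            adjusted[ng] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[ng] = float(r)
    return adjusted

# toy counts (invented): two singleton bigrams, one bigram seen twice
adj = good_turing_counts({("top", "up"): 1, ("my", "phone"): 1,
                          ("cancel", "contract"): 2})
```

Here the two singletons are discounted from 1 to 2 · N_2 / N_1 = 1.0 (no change in this tiny example), while the count of 2 is kept as-is because N_3 = 0.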
4.2 The call-routing phase

Once the recognition hypotheses were established - in other words, what was recognized by the recognition engine - we proceeded with the second process: categorization by destination. In the case of this study, as noted earlier, the categorization procedure is implemented using LSA. So we needed an LSA semantic space onto which to project the user's utterance as well as each of the exemplar texts that represent the destinations.

4.2.1 Semantic Space

To obtain that semantic space, we used a training corpus of digitized utterances belonging to several phone companies. The LSA model was trained and the semantic space was created using this corpus. These utterances were extracted using the Wizard of Oz procedure (a technique where users are made to believe that they are interacting with a computer when in fact they are interacting with a person) and the transcriber labeled each of them with the destination to which it was routed. We should note that the labeling was performed using different criteria, was carried out at different times and at different companies, and is not exhaustive. In any case, to provide cohesion for interrelated words, these labels were regarded as additional words, as in Featured LSA [23]. In the end, the LSA space comprises the terms and labels that occur in utterances. In a previous study it was demonstrated that retaining these labels boosts categorization performance and produces a positive interaction with the usage of C-I, even if they are non-exhaustive [1]. It should also be borne in mind that the only utterances that must be exhaustively labeled are those that are part of the destination sample (see section 4.2.2). In the LSA training we used Gallito® (see www.elsemantico.es), a tool that has been used on other occasions for the creation of semantic spaces [26, 27, 28]. The words matching a special stop-list for this domain were eliminated, as were all function words. We also eliminated words that did not occur at least three times.
Some words which are relevant within the telephony corpus were artificially grouped into a single class, for example countries, mobile phone brands or telecommunications companies, substituting them with the name of the class. In the end, we obtained a matrix of 1,421 terms and 34,905 utterances. Next, log-entropy weighting is applied to this matrix. Log-entropy estimates the amount of information that a word carries across the documents where it appears; in this way, the terms that might contain the most information are given a heavier weight. From this calculation we obtained a weighted matrix to which SVD is applied, and the three resulting matrices are reduced to 270 dimensions. We chose this dimensionality based on the assumptions made in previous studies [28], which suggested that the optimal number of dimensions for specific-domain corpora does not have to be extremely low, sometimes even approaching the 300 dimensions recommended by Landauer et al. [11]. In summary, the result of the entire process, the three reduced matrices (term, diagonal and utterance matrices), is the semantic space of the mobile telephony domain that will be used as a basis for projecting the user's utterance as well as the texts that represent the destinations.

4.2.2 Destination Management

The service we wish to evaluate has 29 basic destinations, covering the needs of a telephone company call center. Previous LSA-based Call Routing research compared the vector representation of what the user says (what the ASR module returns) in real time with each of the utterances used in training that are labeled with a destination. After the comparison is performed, the label of the most similar exemplar is selected [4]. This label will be the destination selected.
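The weighting-plus-SVD pipeline just described can be sketched as below, with a toy count matrix and k = 2 instead of the real 1,421 x 34,905 matrix and k = 270; the log-entropy form used here, log(1 + tf) times 1 - H(term)/log(n_docs), is the common formulation and is assumed to match the one applied.

```python
import numpy as np

def log_entropy_weight(counts):
    """Weight a term-by-document count matrix: local log(1 + tf) times a
    global weight 1 - H(term) / log(n_docs), so terms concentrated in
    few documents (carrying more information) weigh more."""
    n_docs = counts.shape[1]
    row_sums = np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    p = counts / row_sums
    safe_p = np.where(p > 0, p, 1.0)          # avoid log(0)
    entropy = -np.sum(p * np.log(safe_p), axis=1)
    global_w = 1.0 - entropy / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None]

def truncated_svd(X, k):
    # SVD of the weighted matrix, keeping only the first k dimensions
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

# toy matrix: 2 terms x 3 utterances (invented counts)
X = log_entropy_weight(np.array([[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]))
U, S, Vt = truncated_svd(X, 2)
```

Note that the second term, spread uniformly across all three documents, receives a global weight of zero: it carries no information for discriminating destinations.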
The philosophy of our system is similar, although three important points should be highlighted. The first is that in our system we do not use all utterances from the training corpus as destination exemplars, but rather focus on a representative sample. In particular, a total of 2,329 calls are part of this sample (which are loaded and transformed into vectors of the semantic space once the router has been launched). This is important because it is not necessary for all the calls from the training corpus to be labeled and participate as exemplars. We manually selected only a few representative calls (an excerpt is shown in Table 1), making the process more economical and the response to changes in the routing model (a change in the definition of the call-center destinations) faster. This is a very important issue for anyone who wants a successive deployment of a Call Routing service [29]: massive annotation used to be very time-consuming. We made this pre-selection very carefully given that, as was pointed out in some studies [8], system performance depends on the quality of these destination exemplars.

ALTASBAJASC  well I'd like to cancel this phone number
ALTASBAJASC  switch to a monthly contract
ALTASBAJASC  to cancel a monthly contract
ALTASBAJASC  activate a number
ALTASBAJASC  cancel the mobile
ALTASBAJASC  I'd like to get a phone with you

Table 1: Excerpt from the destination exemplars.

The second point is that, to decide what the most credible destination or destinations are, we do not simply compare the user utterance individually with each of the utterances in the destination sample. In this system we use a method called average-4, which proved to be better in a previous study [1]. This method averages the four exemplars with the greatest cosine for each destination. The chosen destination is the one where the average of the four exemplars is the highest.
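The average-4 decision rule can be sketched as follows; the destination labels (beyond ALTASBAJASC) and the 2-D exemplar vectors below are invented stand-ins for the real 270-dimensional projections.

```python
import numpy as np

def route_average4(utterance_vec, exemplars):
    """For each destination, average the four highest cosines between
    the utterance vector and that destination's exemplar vectors, and
    return the destinations ranked by that average (best first).
    `exemplars` maps destination label -> list of exemplar vectors."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = {}
    for dest, vecs in exemplars.items():
        sims = sorted((cos(utterance_vec, v) for v in vecs), reverse=True)
        scores[dest] = float(np.mean(sims[:4]))
    return sorted(scores, key=scores.get, reverse=True)

# invented toy exemplars in 2-D ("TOPUPS" is a hypothetical label)
exemplars = {
    "ALTASBAJASC": [np.array(v) for v in
                    [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.0], [0.8, 0.3]]],
    "TOPUPS":      [np.array(v) for v in
                    [[0.0, 1.0], [0.1, 0.9], [0.0, 0.95], [0.2, 0.9]]],
}
ranked = route_average4(np.array([1.0, 0.05]), exemplars)
```

Returning the full ranking, rather than just the winner, also yields the candidate destinations used for disambiguation dialogs.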
In this way, any bias which an anomalous exemplar (one seemingly very similar to what the user said) might introduce is eliminated. The third point is that after converting the user utterance into a vector and selecting the most likely destinations, a list of the first four destinations is returned in descending order of their cosines. When the user is not sent toward the first option, this allows the first destination and the next most likely destinations to be used as candidates, disambiguating with an explicit question. This will depend on the design of the dialogs. What is common to earlier studies is that both the destination exemplars and the user utterance will be converted into vectors that can be interpreted by the LSA space (in other words, use of the semantic vector space will be essential). The effectiveness of the system will depend on the way the destinations and user utterances are converted into vectors. In the following section we explain different ways of doing so.

4.2.3 Construction-Integration Model vs. direct method

There are two ways of representing each of the utterances vectorially in the router, whether the utterances are destination exemplars or user utterances. One is Direct routing, where the utterances are projected onto the latent semantic space without any kind of additional algorithm. This is the standard LSA method for constructing new documents in the vector space, and is known as Folding-in [10]. The second is C-I routing, which, in contrast to Direct routing, has an intermediate step in the construction of both destination exemplars and user utterances. This intermediate step is based on a Construction-Integration network (Figure 1). The importance that a Construction-Integration network might have for routers lies in the fact that the words in an utterance are modulated to their correct meaning, taking into account the entire lexical context, which constrains the final meaning.
This is particularly relevant for words that are ambiguous (such as "card"): the meaning that best matches the context will be given priority, thus avoiding predominant-sense inundation and other biases frequently observed in vector space models [26, 27].

Figure 1: Graphical view of the Call Routing process for Direct routing and C-I routing.

Computational models based on Construction-Integration [2, 3] rest on the idea that the meaning of a text is constructed and integrated in working memory. The mechanism retrieves the content linked to every word of a text from long-term memory, and, once retrieved, ensures that all this content is integrated in real time using a mutual constraint mechanism in working memory. It is therefore a model that seeks to simulate working memory and real-time comprehension mechanisms [30]. Looking at the model details, interpretation of a text is carried out in two phases. In the first phase, all terms related to each of the words that make up the text or phrase are evoked, and a network is constructed from all of them, where each one is joined to all the others - this is known as the Construction phase. In the second phase, known as the Integration phase, each of these terms receives activation from the others in proportion to the similarity they have with each other. After this phase, the most activated terms are those related to the main idea of the text [3, 30]. This takes into account the fact that a text is not the sum of the terms included in it, but rather the integration of all of them into a final idea. Each term is constrained by the meaning of the other terms, thus generating the meaning in real time. This type of mechanism has been used on predicate and metaphorical structures [31, 32, 33] and on structures enriched with syntactic dependency relationships, thus demonstrating that the meaning of complex phrases can be modeled [34].
In operational terms, this mechanism makes it possible to differentially reassign the importance of each term. Ultimately, it is an estimate of the relevance of the utterance terms, as in previous studies [8]. How is this model implemented in our system? For each utterance, whether it is a user utterance or a destination exemplar, a network is constructed based on the Construction-Integration network, whose launch leads to the extraction of new terms (see Figure 2). We might say that these terms produce a better definition of the utterance: an idea common to all the words contained in the original utterance. The procedure, in terms of its functionality, is analogous to that used in a study [35] implemented to improve classical methods for evaluating text with LSA. Figure 2 provides a graphic description of the procedure. Firstly (in the Construction phase), each term of the utterance is compared to all of the terms present in the semantic space, and the 200 neighbors most similar to each of them are extracted (this similarity is calculated using the cosine). A connectionist network is created between all these neighbors (neighbor node layer) and each of the original terms of the utterance (utterance node layer), where the weight of each connection is given by the cosine between the two connected terms (Figure 2). Once the weights of the connections have been assigned, the activation of each node is calculated in the second, Integration phase, based on the connections received (see formula 3). Thus, the greater the weight of the connections received, the greater the activation. The activation function also favors instances where the activation derives from several terms in the utterance, and not just one or two (due to the correction factor δ in formula 3).
a_i = δ_i · Σ_j c_ij    (3) Activation function

where j is the sub-index of the utterance layer, i is the sub-index of the neighbors layer, c_ij is the strength of the connection that node i receives from node j (the latter node belonging to the first layer), and δ_i is a correction factor to avoid unilateral activation (based on the standard deviation of the connections received), as defined in formula 4:

δ_i = 1 / (σ_i / c̄_i + 1)    (4) Correction factor

where σ_i = √(Σ_j (c_ij − c̄_i)² / n), n is the total number of connections received by node i (the number of nodes in the utterance layer), and c̄_i is the mean of all the strengths that node i receives. The cosines are calculated using the previously trained semantic space; in other words, each of the terms to be compared is represented by a vector in this space, and any term vector can then be compared to another term vector using the cosine. Finally, the 20 most highly activated neighbors are chosen from all of those activated in the Integration phase, and a new utterance is constructed with them. In other studies an utterance has been replaced by a more representative one [14], but here the aim is also for this new utterance to contain terms which are closer to the meaning originally intended by the user. These 20 terms are used to form a new pseudodocument, this time using Folding-In (see Figure 1), and they will be compared with the destination exemplars (created in the same way) in order to assign them a destination. As the reader will notice, there are a few differences between the C-I procedure used in this study and the original mechanism proposed by Kintsch [3]. Because call utterances are shorter and simpler than propositions within colloquial language, the algorithm used here is not exactly the original construction-integration algorithm.
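Read this way, the one-shot Integration step (formulas 3 and 4) admits a compact sketch. The following is a minimal illustration in Python/NumPy, not the authors' VB.NET implementation; the matrix layout and the function name are our assumptions:

```python
import numpy as np

def integrate(weights, k=20):
    """One-shot Integration phase over a Construction-Integration net.

    weights[j, i] holds the cosine between utterance term j and neighbor
    term i (the Construction phase would fill this matrix from the 200
    nearest neighbors of each utterance term in the LSA space).
    Returns the indices of the k most activated neighbors, best first.
    """
    w = np.asarray(weights, dtype=float)
    c_mean = w.mean(axis=0)   # mean strength received by each neighbor
    c_std = w.std(axis=0)     # spread of those strengths over utterance terms
    # Correction factor (formula 4): damps neighbors fed by only one or two terms
    delta = 1.0 / (c_std / np.maximum(c_mean, 1e-12) + 1.0)
    # Activation (formula 3): damped sum of the incoming connection strengths
    activation = delta * w.sum(axis=0)
    return np.argsort(activation)[::-1][:k]
```

With this reading, a neighbor supported evenly by every utterance term out-activates one fed strongly by a single term, which is exactly the unilateral activation the correction factor is meant to avoid.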
The integration part proposed by Kintsch is a spreading-activation algorithm which iterates until the net is stable (the cycle at which the change in mean activation falls below a parameterized value), whereas our algorithm is a "one-shot" mechanism: the activation of each node is calculated once, based on the connections received. Another difference from the C-I algorithm as proposed by Kintsch is that we only consider words, not propositions or situations. In any case, note that while the original C-I is more complete and fine-grained, our mechanism is sufficient for our purposes and may be more flexibly programmed, because an OOP (Object Oriented Programming) paradigm has been used, with classes such as net, layer, node and connection, instead of the iterative vector x matrix multiplication in the original (see [39] for details of the original conception). Figure 2: C-I based net implemented in this study. The K most strongly activated neighbors (marked with a square) will be represented in a pseudodocument. 5 Software and Architecture used This section describes the application that we have implemented in our lab and the auxiliary software and technologies that have been used. In the last part, we describe the place of each module in the global architecture. 5.1 IVR Application The IVR Application is implemented using VXML technology (Voice Extensible Markup Language), located on a Tomcat application server (icon D in Figure 3). The server dynamically generates VXML files and sends them to the VXML interpreter, in this case VoxPilot (icon B in Figure 3), as it requests them and according to the application flow. First, the interpreter requests the start and welcome VXML files, then the VXML in which the user is asked to request a service ("say anything"). This second VXML involves voice recognition and therefore requires a grammar - in this case a grxml grammar generated using the SLM.
In this way, Voxpilot sends the user response to the voice recognition application (icon C in Figure 3), along with the route where the grammar is located on the application server. Finally, the ASR module returns the recognized text with its confidence score. This recognized text is sent, in the request for the next VXML located on the application server, to the semantic router (icon E in Figure 3) using the SOAP protocol (Simple Object Access Protocol, a standard protocol that defines how two objects in different processes can communicate by exchanging XML data). The semantic router returns a list with the four most likely destinations. The first destination from the list is inserted into the VXML that is sent once again to Voxpilot for it to run the definitive Call Routing routine. 5.2 ASR Module The recognition engine used is Nuance 9® (icon C in Figure 3), located on the Speech Server®, also by Nuance, with a test license that allows the use of 4 channels in non-production environments. This recognition engine accepts grammars programmed in the standard .grxml (SRGS, Speech Recognition Grammar Specification) format, so it is a good match for our SLM. 5.3 Semantic Router The router is basically dedicated to the task of receiving texts from the application server (icon D in Figure 3) and returning a list with the 4 most likely routes, or simply returning a rejection if no route reaches a confidence threshold. As we explained above, the router application is based on the LSA and C-I framework and was programmed in VB.NET using the object-oriented programming (OOP) paradigm. It is accessible as a web service on a Microsoft IIS server, which offers a series of functions and procedures such as loading the destination exemplars and converting them into vectors, and loading the reference semantic space (previously generated with the Gallito® tool mentioned above).
It also has a configuration file which specifies the acceptance thresholds of the logistic function, the method of converting utterances into pseudodocuments (C-I Routing and its parameters, or Direct Routing) and the classes that will be used according to the training (months, days, mobile phone brands, countries, etc.). The fact that it is implemented as a Web Service means it can be integrated into a SaaS (Software as a Service) structure that allows it to route service utterances in a centralized manner. When the destination has been assigned, this module returns a list with the four most likely destinations and their cosines. Figure 3: Software, technologies, modules and system architecture used in our lab. 6 Evaluation of the system In this section we describe the procedure that we followed to examine the efficiency of the system in different configurations. First we present the variables that were manipulated. Second, we explain the procedure and the method used to test them. 6.1 Dependent variable and independent variables The dependent variable used to assess the efficiency of the system was whether the destination was correctly classified or not. The first independent variable is called Routing Method and has to do with how LSA vectorially represents the test utterances and the utterances that represent the destinations. There are two methods (see section 4.2.3): Direct routing and C-I routing. The second independent variable is the Number of Captures, the number of utterances captured by the voice recognition application that were passed as input to the categorization module. The voice recognition application can be programmed to return only one text containing what it has recognized of the participant's voice message. However, as this captured text derives from a probability-based model, it can also be programmed to return the second or the third most probable text, and so on. Bearing in mind that by combining the first two options we can create a text including more correctly recognized words than using only the first option, this independent variable has two levels: the first recognized utterance and the concatenation of the first and second utterances. We named these two levels Captur1 and Captur2. A third independent variable is Accuracy, which is measured by a value F' (see the method section below). This value F' is broken down into two groups: high and low accuracy. Lastly, another independent variable is the number of tokens, that is, the length in words of the transcriptions of the utterances. Two groups are again formed: short and long utterances. 6.2 Procedure and Method The evaluation method is as follows. 1,872 (randomized) audio files were obtained from real calls to several telecommunications companies. All of these audio files were then transcribed, to create the ideal condition where the ASR module captures the utterance perfectly. Recognition of each of these audio files was forced using the Nuance® ACC_TEST tool. The outputs from this tool provided the first two captures of what had been recognized (the first two outputs from the n-best list). To obtain objective measures of the third independent variable, Accuracy, measured as the deviation between what is recognized and what is transcribed, Information Retrieval (IR) measures were used. Their use in place of WER (Word Error Rate) is justified by previous studies [36]. The IR measures used were Recall, Precision and the combination of both in F'. Recall is the proportion of the transcription that is present in the recognition; Precision is the proportion of the recognition that is present in the transcription. These two measures range from 0 to 1, and are combined in an index, F', the harmonic mean of the precision and recall measures.
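These three measures can be sketched over sets of relevant terms as follows (an illustrative Python sketch; the function name and the set-based simplification, which ignores repeated terms, are ours):

```python
def ir_scores(transcript_terms, recognized_terms):
    """Recall, precision and their harmonic mean F, comparing the ASR
    output against the human transcription over relevant terms only
    (function words and stop-list words are assumed filtered already)."""
    ref, hyp = set(transcript_terms), set(recognized_terms)
    if not ref or not hyp:
        return 0.0, 0.0, 0.0
    shared = len(ref & hyp)
    recall = shared / len(ref)      # share of the transcription found in recognition
    precision = shared / len(hyp)   # share of the recognition found in transcription
    f = 2 * precision * recall / (precision + recall) if shared else 0.0
    return recall, precision, f
```

For instance, a recognition that substitutes one of four relevant terms scores recall = precision = F = 0.75.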
To calculate them, we used the most popular natural language package in the Python environment, the Natural Language Toolkit (NLTK) [37], available at http://www.nltk.org/. Using this tool we extracted the recall, precision and F' of each recognition compared to its transcription, thereby also revealing the recognition loss, and these were introduced as variables in the overall analysis of the router (becoming the Accuracy variable). It is important to note that both precision and recall were calculated using only relevant terms, meaning those that will be considered by the router; function words and words from the stop-list, for example, were excluded from the calculations. At the same time, we also counted the number of tokens in each transcription to obtain a measure of the fourth independent variable, which gave us an idea of the length of the phrases uttered by users to make system requests. In the second phase the router is forced to categorize the text of each audio clip, beginning with the first recognition capture (Captur1), followed by the first two captures concatenated together (Captur2). This operation is repeated twice, once with the router working in Direct Routing mode, and once with C-I Routing. As a result we obtain all the combinations of independent variables (Routing Method x Number of Captures to route). In addition, the router is forced to categorize the transcriptions, simply to have a baseline, but without introducing these results into the analysis matrix. The Call Routing is considered correct if the first destination returned by the router coincides with the ideal destination previously assigned to the utterance by a human. With this data we now have all the necessary conditions and all the grouping variables. A matrix is formed and we proceeded to carry out a Repeated Measures ANOVA (Routing Method x Number of Captures to route) with two grouping variables (Accuracy - high or low, and N° of tokens - long or short).
In summary, a 2x2x2x2 ANOVA. The results will be extracted from this analysis, and their implications will be examined in the discussion. At the end of the analysis, a logistic function is proposed to create acceptance zones in the router: a first zone where the first hypothesis returned is accepted as valid and routed to that destination; a second, disambiguation zone, where we might disambiguate between the four hypotheses returned by the router, asking the user to choose; and a third, rejection zone, where any hypothesis is assumed to be erroneous and is not routed. The results of this mechanism will be shown as the percentage of correct call routings. 7 Results In this section we present the results of the study. First we present the percentages of correctly routed calls under all possible conditions, including those which involve voice recognition. In addition, we also present the results of an ANOVA which displays the interactions between the variables involved: Routing Method, Accuracy, number of tokens and Number of Captures. Finally, the results of introducing an acceptance criterion by means of a logistic function are presented, as well as the application of various confidence levels for a potential disambiguation strategy. 7.1 Performance of the ASR module (F') As suggested above, one of the proposed means of measuring the ASR module's performance is applying Information Retrieval indices. To be precise, our system yields the following indices: Recall=.912, Precision=.959, F'=.932. Very often, many categories are interrelated or overlap, such that not even a human classifier could be sure which category to assign. Since, even where both may be correct, the choice between one or the other is binary and exclusive, the results may be understated.
Some evaluations of Call Routing have corrected for this by using not only binary coincidence but also ratings of the fitness of the destination returned, or a scale of possible success [38]. We have not used this type of evaluation due to the high cost involved, which means that our results carry less information and may be understated. 7.2 Decrease in performance caused by voice recognition Another of the relevant issues to be examined is the lowered performance of the whole system when speech recognition output utterances are routed rather than transcriptions. If we consider only the router's first hypothesis (Table 1), the results show that between the best condition with transcriptions (.75 with C-I) and the best condition with recognition (.67, also with C-I), we lose 8 percentage points in the rate of correct choices. If we consider the first four hypotheses returned (Table 2), between the best condition with transcriptions, also using C-I (.91), and the best with recognition, again using C-I (.85), the difference is reduced to 6 percentage points. We should remember, though, that our application has not undergone any optimization process for either voice recognition or the model of categories, and that the main aim of this study is to check the scenarios in which C-I might behave more productively. We will look at this in the following section.

1st hypothesis    Trans   Captur1   Captur2
No C-I             .69      .63       .62
C-I                .75      .67       .66

Table 1: Percentages of correctly routed calls, considering only the first hypothesis returned by the router. Captur1 and Captur2 represent the conditions where one or two speech recognition hypotheses respectively are used.

Accumulated first 4 hypotheses    Trans   Captur1   Captur2
No C-I                             .89      .83       .83
C-I                                .91      .85       .84

Table 2: Percentages of correctly routed calls, considering the four hypotheses returned by the router. Captur1 and Captur2 represent the conditions in which one or two speech recognition hypotheses respectively are used.
7.3 Results of the ANOVA Another of the aims of this study was to analyze the performance of each method in different scenarios - specifically bearing in mind the Accuracy of recognition measured with F', and the number of tokens in user utterances. For this purpose the ANOVA described in the method section was carried out, extracting both main effects and interactions. Figure 4: Interaction between Routing Method and Number of Tokens. Three significant main effects were found. The first of them relates to Routing Method (F(1,1735)=18.13, MSE=.087, p < .001): C-I is better than Direct Routing as a general effect. The second is the N° of tokens (F(1,1735)=34.07, MSE=.637, p < .001): short phrases boost the router's effectiveness. The third is F' (F(1,1735)=128.14, MSE=.637, p < .001): the Accuracy of voice recognition also increases the router's performance. Figure 5: Interaction between Routing Method and Accuracy. Given that we also found significant interaction effects involving the variables above, we will focus on these, leaving the main effects as supplementary information. Firstly, we find a two-way interaction effect between the factors Routing Method and N° of tokens (F(1,1735)=4.45, MSE=.087, p = .035). The loss of effectiveness observed when the user utters long phrases is greater with Direct Routing than with C-I (Figure 4). Both methods show reduced effectiveness with long utterances, but C-I less so, hence we could say it has a corrective effect. We also found a significant effect for the two-way interaction between Routing Method and Accuracy (F(1,1735)=15.41, MSE=.087, p < .001). C-I shows beneficial effects if the quality of the recognition (measured by F') is good (Figure 5); otherwise, C-I works the same as Direct Routing. Lastly, we found a significant effect for the two-way interaction between Number of Captures and Accuracy (F(1,1735)=8.60, MSE=.087, p = .003).
The results of this last interaction show how concatenating the first two speech recognition captures is advantageous compared to using only one if the quality of the recognition is low (Figure 6). Figure 6: Interaction between Number of Captures and Accuracy. 7.4 Acceptance levels Beyond straight performance data - in other words, the percentage of occasions on which a hypothesis returned is correct - we must introduce acceptance levels into the process. These maximize the correct choices and minimize errors by using a confidence threshold above which the route proposed by the system is accepted (formula 5). Previous studies have often introduced objective acceptance criteria into the process, such as logistic acceptance functions [4]. In our case we use the same parameters used by a similar study [1] in the logistic function, as they gave good results. In this function two parameters act as predictor variables, and the correct choice / error rate is our dependent variable. The parameters are as follows: firstly we use the cosine from the first hypothesis (the average of the 4 largest cosines with the exemplars of this destination); as the second parameter, we use the difference between this first cosine and the cosine from the second hypothesis. The latter parameter takes into account the level of certainty about whether the first hypothesis should be returned rather than the second.

P(Y = 1) = 1 / (1 + e^(2.17 − 3.02·Δcos − 2.97·cos))    (5)

where cos is the cosine of the first hypothesis and Δcos is the difference between this cosine and that of the second hypothesis. With the output from the logistic function, we define three zones of acceptance: the first (0.5 - 1), above which we accept the label directly; the second (0.4 - 0.5), where we disambiguate (using a question) between the four destinations proposed by the router (those that have the greatest cosines); and the third (0 - 0.4), where the hypothesis is directly rejected. In this way, not only does the percentage of correct choices fall within the acceptance zone, but calls in the intermediate zone can also be retrieved.
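Under our reading of formula 5 (with cos the first-hypothesis cosine and Δcos the margin over the second hypothesis), the scoring and zoning can be sketched as follows; the function name and zone labels are ours:

```python
import math

def acceptance(cos1, dcos):
    """Score a routing hypothesis with the logistic function of formula 5
    and map it onto the three acceptance zones described in the text.

    cos1: cosine of the first hypothesis; dcos: difference between the
    first and second hypotheses' cosines (the certainty margin)."""
    p = 1.0 / (1.0 + math.exp(2.17 - 3.02 * dcos - 2.97 * cos1))
    if p >= 0.5:
        return p, "accept"          # route directly to the first hypothesis
    if p >= 0.4:
        return p, "disambiguate"    # ask the user to choose among the top 4
    return p, "reject"
```

A confident first hypothesis (high cosine, clear margin over the runner-up) lands in the acceptance zone; a weak one is rejected without routing.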
Disambiguation mechanisms, such as preparing questions using the four hypotheses returned, could be built in this zone. For example, if the logistic function returned 0.46, the hypothesis returned would not be rejected, but rather disambiguated using the four hypotheses returned by the router, in the hope that the correct route would be among them, which is very likely. The results (Table 3) show that this disambiguation mechanism would be productive. In the acceptance zone (0.5 - 1), the percentage of correct destinations would be 76.54%. If we also disambiguated in the next zone of acceptance (0.4 - 0.5), we would guarantee that 81% of the time the correct destination would be among the four hypotheses used to ask the question. In addition, our data would contain 215 calls that would be rejected directly as they are in the rejection zone (0 - 0.4). This amounts to only 12% of the calls.

Zone                           Error           Correct choice
                               N      %        N       %
0.4 to 0.5 (disambiguation)    39     18.14    176     81.86
0.5 to 1 (1st hypothesis)      308    23.46    1005    76.54

Table 3: Correct choice rates in the two zones of acceptance. 8 Discussion The overall results of this system are encouraging. Considering that it is a pilot study performed without optimizing voice recognition, and using a hypothetical customer service operation, the results are very satisfactory. To offer some raw data, 67% of utterances were assigned a destination on which router and human agreed (with spontaneous speech recognition and categorization). As explained above, this figure is conservative if we consider the possibility that the destinations overlap or that there is ambiguity in the human assignment of an utterance to a destination. In addition, by introducing an acceptance criterion and using disambiguation mechanisms, a large proportion of the remaining utterances (81% of the total utterances) can be assigned to a correct destination.
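As a check on Table 3, the correct-choice rate within each zone follows directly from the raw call counts (a trivial sketch; the function name is ours):

```python
def zone_rate(errors, correct):
    """Percentage of correct choices within an acceptance zone,
    computed from the raw call counts reported in Table 3."""
    return round(100.0 * correct / (errors + correct), 2)
```

Here zone_rate(39, 176) reproduces the 81.86% of the disambiguation zone, and zone_rate(308, 1005) the 76.54% of the direct-acceptance zone.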
In any case, since this is an academic study, our aim was to focus on checking the performance of certain methods in certain scenarios, and not so much on the overall results, which could be improved upon. Our system brings several features together. One of them is that the router is based on LSA, a technique that is independent of the subject domain and is also fairly economical in terms of implementation. We have inserted an additional layer in this router, based on the assumption that the meaning of an utterance is not the sum of its words; rather, it is important to take into account how these words activate others that are not explicit but participate in the final meaning. Whilst it is not identical, this layer is based on the Construction-Integration model, which makes the same assumption. We have also decided to summarize the final routing destinations in a file of sample utterances that represent them (golden destinations), so that a change in the organization of destinations does not require reclassifying all utterances in the training corpus. This provides added flexibility to the system and reduces the response times to changes. All this has been integrated with a voice recognition engine (supported by a Stochastic Language Model or SLM), whose outputs are passed to the router in order to assign destinations. In order to maximize the number of correctly recognized words, not only the first hypothesis output from the ASR module is routed: there is an option whereby the concatenated first and second hypotheses can also be routed. This was tested on audio clips of real calls, with an experimental design that allowed us to evaluate performance in scenarios depending on the length of utterances and the accuracy of voice recognition. The analysis performed yielded various findings. Firstly, C-I is better than the direct method, in particular when it comes to cushioning the drop in performance in certain scenarios.
It is true that in conditions where recognition has greatly declined the contribution of C-I is not important, but when recognition is reasonably good, the C-I method seems to behave best with long utterances. Although they did not use a model like C-I, some previous studies have likewise found benefits for long utterances when the speech recognition outputs are corrected by means of similarities extracted from LSA [24]. It should come as no surprise that in our system the C-I method performed best for long utterances, given that the original C-I model was created to account for longer propositions [2] and that the presence of a number of terms in the phrase facilitates the building of a context. This helps to over-weight the words that are within this context and to under-weight those that are not - for example, substitution or insertion errors. It also biases the meaning of ambiguous terms toward a meaning coherent with this context. The great contribution of this type of model is that it objectively mimics the process carried out in working memory while processing texts. As a text is being read or listened to, it is available in working memory, which retrieves content related to each word in the text from long-term memory; this is the construction phase. In the integration phase, a mutual constraint mechanism is applied to this linked content [34] in order to extract the key idea. It is therefore only to be expected that the higher the number of words in working memory (up to a threshold for simultaneous processing), the more data will be available to carry out this integration more accurately. Secondly, although this is approximate, complementary data, we have also tried using two voice recognition captures in the Call Routing process. The results are modest, although they show a subtle trend: when recognition is expected to decline, taking two hypotheses rather than one tends to improve the results.
We sense that occasionally the errors committed in the first hypothesis are not committed in the second, and vice-versa. Thirdly, we have seen that introducing a logistic function with certain parameters helps to form acceptance criteria, above which correct choices are maximized, either by correctly rejecting a label or by accepting a label that later proves to be correct. By doing so, the results rise to 76.54% accuracy (either correct choices or correct rejections). We have also shown that improved performance results from setting up three zones of acceptance: the first (0.5 - 1), above which we accept the label directly; the second (0.4 - 0.5), where we disambiguate (in the form of a question) between the four destinations proposed by the router; and the third (0 - 0.4), where the label is rejected directly. In this way, we achieve 79.3% correct Call Routing, and the correct destination of the utterances that remain in the intermediate zone for disambiguation would be among the four labels proposed in the question (the four returned by the router) in 81.56% of cases. Thus we have not only the percentage of correct choices in the acceptance zone, but also those that arise from disambiguation. It is clear that this requires a cost in terms of disambiguation question design, and a cost in terms of satisfaction, but it might be a good way to gradually implement the system. There is one last thing to note: this system's adaptability to change. On the one hand, although in this study we have used labels to identify the destinations for utterances in the training corpus, acceptable results can be obtained without them. We could even extend the training corpus, combining labeled with unlabeled parts, or parts labeled using different criteria. If we did these things, we would find a small, controlled reduction in performance, but not an abrupt drop [1].
On the other hand, the only labels that need to be exhaustive are those that identify the utterances that act as destination exemplars. These exemplars form part of a chosen sample of training utterances, but represent only a small percentage of them. Since there are few of them, they can be examined, changed or expanded quickly. Thus any change in the routing model (for example, a re-dimensioning of skills) can be dealt with within acceptable response times, and with no need to retrain all utterances, which would slow down the process. 9 Conclusion In view of these results, the possibility of using psycholinguistic models in information retrieval or utterance categorization systems is encouraging. Simulating human processing is not an easy task; it is not even easy to describe, but it is useful to reflect upon in order to find possible improvements to current systems. In summary, LSA represents a very flexible and economical means of implementing Call Routing, and also allows us to explore algorithms and methods derived from research in Cognitive Science that may prove very promising. In this case, we have presented a system that combines LSA with a network based on Construction-Integration. The results obtained are good, although fine tuning is needed to optimize the voice recognition process, as well as a more coherent organization of the destinations (as would be the case in a working production system). In any case, what has proved most interesting about this study is not the overall results in absolute terms, but rather testing the C-I layer and how it works. This has offered quite promising results, demonstrating superiority in some scenarios and stability in others. We believe that establishing links between computational science and psycholinguistics can help to find ways to optimize current categorization systems.
Acknowledgements The authors wish to thank Airenas Vaiciunas from the Department of Applied Informatics at Vytautas Magnus University (Lithuania) for his useful comments about SLMs and also to Semantia Lab (www.semantialab.es) for supporting the logistic of this research. References [1] Jorge-Botana, G., Olmos, R. & Barroso, A. (2012). The Construction-Integration algorithm: A means to diminish bias in LSA-based Call Routing. International Journal of Speech Technologies, 15(2): pp. 151-164. [2] Kintsch, W. (1998). The use of knowledge in discourse processing: A construction integration model. Psychological Review, 95: 163-182 [3] Kintsch, W., & Welsch, D. The construction-integration model: A framework for studying memory for text. In W.E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock. Hillsdale, NJ: Erlbaum. 1991: 367-385. [4] Chu-Carroll, J., y Carpenter, B. (1999). Vector-based natural language call routing, Computational Linguistics, 25(3): pp. 361-388. [5] Brumby, D. & Howes, A. Good enough but Ill just check: Web page search as attentional refocusing. Procedings of the 6th International Conference on Cognitive Modelling, 46-51. 2004. [6] Brumby,D. & Howes, A. Interdependence and past experience in menu choice assessment. In: Alterman, R., Kirsh, D. (Eds.), Proceedings of the 25thAnnualConference of the Cognitive Society . Lawrence Erlbaum Associates, Mahwah, NJ, p. 1320. 2003. [7] Jurafsky, D. & and Martin, J. Speech and Language Processing: A^n Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall. 2009. [8] Gasanova, T., Zhukov E., Sergienko, R., Semenkin, E. & Minker, W. A Semi-supervised Approach for Natural Language Call Routing. Proceedings of the 14th Annual Meeting of the Special Int^erest Group on Discourse and Dialogue (SIGDIAL). August 2013. Metz, France. [9] Sarikaya, R., Hinton, G. & Ramabhadran, B. 
Deep Belief Nets for Natural Language Call-Routing. In ICASSP-2011.
[10] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, pp. 391-407.
[11] Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, pp. 211-240.
[12] Jones, M.P., & Martin, J.H. (1997). Contextual spelling correction using latent semantic analysis. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 163-176.
[13] Bellegarda, J.R. (2000). Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8), pp. 1279-1296.
[14] Chien, J.-T., Wu, M.-S. & Peng, H.-J. (2004). Latent semantic language modeling and smoothing. International Journal of Computational Linguistics and Chinese Language Processing, 9(2), pp. 29-44.
[15] Pucher, M. & Huang, Y. (2006). Combination of latent semantic analysis based language models for meeting recognition. In Proceedings of the Second IASTED International Conference on Computational Intelligence, San Francisco, California, USA, November 20-22, 2006.
[16] Lim, B.P., Ma, B., & Li, H. (2005). Using Semantic Context to Improve Voice Keyword Mining. In Proceedings of the International Conference on Chinese Computing (ICCC 2005), 21-23 March, Singapore.
[17] Shi, Y. (2008). An Investigation of Linguistic Information for Speech Recognition Error Detection. Ph.D. thesis, University of Maryland, Baltimore County, Baltimore.
[18] Miller, G.A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), pp. 39-41.
[19] Fellbaum, C. (Ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[20] Wandmacher, T. & Antoine, J.
Methods to Integrate a Language Model with Semantic Information for a Word Prediction Component. In Proceedings of EMNLP-CoNLL, pp. 506-513. 2007.
[21] Cox, S. & Shahshahani, B. (2001). A Comparison of some Different Techniques for Vector Based Call-Routing. In Proceedings of the 7th European Conference on Speech Communication and Technology, Aalborg.
[22] Li, L. & Chou, W. (2002). Improving latent semantic indexing based classifier with information gain. In Proceedings of the International Conference on Spoken Language Processing (ICSLP-2002), pp. 1141-1144. September 16-20, 2002, Denver, Colorado, USA.
[23] Serafin, R. & Di Eugenio, B. (2004). FLSA: Extending Latent Semantic Analysis with features for dialogue act classification. In Proceedings of ACL-04, 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July.
[24] Tyson, N. & Matula, V.C. (2004). Improved LSI-Based Natural Language Call Routing Using Speech Recognition Confidence Scores. In Proceedings of EMNLP.
[25] Stolcke, A. (2002). SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver, Colorado, September 2002.
[26] Jorge-Botana, G., Leon, J.A., Olmos, R. & Hassan-Montero, Y. (2010). Visualizing polysemy using LSA and the predication algorithm. Journal of the American Society for Information Science and Technology, 61(8), pp. 1706-1724.
[27] Jorge-Botana, G., Olmos, R. & Leon, J.A. (2009). Using LSA and the predication algorithm to improve extraction of meanings from a diagnostic corpus. Spanish Journal of Psychology, 12(2), pp. 424-440.
[28] Jorge-Botana, G., Leon, J.A., Olmos, R., & Escudero, I. (2010). Latent Semantic Analysis Parameters for Essay Evaluation using Small-Scale Corpora. Journal of Quantitative Linguistics, 17(1), pp. 1-29.
[29] Suendermann, D., Hunter, P. & Pieraccini, R. (2008). Call Classification with Hundreds of Classes and Hundreds of Thousands of Training Utterances
and No Target Domain Data. In Perception in Multimodal Dialogue Systems: Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems (PIT 2008), Kloster Irsee, Germany, June 16-18, 2008. Springer-Verlag, pp. 81-87.
[30] Kintsch, W., Patel, V., & Ericsson, K. A. (1999). The role of long-term working memory in text comprehension. Psychologia, 42, pp. 186-198.
[31] Kintsch, W. (2000). Metaphor comprehension: A computational theory. Psychonomic Bulletin & Review, 7, pp. 257-266.
[32] Kintsch, W. (2001). Predication. Cognitive Science, 25, pp. 173-202.
[33] Kintsch, W., & Bowles, A. (2002). Metaphor comprehension: What makes a metaphor difficult to understand? Metaphor and Symbol, 17, pp. 249-262.
[34] Kintsch, W. (2007). Meaning in context. In T. K. Landauer, D. McNamara, S. Dennis & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum, pp. 89-105.
[35] Olmos, R., Leon, J.A., Jorge-Botana, G., & Escudero, I. (2009). New algorithms assessing short summaries in expository texts using Latent Semantic Analysis. Behavior Research Methods, 41, pp. 944-950.
[36] McCowan, I., Moore, D., Dines, J., Gatica-Perez, D., Flynn, M., Wellner, P. & Bourlard, H. (2005). On the use of information retrieval measures for speech recognition evaluation. Technical Report, IDIAP.
[37] Bird, S. & Loper, E. (2004). NLTK: The Natural Language Toolkit. In Proceedings of ACL-04, 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July.
[38] Malmström, P. E. (2010). Methods for Evaluating a Natural Language Call Routing Application: A Case Study. Master's thesis, Uppsala University, Department of Linguistics and Philology.
An Exact Analytical Grossing-Up Algorithm for Tax-Benefit Models

Miroslav Verbič, Mitja Čok and Tomaž Turk
University of Ljubljana, Faculty of Economics, Kardeljeva ploščad 17, 1000 Ljubljana, Slovenia

Keywords: deductive data imputation, household budget survey, microsimulation, tax-benefit models

Received: November 27, 2014

In this paper, we propose a grossing-up algorithm that allows for gross income calculation based on tax rules and observed variables in the sample. The algorithm is applicable in tax-benefit microsimulation models, which are mostly used by taxation policy makers to support government legislative processes. Typically, tax-benefit microsimulation models are based on datasets where only the net income is known, though data about gross income are needed to successfully simulate the impact of taxation policies on the economy. The algorithm that we propose allows for an exact reproduction of a missing variable by applying a set of taxation rules that are known to refer to the variable in question and to other variables in the dataset during the data generation process. Researchers and policy makers can adapt the proposed algorithm with respect to the rules and variables in their legislative environment, which allows for complete and exact restoration of the missing variable. The algorithm combines partial analytical solutions with a trial-and-error approach to find the initial true value. Its validity was proven on a set of tax rule combinations, at different levels of income, that are used in contemporary tax systems. The algorithm is generally applicable, with some modifications, for data imputation on datasets derived from various tax systems around the world.

Povzetek: The paper presents a grossing-up algorithm that enables the calculation of gross incomes from net incomes under the broad set of tax rules of different tax systems. The algorithm makes it possible to reproduce missing variables and is widely applicable in microsimulation modelling.

1 Introduction

There are various techniques for data imputation, which give the researcher an opportunity to remedy the situation when the dataset is not complete. This does not come without costs, since data imputation can easily introduce biased parameter estimates in statistical applications [1, 2] or in other domains [3, 4]. Imputation techniques rely on deterministic and stochastic approaches, mostly under the assumption that the variable in question is in some way related to other variables under investigation. In this paper, we explore a case of the deductive approach [5]: the possibility of estimating a missing variable by applying a set of rules that are known to refer to the variable in question and to other variables in the dataset. This set of rules might be enforced in various contexts, for instance by legislation, government policy, or other institutional or social constraints. If there is a consistent set of rules which are enforced in practice, and the rules are comprehensive, the researcher can develop a formal algorithm with respect to the rules and the variables in the dataset, which allows for the data imputation of the missing variable. Let us consider the case of household budget survey (hereinafter HBS) datasets. HBS surveys are implemented at the national level of EU member states [6]; taxpayers report their net income from different income sources (e.g. wages, rents, pensions), as well as socio-economic data, which enable estimations of tax allowances and tax credits. HBS datasets are valuable inputs to many microsimulation tax-benefit models. Such models are standard tools in academia and in the financial industry, and for underpinning everyday policy decisions and government legislative processes [7, 8, 9, 10].
Gross income represents the starting point for any tax simulation (including tax-benefit modelling), but HBS datasets usually report only net amounts. As noted in [7], one possibility for generating gross income is the statistical approach based on information on both net and gross income. Using this information, a statistical model can be developed that yields estimates of net/gross ratios. These estimates are then applied to net incomes in order to compute gross amounts. The second known technique is the iterative algorithm that exploits the tax and contribution rules already built into tax-benefit models to convert gross income into net income [8, 9]. The procedure takes different levels of gross income for each observation, applies the tax rules to them, calculates the net income, and compares it with the actual net income, iterating until the gross income fits the actual net income within approximation limits. Both techniques, namely the statistical approach and the iterative algorithm, give gross income values that are estimates, not the actual gross income values. The task is not trivial, since modern tax systems include various rules for taxation and their combinations, and they usually involve a bracketing system for one or more parameters. The involvement of bracketing systems (and especially their combination) means that the function which calculates net income from gross income is not analytically invertible. In this paper, we present a solution to this problem, namely an algorithm that enables a full restoration of the gross income value. The algorithm includes a set of analytical inversions combined with a trial-and-error approach to deal with combinations of bracketing systems. The proposed algorithm allows for the calculation of gross income from net income for a broad set of taxation possibilities, where only information on net income is available, along with information on tax reliefs.
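The iterative technique described above can be sketched as a bisection search on gross income. This is a minimal illustration, not the paper's method: the function names and the flat-rate tax rules inside `net_from_gross` are illustrative assumptions.

```python
# Sketch of the iterative (approximate) grossing-up: bisection on gross
# income until the implied net income matches the observed one.
# The flat-rate tax rules below are illustrative assumptions.

def net_from_gross(gross, ssc_rate=0.22, pit_rate=0.25, allowance=3000.0):
    ssc = ssc_rate * gross                      # social security contributions
    base = max(gross - ssc - allowance, 0.0)    # personal income tax base
    return gross - ssc - pit_rate * base

def gross_from_net_iterative(net, tol=1e-6):
    lo, hi = net, 10.0 * net     # gross income lies somewhere above the net
    while hi - lo > tol:         # net income is increasing in gross income here
        mid = (lo + hi) / 2.0
        if net_from_gross(mid) < net:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Unlike the analytical algorithm proposed in the paper, this only yields an approximation within `tol`, which is exactly the limitation the exact grossing-up algorithm removes.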
The algorithm is applicable in cases of proportional and progressive tax schedules of personal income tax (hereinafter PIT) and social security contributions. It also covers tax allowances as well as tax credits, and is thereby generally applicable to contemporary tax systems around the world. The validity and accuracy of the proposed algorithm were tested by applying it to a synthetic sample of taxpayers under an artificial system of PIT and social security contributions. A comparison of the gross income calculated from net income using the proposed technique with the initial gross income demonstrates the complete accuracy of the algorithm. The rest of the paper is organized as follows. In Section 2, we analyse taxation rules that are used in contemporary taxation systems. The analysis is the basis for the formalization of the imputation algorithm, which is explained in Section 3, including detailed solutions and proofs for various combinations of tax rules and bracketing systems. A test of the validity and accuracy of the proposed algorithm is presented in Section 4. In the Conclusion, the proposed algorithm is presented in its full form, which can be directly applied in practice.

2 Analysis of taxation rules

Gross income is the starting point for the taxation of personal income, for which we can distinguish three basic approaches [11, 12]: comprehensive income tax, dual income tax, and flat tax. Under a comprehensive income tax system, all types of labour and capital income are taxed uniformly by the same progressive tax schedule. A dual income tax system retains progressive rates on labour income, while introducing a proportional tax rate on capital income, e.g. the Scandinavian dual income tax [13]. The third option, which has dominated income tax reforms in Eastern Europe [14, 15], is the flat-tax concept, although it is noted that this concept has not been implemented in any country in Western Europe [6].
Hereby we follow the most comprehensive procedure for the taxation of gross income, which is presented in Table 1 and includes a combination of progressive tax schedules and flat rates, with the addition of tax allowances and tax credits.

   Gross income
 − Social security contributions (a)
     ■ determined by the social security contributions schedule
     ■ set as a proportion of gross income
     ■ given in absolute amounts
 − Other costs related to the acquisition of net income
     ■ determined by a schedule
     ■ set as a proportion of gross income
     ■ given in absolute amounts
 − Tax allowances
 = Personal income tax base
 × Personal income tax rate (b)
 = Initial personal income tax
 − Personal income tax credit
 = Final personal income tax

 Net income = Gross income − Social security contributions − Final personal income tax

 (a) Employee social security contributions
 (b) Either a single (flat) tax rate or set by the tax schedule

Table 1: General procedure for taxing gross income.

Table 1 contains the general procedure for the taxation of gross income. From gross income, employee social security contributions and other costs related to the acquisition of income (e.g. travel allowances or standardized costs set as a proportion of gross income) are deducted. Further, the tax allowances are subtracted and the tax base is obtained, which is subject to a PIT calculation using the tax schedule or a proportional (flat) tax rate. In this way, the initial PIT is calculated, which may be further reduced by a tax credit in order to calculate the final PIT and the net income. From the taxation point of view, other costs related to the acquisition of income have consequences identical to social security contributions or tax allowances. Therefore, our further development implicitly incorporates these costs into the concepts of social security contributions and tax allowances.
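The sequence in Table 1 can be sketched as a small forward-taxation function. This is a hedged illustration with flat-rate components only; every numeric parameter and the function name are illustrative assumptions, not values from the paper.

```python
# Sketch of the Table 1 sequence: gross income -> SSC -> other costs ->
# allowances -> PIT base -> initial PIT -> credit -> final PIT -> net income.
# All parameter values are illustrative assumptions.
def net_income(gross, ssc_rate=0.22, other_costs=0.0, allowance=3000.0,
               pit_rate=0.25, tax_credit=0.0):
    ssc = ssc_rate * gross                                   # social security contributions
    base = max(gross - ssc - other_costs - allowance, 0.0)   # personal income tax base
    pit_initial = pit_rate * base                            # initial personal income tax
    pit_final = max(pit_initial - tax_credit, 0.0)           # credit capped at initial PIT
    return gross - ssc - pit_final                           # net income
```

For example, with these illustrative rates a gross income of 10000 yields 10000 − 2200 (SSC) − 1200 (PIT) = 6600 net.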
When schedules are applied, the PIT schedule or social security contributions schedule consists of a number of tax brackets with different marginal tax rates. The amount of PIT is calculated from the tax base according to the PIT schedule. Likewise, the amount of social security contributions is determined by gross income and the social security contributions schedule. In general, at the annual level tax bases from different income sources are summed up into a single tax base, which is subject to a single-rate schedule, and then the final annual PIT is calculated. An alternative option is a dual-tax system, where the PIT is calculated separately for different income sources (multiple-rate schedule). The procedure from Table 1 covers the existing tax systems to a great extent. In several OECD countries [16], the employee social security contributions are determined by a schedule (e.g. Austria, France) or set as a proportion of gross income (e.g. Spain, Norway), while social security contributions set as absolute amounts are not very common and can be found, e.g., in Slovenia for certain categories of the self-employed. The algorithm applies the logic of social security contributions to the other costs related to the acquisition of net income (e.g. costs connected with real estate maintenance in the case of taxing income from rents) and tax allowances (e.g. allowances for children or for housing loan interest), which are found across tax systems. The algorithm also covers the case of a social security ceiling (e.g. in Austria and Germany). Regarding the calculation of PIT, the algorithm covers the prevailing progressive PIT schedule, as well as flat-tax systems (e.g. in Hungary or Bulgaria). However, the algorithm (more precisely, the equations that cover specific combinations of tax parameters) has to be adapted to certain country specifics, which are not explicitly set out by the procedure from Table 1.
For example, if social security contributions are not included in the PIT base, the gross income should be calculated by the algorithm assuming that social security contributions are zero. Another example refers to the above-mentioned social security contribution ceiling. In this case, a zero-rate bracket of the social security contribution schedule above the set ceiling should be applied in the equation that suits the specific combination of tax parameters of the particular country. The algorithm hereinafter is derived for the case where there is only one PIT instrument and one SSC instrument, thus two instruments in total. However, in actual fiscal systems there are cases where two or more PIT or SSC instruments are applied to a single income source at the same time. In these cases, the net-to-gross conversion is more complicated, since two (or more) PIT or SSC instruments must be "compressed" into a single PIT or SSC instrument. During the year, when a particular income source is paid out, the advance (in-year) PIT is usually paid at the time of disbursing the income source. This advance PIT is taken into account once the final annual PIT is calculated (i.e. the advance PIT is consolidated with the annual PIT). This procedure is called withholding. Understanding the mechanism of advance (in-year) PIT is important when we are dealing with survey data, such as HBS datasets. In a typical survey, respondents report their net income from different income sources for a certain period of the year, when their income sources are only subject to advance (in-year) PIT. In order to calculate the overall annual gross income, the reported net income from different income sources should first be grossed up using algorithms that take account of the various rules of advance (in-year) PIT and social security contributions for each income source separately.
The focus of our paper is the development of these grossing-up algorithms for different income sources. Once the grossed-up amounts from different income sources are calculated, they can be summed up into the overall taxpayer's annual gross income, which is the starting point for building a microsimulation model. The calculation of gross income from the net income of a single income source (e.g. the calculation of gross wages from given net wages) thus depends on the different combinations of tax parameters from Table 1, which are described in detail as an algorithm in this paper. For example, in a case when gross wages are subject to: (a) a progressive PIT schedule, (b) social security contributions set as a proportion of the gross wage with a ceiling, (c) tax allowances, and (d) no tax credits (e.g. the in-year taxation of wages in Croatia), then equation (9) should be applied. Since a ceiling on social security contributions is set, the applied value of the social security contribution rate above the ceiling should be zero. Table 1 can be transformed into the following expression:

N = G − S − PIT,   (1)

where N and G represent net and gross income, respectively, S is the sum of social security contributions, and PIT is the personal income tax. Social security contributions S are a function of gross income. Similarly, PIT is a function of the tax base, which is the difference between gross income (reduced by social security contributions) and tax allowances, TA. This can be generalized as follows:

N = G − f_S(G) − f_PIT(G − f_S(G) − TA).   (2)

Function f_S(G) can be defined in practice in different ways. A common approach is to use a schedule system, but it can also be defined as a proportion of gross income or as an absolute amount. In practice, function f_PIT(G − f_S(G) − TA) is usually defined by a schedule system (different from the schedule system for social security contributions).
As mentioned, function f_PIT can also incorporate the concept of a tax credit. Our task is to estimate gross income G in expression (2) from the known values of N and TA and from a set of constraints that are usually given by the social security contributions and PIT schedule systems, or by other legislative rules. The combination of two schedule systems makes solving equation (2) for G particularly challenging. The solution we propose in this paper has a trial-and-error nature. The idea is to prepare a set of all possible (PIT and social security contribution) bracket combinations. Then, for each taxpayer, we calculate "candidate" gross income values for each bracket combination, calculate the net incomes from these candidate gross incomes, and compare the results to the starting value of net income. The gross income candidate whose implied net income equals the starting net income is the true gross income value. The fit is exact, i.e. we find the actual gross income in a non-iterative way. The following section describes the construction and design of the procedures we propose to deal with different income sources taxed by different rules. First, the general setup of the grossing-up algorithm is explained. Sections 3.1 and 3.2 then set out a detailed examination of various taxation rules for social security contributions and tax crediting, together with the proposed grossing-up procedures for specific tax rule combinations.

3 Data imputation algorithm

In this section, we explore a general setup where the tax system involves a combination of the following elements: (1) a social security contributions schedule, (2) a PIT schedule, and (3) tax allowances. This general setup forms the basis for the development of the proposed algorithm. In the next steps, we incorporate other tax complexities, namely other rules for calculating social security contributions and various rules for determining tax credits.
Function f_S(G) can be expanded by the rules of the social security contributions schedule to:

f_S(G) = S = Sr_s (G − L_s) + Σ_{j=1}^{s−1} Sr_j (H_j − L_j),   (3)

i.e. for each bracket, social security contributions are equal to the social security contributions marginal rate Sr_s, multiplied by the difference between gross income G and the lower bracket margin (s denotes the social security contributions bracket). This amount is added to the social security contributions collected for all "lower" brackets (i.e. brackets from 1 to s − 1). H_s and L_s denote the upper and lower social security bracket margins. Similarly, function f_PIT can be expanded by the rules of the schedule system for PIT to:

f_PIT(G) = PIT = Tr_b (G − f_S(G) − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i),   (4)

where Tr_b is the marginal tax rate for PIT bracket b, L_b is the lower margin of bracket b, Tr_i is the marginal rate for bracket i, and H_i and L_i are the upper and lower margins of bracket i, respectively. By combining (3) and (4), we obtain:

N_sb = G_sb − [Sr_s (G_sb − L_s) + Σ_{j=1}^{s−1} Sr_j (H_j − L_j)] − [Tr_b (G_sb − Sr_s (G_sb − L_s) − Σ_{j=1}^{s−1} Sr_j (H_j − L_j) − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)].   (5)

The above equation holds for an individual taxpayer when PIT was calculated by the tax authorities in such a way that social security bracket s corresponds to gross income G, and PIT bracket b corresponds to (G − S − TA). Since we do not know the actual G and S, we cannot directly establish which PIT and social security contributions brackets (and the corresponding marginal rates) were actually used for each individual taxpayer by the tax authorities. By reordering expression (5), we can express gross income as follows:

G_sb = [N_sb − Sr_s L_s − Tr_b L_b + Sr_s Tr_b L_s − Tr_b TA + (1 − Tr_b) Z_s + Z_b] / [(1 − Sr_s)(1 − Tr_b)],   (6)

where

Z_b = Σ_{i=1}^{b−1} Tr_i (H_i − L_i)   and   Z_s = Σ_{j=1}^{s−1} Sr_j (H_j − L_j).

Following our general trial-and-error scheme, the grossing-up algorithm is as follows:

1. For each statistical unit, calculate the matrix with K · B candidate gross incomes as its elements, according to equation (6):

   G_11 … G_1B
    ⋮        ⋮
   G_K1 … G_KB

where K and B are the numbers of social security contributions brackets and PIT brackets, respectively, defined by the social security contributions and PIT systems, and where k = 1, …, K and l = 1, …, B.

2. Calculate the net incomes from the matrix of candidate gross incomes according to the tax rules:

   N_11 … N_1B
    ⋮        ⋮
   N_K1 … N_KB

3. In the above matrix, find the net income N_kl which is equal to the starting net income for this individual taxpayer: N = N_kl.

4. The actual gross income G for this individual taxpayer is then: G = G_kl.

In the next subsections, we discuss the following extensions to this general setup: (1) social security contributions are not determined by the schedule system, but as a proportion of gross income or as an absolute amount (Section 3.1), and (2) tax credits are included according to various rules for their determination (Section 3.2).

3.1 Variations of social security contributions

In the following section, we extend the general setup to include cases where social security contributions are not determined by a schedule, but as a proportion of gross income or as an absolute amount.

3.1.1 Social security contributions as a proportion of gross income

When social security contributions are set as a proportion of gross income, equation (2) can be rewritten as:

N = (1 − Sr) G − f_PIT((1 − Sr) G − TA),   (7)

where Sr is the rate of social security contributions, expressed as a proportion of gross income. By simplifying equation (5), we obtain:

N_b = (1 − Sr) G_b − [Tr_b ((1 − Sr) G_b − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)],   (8)

which holds for each PIT bracket b. By reordering, we can express the gross income with the equation:

G_b = [N_b − Tr_b TA − Tr_b L_b + Z_b] / [(1 − Sr)(1 − Tr_b)].   (9)

From here, we can proceed according to the general setup outlined above.
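The general setup (equations (3)-(6) and steps 1-4) can be sketched in a few lines of code. The two-bracket schedules, bounds and rates below are illustrative assumptions, not values from the paper.

```python
# Sketch of the general grossing-up setup: evaluate equation (6) for every
# bracket pair (s, b) and keep the candidate whose implied net income
# reproduces the observed net income. Schedules are illustrative assumptions.

SSC = [(0.0, 2000.0, 0.10), (2000.0, float("inf"), 0.05)]  # (L_s, H_s, Sr_s)
PIT = [(0.0, 1000.0, 0.16), (1000.0, float("inf"), 0.27)]  # (L_b, H_b, Tr_b)

def schedule_amount(base, schedule):
    """Marginal-rate schedule, as in equations (3) and (4)."""
    return sum(r * (min(base, h) - l) for l, h, r in schedule if base > l)

def net_from_gross(G, TA):
    """Forward taxation: equation (2) with schedule-based f_S and f_PIT."""
    S = schedule_amount(G, SSC)
    return G - S - schedule_amount(G - S - TA, PIT)

def gross_from_net(N, TA):
    """Steps 1-4: compute all K * B candidates per equation (6) and return
    the one whose implied net income equals N."""
    for s, (Ls, _, Srs) in enumerate(SSC):
        Zs = sum(r * (h - l) for l, h, r in SSC[:s])
        for b, (Lb, _, Trb) in enumerate(PIT):
            Zb = sum(r * (h - l) for l, h, r in PIT[:b])
            G = (N - Srs * Ls - Trb * Lb + Srs * Trb * Ls - Trb * TA
                 + (1 - Trb) * Zs + Zb) / ((1 - Srs) * (1 - Trb))
            if abs(net_from_gross(G, TA) - N) < 1e-9:  # step 3: exact fit
                return G                               # step 4
    return None
```

With these illustrative schedules, a gross income of 3000 and a tax allowance of 500 produce a net income of 2252.5, and the routine recovers the gross income exactly (up to floating-point noise); the candidates from the other three bracket pairs fail the consistency check in step 3.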
3.1.2 Social security contributions as an absolute amount

When social security contributions are set as an absolute amount, we can simplify equation (5):

N_b = G_b − S − [Tr_b (G_b − S − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)],   (10)

and by reordering we obtain:

G_b = [N_b + S − Tr_b S − Tr_b TA − Tr_b L_b + Z_b] / (1 − Tr_b).   (11)

3.2 Grossing-up procedure when PIT is subject to a tax credit

A tax credit means that PIT is reduced by a certain amount (called a tax credit), so that the gross income source is effectively not taxed with the full PIT (the "initial PIT"), but with the PIT reduced by the amount of the tax credit (the "final PIT"). In practice, if the calculated tax credit is greater than the initial PIT, the tax credit is capped at the initial PIT and the final PIT is zero, as the net income cannot exceed gross income less social security contributions. In various tax systems, a tax credit can be defined in three ways: (1) as a proportion of the initial PIT, (2) as a proportion of the gross income, or (3) as an absolute amount.

3.2.1 Tax credit as a proportion of the initial PIT

In general, we can express a tax credit as a proportion of the initial PIT as:

N = G − f_S(G) − f_PIT(G − f_S(G) − TA) + C_PIT · f_PIT(G − f_S(G) − TA),   (12)

where C_PIT is the share of the tax credit in the initial PIT. Following the above, we can write:

N_sb = G_sb − [Sr_s (G_sb − L_s) + Σ_{j=1}^{s−1} Sr_j (H_j − L_j)] − (1 − C_PIT) [Tr_b (G_sb − Sr_s (G_sb − L_s) − Σ_{j=1}^{s−1} Sr_j (H_j − L_j) − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)],   (13)

which holds for a specific combination of social security contributions and PIT brackets. Solving (13) for G_sb, we obtain:

G_sb = [N_sb − Sr_s L_s + Z_s − (1 − C_PIT)(Tr_b L_b + Tr_b TA − Tr_b Sr_s L_s + Tr_b Z_s − Z_b)] / [(1 − Sr_s)(1 − (1 − C_PIT) Tr_b)].   (14)

From here, we can proceed according to the general setup outlined above.
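Inversions such as equation (11) can be verified by a numeric round trip: tax a known gross income through equation (10), then recover it from the resulting net income. The schedule, S and TA values below are illustrative assumptions.

```python
# Round-trip check of equation (11): SSC as an absolute amount S, a
# hypothetical two-bracket PIT schedule, and a tax allowance TA.

S, TA = 100.0, 500.0
PIT = [(0.0, 1000.0, 0.16), (1000.0, float("inf"), 0.27)]  # (L_b, H_b, Tr_b)

def net(G):
    """Forward calculation per equation (10)."""
    base = G - S - TA
    pit = sum(r * (min(base, h) - l) for l, h, r in PIT if base > l)
    return G - S - pit

def gross_candidates(N):
    """Equation (11), evaluated once per PIT bracket b."""
    cands = []
    for b, (Lb, _, Trb) in enumerate(PIT):
        Zb = sum(r * (h - l) for l, h, r in PIT[:b])
        cands.append((N + S - Trb * S - Trb * TA - Trb * Lb + Zb) / (1 - Trb))
    return cands

N = net(3000.0)  # 2362.0 with these illustrative parameters
true_G = next(g for g in gross_candidates(N) if abs(net(g) - N) < 1e-9)
```

Only the candidate computed with the correct bracket reproduces N; the other candidate fails the consistency check, which is exactly how the trial-and-error scheme discards it.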
When the tax credit is set as a proportion of the initial PIT and social security contributions are defined by a schedule, equation (14) should be used instead of expression (6) in the general setup.

Tax credit as a proportion of the initial PIT and social security contributions as a proportion of the gross income

Where social security contributions are set as a proportion of the gross income, the general procedure can be simplified. In this case, the net income can be expressed as:

N = (1 − Sr) G − f_PIT((1 − Sr) G − TA) + C_PIT · f_PIT((1 − Sr) G − TA).   (15)

The following equation holds for a particular PIT bracket b:

N_b = (1 − Sr) G_b − (1 − C_PIT) [Tr_b ((1 − Sr) G_b − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)].   (16)

By reordering we obtain:

G_b = [N_b − (1 − C_PIT)(Tr_b L_b + Tr_b TA − Z_b)] / [(1 − Sr)(1 − (1 − C_PIT) Tr_b)].   (17)

Thus, when a tax credit is set as a proportion of the initial PIT and social security contributions are set as a proportion of the gross income, the above equation should be used instead of expression (6) in the general setup.

Tax credit as a proportion of the initial PIT and social security contributions as an absolute amount

If social security contributions are set as an absolute amount, we can redefine equation (10) to incorporate a tax credit as a proportion of the initial PIT:

N_b = G_b − S − (1 − C_PIT) [Tr_b (G_b − S − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)],   (18)

and by reordering we obtain:

G_b = [N_b + S + (C_PIT − 1)(Tr_b L_b + Tr_b S + Tr_b TA − Z_b)] / (1 − Tr_b + C_PIT Tr_b).   (19)

Thus, when a tax credit is set as a proportion of the initial PIT and social security contributions are set in an absolute amount, the above equation should be used instead of expression (6) in the general setup.

3.2.2 Tax credit as a proportion of the gross income

If the amount of a tax credit is defined as a proportion of the gross income, the net income calculation can be formalized as:

N = G − f_S(G) − f_PIT(G − f_S(G) − TA) + C_G · G,   (20)

where C_G is the tax credit share of the gross income. For clarity, we can denote the initial PIT as:

PIT_i = f_PIT(G − f_S(G) − TA)   (21)

and the final PIT as:

PIT_f = f_PIT(G − f_S(G) − TA) − C_G · G.   (22)

Since a tax credit can be at most as high as the initial PIT, the following rule applies:

N = G − f_S(G) − PIT_f   if C_G · G < PIT_i;
N = G − f_S(G)           if C_G · G > PIT_i.   (23)

Due to this rule, the gross income cannot be easily estimated from net income N and tax allowances TA, as PIT_i and C_G · G are not known at this stage. The rule implies that the actual calculation of net income N for each taxpayer was done by the tax authorities either by:

N = G − f_S(G) − f_PIT(G − f_S(G) − TA) + C_G · G   (24)

when C_G · G < f_PIT(G − f_S(G) − TA), or by:

N = G − f_S(G)   (25)

when C_G · G > f_PIT(G − f_S(G) − TA). When we are interested in G, we can use these two approaches in reverse fashion (calculating G and not N), but we do not know which one, (24) or (25), is correct. Let us consider the case when we calculate G for a particular taxpayer from the known values of N, TA and the PIT schedule (as in Table 1), once by using the rule expressed in equation (24) and once by using the rule expressed in (25). We obtain two estimates of the taxpayer's gross income G:

G′ = N + f_S(G) + f_PIT(G − f_S(G) − TA) − C_G · G   (26)

and

G″ = N + f_S(G).   (27)

If the net income N for this particular taxpayer was actually calculated according to expression (24), this inequality holds true:

(N + f_S(G) + f_PIT(G − f_S(G) − TA) − C_G · G) > (N + f_S(G)),   (28)

since C_G · G < f_PIT(G − f_S(G) − TA) must hold. By using (26) and (27), we obtain:

G′ > G″.   (29)

The proper value of gross income G is G′, since net income N for this particular taxpayer was actually calculated according to expression (24). Let us consider the opposite case, where net income N for our taxpayer was actually calculated (by the tax authorities) according to (25). In this case, we can write:

(N + f_S(G) + f_PIT(G − f_S(G) − TA) − C_G · G) < (N + f_S(G)),   (30)

since C_G · G > f_PIT(G − f_S(G) − TA) must hold. By using (26) and (27), we obtain:

G′ < G″.   (31)

The proper value of gross income G in this case is G″. Following (29) and (31), we can conclude that in both cases the highest of the values G′ and G″ is the one that actually holds:

G = max(G′, G″)   (32)

or

G = max((N + f_S(G) + f_PIT(G − f_S(G) − TA) − C_G · G), (N + f_S(G))).   (33)

For the construction of a general setup in the case of tax credits given as a proportion of the gross income, where social security contributions and PIT are calculated according to their schedules, we need to express equation (33) in a more exact way, for a specific combination of social security contribution and PIT brackets. The specific form for equation (26) is then:

N_sb = G_sb − [Sr_s (G_sb − L_s) + Σ_{j=1}^{s−1} Sr_j (H_j − L_j)] − [Tr_b (G_sb − Sr_s (G_sb − L_s) − Σ_{j=1}^{s−1} Sr_j (H_j − L_j) − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)] + C_G G_sb,   (34)

and for equation (27):

N_sb = G_sb − [Sr_s (G_sb − L_s) + Σ_{j=1}^{s−1} Sr_j (H_j − L_j)].   (35)

From expression (34) we obtain:

G′_sb = [N_sb − Sr_s L_s − Tr_b L_b + Sr_s Tr_b L_s − Tr_b TA + (1 − Tr_b) Z_s + Z_b] / (1 + C_G − Sr_s − Tr_b + Sr_s Tr_b),   (36)

and from expression (35):

G″_sb = (N_sb − Sr_s L_s + Z_s) / (1 − Sr_s).   (37)

According to expression (32), we can establish the right value for gross income G_sb:

G_sb = max(G′_sb, G″_sb).   (38)

Tax credit as a proportion of the gross income and social security contributions as a proportion of the gross income

Where social security contributions are set as a proportion of the gross income, the calculation of net income N for each taxpayer was done by the tax authorities either by:

N = (1 − Sr) G − f_PIT((1 − Sr) G − TA) + C_G · G   (39)

when C_G · G < f_PIT((1 − Sr) G − TA), or by:

N = (1 − Sr) G   (40)

when C_G · G > f_PIT((1 − Sr) G − TA). The reasoning is similar to that above, where we constructed equations (34) and (35). These two equations can be simplified, since we only have one social security contributions rate Sr, and we obtain:

N_b = (1 − Sr) G_b − [Tr_b ((1 − Sr) G_b − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)] + C_G G_b   (41)

and

N_b = (1 − Sr) G_b.   (42)

From this, we obtain two solutions for G_b:

G′_b = [N_b − Tr_b TA − Tr_b L_b + Z_b] / (1 + C_G − Sr − Tr_b + Sr Tr_b)   (43)

and

G″_b = N_b / (1 − Sr).   (44)

The gross income for both cases can then be calculated from G_b = max(G′_b, G″_b), where (43) and (44) should be used in the general setup instead of (36) and (37), respectively. Again, the matrix of candidate solutions is one-dimensional (a vector of gross income candidates, i.e. one value for each PIT bracket), since there is only one social security contributions rate.

Tax credit as a proportion of the gross income and social security contributions as an absolute amount

In this case, the procedure can follow the same principles we used to construct equations (34) and (35). Since social security contributions are now set as an absolute amount, these two equations can be simplified:

N_b = G_b − S − [Tr_b (G_b − S − TA − L_b) + Σ_{i=1}^{b−1} Tr_i (H_i − L_i)] + C_G G_b   (45)

and

N_b = G_b − S.   (46)

By reordering, we obtain the two candidate gross incomes G′_b = [N_b + S − Tr_b S − Tr_b TA − Tr_b L_b + Z_b] / (1 + C_G − Tr_b) and G″_b = N_b + S, and the gross income is again G_b = max(G′_b, G″_b).

Tax credit as an absolute amount

When the tax credit is given as an absolute amount C, the final PIT is PIT_f = PIT_i − C, and the gross income for both cases can then be calculated from:

N = G − f_S(G) − PIT_f   if C < PIT_i;
N = G − f_S(G)           if C > PIT_i.   (51)

If net income N for a particular taxpayer was actually calculated according to the case C < PIT_i in (51), this inequality holds true:

(N + f_S(G) + f_PIT(G − f_S(G) − TA) − C) > (N + f_S(G)),   (52)

since C < f_PIT(G − f_S(G) − TA) must hold. In the opposite case, i.e. if net income N was calculated according to C > PIT_i, then:

(N + f_S(G) + f_PIT(G − f_S(G) − TA) − C) < (N + f_S(G)),   (53)

since C > f_PIT(G − f_S(G) − TA) must hold.
Following a similar reasoning to that in Section 2.3.2, we can conclude that the actual gross income G for a particular taxpayer must be: (46) G _ max{{N + fS(G) + fP,T(G- fS(G) --TA) - C),{N + fS(G))) Quantity G' can be estimated from: (54) and G,_ Nb + S - TfbLb - TfbS - TfbTA+Sb (47) Gb_-^-- (4') 1 + Cg - Tfb Gl_ Nb + S, (48) which should be used in the general setup instead of (36) and (37), respectively. Nsb _ Gsb - Tr Srs(Gb - Ls) + XSr,(H, - L,) J _1 Gsb - Srs(Gsb - Ls) + 2Srj(H, - L,) -TA- Lb) + 2Tr,{Hi - L,) J■_1 + C and solving for Gsb: (55) 3.2.3 Tax credit as an absolute amount If the amount of a tax credit is defined as an absolute amount, the procedure is similar to the one described in Section 2.3.2. The net income can be expressed as: N _ G - fS (G) - fP^,T (G - fS (G) - TA) + C, PITp _ fPIT(G - fS (G) - TA) - C. G _ Nb + C + SrsLs + TrbLb - Sr^TrbLs {Srs-1){Trb -1) TrbTA + Trb-Es -^s -^b {Srs-l){Trb -1) ' + (49) (56) where C is the amount of the tax credit. The initial PIT is the same as in Section 2.3.2, equation (21), and the final PIT is: (50) whereas the estimation of G is already explained in (36) and (38). We can conclude that in cases where the amount of a tax credit is defined as an absolute amount, the general setup is the same as that described in Section 2.3.2, except for equation (36), which should be substituted by equation (56). 
The following rule applies: i_1 i _1 + Tax credit as an absolute amount and social security contributions as a propotiion of the gross income Where when social security contributions are set as a proportion of the gross income, the calculation of net income N for each taxpayer was done by the tax authorities either by: N = (1 - Sr)G- fj,^((1 - Sr)G- TA) + C (57) when C < fj,^((1 - Sr)G- TA), or by: N = (1 - Sr)G (58) Nb =(1 - Sr) Gb-(Trb ((1 - Sr) G^ - TA - L^) + Tri (H, - L,) + C (59) and from this, we can obtain: Nb - C - TrbLb - TrbTA +^b (Sr-l)(Trb -1) (60) which should be used in the general setup instead of (43), whereas equation (44) also applies in this case for obtaining G". Again, the matrix of candidate solutions is one-dimensional (a vector for gross income candidates, i.e. one value for each PIT bracket). Tax credit as an absolute amount and social security contributions as an absolute amount In this case, the procedure can follow the same principle we introduced in Section 2.3.2. Since social security contributions are given as an absolute amount, equation (34) can be written in this way: Nb = Gb - S-(Trb (Gb - S - TA - Lb) + ^b b-1 Tr, (Hi - Li) + C, (61) 3.3 Algorithm in its full form For clarity, the grossing-up procedure that we developed in the above sub-sections is given below, including all combinations of the taxation rules that we described in the subsections following the basic setup at the beginning of Section 3. 1. when C > fi,^ ((1 - Sr)G - TA). The reasoning is similar to that in Section 2.3.2. Equation (41) can be rewritten in the following form: For each statistical unit, calculate the matrix of K ■ B candidate gross incomes according to equation (6): G = G,, G. G,., G„ Gk^b where K and B are the number of social security contribution brackets and the number of tax brackets, both defined by the tax and social security contribution systems, respectively, where k = ,K and l = B. 
Formulas for specific combinations of taxation rules can be found in Table 2. In cases where only the tax schedule system is used and social security contributions related to the acquisition of the income are set as one parameter, the above matrix of candidate gross incomes becomes a vector {G1,^, G,Gg}. 2. Calculate the net incomes from the matrix of candidate gross incomes according to the tax rules: N = N,, N, N, Nk1 N or {N1,^,Ni,Ng}. In the above matrix, find net income Nkl (or Nl), which is equal to the starting net income: whereas equation (46) also holds in the case social security contributions are set as an absolute amount. The gross income can then be calculated from (61) as: G, Nb - C + S - TrbLb - TrbS - TrbTA + ^b (62) Gb =-^-, (62) 1 - Tr N = N,, (or N = Ni). 4. The actual gross income G is then: G = G,, (or G = Gi). which should be used in the general setup instead of (36), together with (48), which was derived from (46). i =1 3. '=1 32 Informatica 39 (2015) 23-34 M. Verbič et al. 
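The four steps above can be sketched in code. The following is a minimal, hypothetical illustration for one of the simpler rule combinations (PIT schedule, social security contributions as a proportion of the gross income, no tax credit, i.e. combination II); the bracket bounds and rates are the synthetic values used later in Section 4, and all function names are our own.

```python
# Illustrative sketch of the candidate-based grossing-up procedure for
# combination II (PIT schedule, SSC as a proportion of gross income, no
# tax credit). Bracket bounds, rates, SR and TA are assumed test values.

PIT = [(0, 20_000, 0.15), (20_000, 50_000, 0.25), (50_000, float("inf"), 0.45)]
SR, TA = 0.22, 2_000.0  # SSC rate (proportion of gross), tax allowance

def pit(base):
    """PIT schedule f_PIT applied to a tax base."""
    tax = 0.0
    for lo, hi, rate in PIT:
        tax += rate * max(0.0, min(base, hi) - lo)
    return tax

def net_from_gross(g):
    """Forward calculation, as done by the tax authorities."""
    return (1 - SR) * g - pit((1 - SR) * g - TA)

def gross_from_net(n):
    """Step 1: one candidate gross income per PIT bracket b;
    step 2: recompute the net income for each candidate;
    steps 3-4: keep the candidate that reproduces the starting net income."""
    for lo, hi, tr in PIT:
        sigma_b = pit(lo)  # PIT paid in all lower brackets (the Sigma_b term)
        g = (n - tr * TA - tr * lo + sigma_b) / ((1 - SR) - tr * (1 - SR))
        if abs(net_from_gross(g) - n) < 1e-4:
            return g
    raise ValueError("no candidate matched")
```

Only the candidate computed with the bracket that actually contains the taxpayer's tax base reproduces the starting net income, which is exactly why step 3 identifies the correct gross income.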
System without tax credits:
I. Schedule for social security contributions:
G_sb = [N_sb - Sr_s L_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA - Tr_b Σ_s + Σ_s + Σ_b] / [(Sr_s - 1)(Tr_b - 1)]  (6)
II. Social security contributions as a proportion of gross income:
G_b = [N_b - Tr_b TA - Tr_b L_b + Σ_b] / [(1 - Sr) - Tr_b(1 - Sr)]  (9)
III. Social security contributions as an absolute amount:
G_b = [N_b + S - Tr_b S - Tr_b TA - Tr_b L_b + Σ_b] / (1 - Tr_b)  (11)

Tax credit as a proportion of the initial PIT:
IV. Schedule for social security contributions:
G_sb = [N_sb - Sr_s L_s - Tr_b L_b + c_PIT Tr_b L_b + Tr_b Sr_s L_s - c_PIT Tr_b Sr_s L_s - Tr_b TA + c_PIT Tr_b TA - Σ_s(Tr_b(1 - c_PIT) - 1) - (c_PIT - 1)Σ_b] / [(1 - Sr_s)(Tr_b(c_PIT - 1) + 1)]  (14)
V. Social security contributions as a proportion of gross income:
G_b = [N_b - (1 - c_PIT)(Tr_b L_b + Tr_b TA - Σ_b)] / [(1 - Sr)(1 - Tr_b + c_PIT Tr_b)]  (17)
VI. Social security contributions as an absolute amount:
G_b = [N_b + S + (c_PIT - 1)(Tr_b L_b + Tr_b S + Tr_b TA - Σ_b)] / (1 - Tr_b + c_PIT Tr_b)  (19)

Tax credit as a proportion of gross income, with G = max(G', G'') (38):
VII. Schedule for social security contributions:
G'_sb = [N_sb - Sr_s L_s - Tr_b Σ_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA + Σ_s + Σ_b] / [1 + c_G - Sr_s - Tr_b + Sr_s Tr_b]  (36)
G''_sb = [N_sb - Sr_s L_s + Σ_s] / (1 - Sr_s)  (37)
VIII. Social security contributions as a proportion of gross income:
G'_b = [N_b - Tr_b L_b - Tr_b TA + Σ_b] / [1 + c_G - Sr - Tr_b + Sr Tr_b]  (43)
G''_b = N_b / (1 - Sr)  (44)
IX. Social security contributions as an absolute amount:
G'_b = [N_b + S - Tr_b L_b - Tr_b S - Tr_b TA + Σ_b] / (1 + c_G - Tr_b)  (47)
G''_b = N_b + S  (48)

Tax credit as an absolute amount, with G = max(G', G'') (38):
X. Schedule for social security contributions:
G'_sb = [N_sb - C - Sr_s L_s - Tr_b L_b + Sr_s Tr_b L_s - Tr_b TA - Tr_b Σ_s + Σ_s + Σ_b] / [(Sr_s - 1)(Tr_b - 1)]  (56)
G''_sb = [N_sb - Sr_s L_s + Σ_s] / (1 - Sr_s)  (37)
XI. Social security contributions as a proportion of gross income:
G'_b = [N_b - C - Tr_b L_b - Tr_b TA + Σ_b] / [(Sr - 1)(Tr_b - 1)]  (60)
G''_b = N_b / (1 - Sr)  (44)
XII. Social security contributions as an absolute amount:
G'_b = [N_b - C + S - Tr_b L_b - Tr_b S - Tr_b TA + Σ_b] / (1 - Tr_b)  (62)
G''_b = N_b + S  (48)

Table 2: Equations for specific combinations of taxation rules.

4 Results and discussion

Table 3 presents a summary of all possible social security contributions and tax credit combinations explored in Section 3. In reality, for any income source one of these combinations is applicable. Parallel to this, the PIT schedule system and tax allowances in absolute amounts are assumed. Our approach can also be applied to flat PIT systems (i.e. with a single proportional PIT rate); if this is the case, we apply only one PIT bracket with a positive marginal PIT rate. Where tax allowances are not set as absolute amounts, they can be expressed as an 'additional layer' of social security contributions.

                                      SSC schedule | SSC as proportion of gross income | SSC as absolute amount
System without tax credits                  I      |                II                 |          III
Tax credit as proportion of initial PIT     IV     |                V                  |          VI
Tax credit as proportion of gross income    VII    |                VIII               |          IX
Tax credit as an absolute amount            X      |                XI                 |          XII

Table 3: Summary of tax rule combinations (detailed equations are given in Table 2).

To test the validity and accuracy of the proposed algorithm, we created a synthetic sample of 10,000 taxpayers with a normally distributed gross income, where the mean gross income was 50,000 mu (monetary units) and the standard deviation was 11,500 mu. We assumed the following tax parameters:

1. The PIT schedule includes three brackets:
• 0 - 20,000 mu, a 15% marginal PIT rate;
• 20,000 - 50,000 mu, a 25% marginal PIT rate;
• over 50,000 mu, a 45% marginal PIT rate.

2. The social security schedule includes three brackets:
• 0 - 10,000 mu, a 17% marginal rate;
• 10,000 - 40,000 mu, a 20% marginal rate;
• over 40,000 mu, a 0% marginal rate.
3. Social security contributions as a proportion of the gross income were set at 22%.

4. Social security contributions as an absolute amount were set at 500 mu.

5. Tax allowances were set at an absolute amount of 2,000 mu.

6. The amounts of tax credits were given as follows: 13% of the gross income, 6% of the initial PIT, or 200 mu.

These parameters were applied to the entire population of taxpayers according to the general procedure for taxing gross income (Table 1) and the specific combination of tax rules from Table 3. In the first step, we generated the amount of gross income for each taxpayer. In the second step, we calculated the net income according to combinations I to XII (from Table 3) of the tax rules, as is done in practice by tax authorities. In the third step, we applied the proposed grossing-up algorithm to combinations I-XII for each taxpayer. Finally, we compared the grossed-up income with the initial gross income. The comparison of the gross income, calculated from the net income using the grossing-up algorithm, with the initial gross income demonstrates the complete accuracy of the algorithm for all income types. As an example, we can repeat the steps for an individual taxpayer with a gross income equal to 49,433.10 mu. In the second step, we calculated the net income amount for all 12 combinations of the tax rules (see Table 4).

                                      SSC schedule   | SSC as proportion of gross income | SSC as absolute amount
System without tax credits              33,799.80 (I)  |        31,418.30 (II)             |   39,199.80 (III)
Tax credit as proportion of initial PIT 34,275.80 (IV) |        31,846.70 (V)              |   39,783.80 (VI)
Tax credit as proportion of gross income 40,226.10 (VII) |      37,844.60 (VIII)           |   45,626.10 (IX)
Tax credit as an absolute amount        33,999.80 (X)  |        31,618.30 (XI)             |   39,399.80 (XII)

Table 4: Net income for a chosen taxpayer with G = 49,433.10 mu.
For each net income from Table 4, we applied the grossing-up algorithm (i.e. the equations from Table 2). According to the technique, several gross income candidates were calculated for each of these net incomes. Due to space limitations, here we (arbitrarily) present the gross income candidates for net income VII:

G^VII =
48,465.20  50,134.40  48,465.20
49,907.60  51,371.40  49,907.60
47,926.10  49,433.10  47,926.10

To each of these gross income candidates we applied the taxation rules (in this case the combination of taxation rules VII) and calculated the net income:

N^VII =
39,374.40  40,843.20  39,374.40
40,643.70  41,931.80  40,643.70
38,900.00  40,226.10  38,900.00

By comparing the elements of matrix N^VII with the net income for the combination of tax rules VII from Table 4, which equals 40,226.10 (VII), we identified the matching element in the third row and the second column. The corresponding gross income in matrix G^VII equals 49,433.10, which is identical to the initial gross income of this particular taxpayer. In other words, for this combination of tax rules (VII), the proposed grossing-up algorithm is accurate. We repeated such tests for all 12 tax rule combinations and for 10,000 individual cases.

5 Conclusion

In this paper, we presented a detailed construction of a deterministic data imputation algorithm. In particular, we described an exact grossing-up algorithm for calculating pre-tax income from data which are only available in net (after-tax) form, and demonstrated its validity, since it leads to complete data reconstruction. Contemporary tax systems are rich in complexity, and some tax rule combinations might not be covered by our technique. However, we believe that the general architecture of our proposition is sound and flexible enough to incorporate (with some modifications) additional, locally specific tax rules.
In general, if a set of rules relating the variables under investigation can be assembled, researchers and policy makers can perform data imputation in a deterministic fashion and construct an algorithm for the exact analytical generation of the missing values. In future research, a framework for the feasibility assessment of such an approach could be envisioned, which would employ estimates of the rules' consistency and complexity on the one hand, and measures of the quality of the replicated data on the other.

References

[1] Rancourt, E. (2007). Assessing and dealing with the impact of imputation through variance estimation. Statistical Data Editing: Impact on Data Quality. New York: United Nations.
[2] Rueda, M. M., Gonzalez, S. & Arcos, A. (2005). Indirect methods of imputation of missing data based on available units. Applied Mathematics and Computation 164: 249-261.
[3] Smirlis, Y. G., Maragos, E. K. & Despotis, D. K. (2006). Data envelopment analysis with missing values: An interval DEA approach. Applied Mathematics and Computation 177: 1-10.
[4] Raghunathan, T. E., Lepkowski, J. M., van Hoewyk, J. & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27: 85-95.
[5] Franklin, S. & Walker, C. (2003). Survey Methods and Practices. Ottawa: Statistics Canada.
[6] Fuest, C., Peichl, A. & Schaefer, T. (2008). Is a flat tax reform feasible in a grown-up democracy of Western Europe? A simulation study for Germany. International Tax and Public Finance 15: 620-636.
[7] Immervoll, H. & O'Donoghue, C. (2001). Imputation of gross amounts from net incomes in household surveys: An application using EUROMOD. EUROMOD Working Paper EM1/01. Colchester: ISER - Institute for Social and Economic Research.
[8] D'Amuri, F. & Fiorio, C. V. (2009). Grossing-up and validation issues in an Italian tax-benefit microsimulation model.
Econpubblica Working Paper 117. Milano: University of Milan.
[9] Betti, G., Donatiello, G. & Verma, V. (2011). The Siena microsimulation model (SM2) for net-gross conversion of EU-SILC income variables. International Journal of Microsimulation 4: 35-53.
[10] ISER - Institute for Social and Economic Research, https://www.iser.essex.ac.uk/euromod (April 16th, 2012).
[11] OECD - Organisation for Economic Co-operation and Development (2006). Reforming Personal Income Tax. Policy Brief, March. Paris: OECD.
[12] Zee, H. H. (2005). Personal income tax reform: Concepts, issues, and comparative country developments. IMF Working Paper 87. Washington: International Monetary Fund.
[13] Sørensen, P. B. (2005). Dual income tax: Why and how? FinanzArchiv 61: 559-586.
[14] Ivanova, A., Keen, M. & Klemm, A. (2005). The Russian 'flat tax' reform. Economic Policy 20: 397-444.
[15] Moore, D. (2005). Slovakia's 2004 tax and welfare reforms. IMF Working Paper 133. Washington: International Monetary Fund.
[16] OECD - Organisation for Economic Co-operation and Development (2013). Taxing Wages 2011-2012. Paris: OECD.

The slWaC Corpus of the Slovene Web

Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana
E-mail: tomaz.erjavec@ijs.si

Nikola Ljubešić
Faculty of Humanities and Social Sciences, University of Zagreb, Ivana Lučića 3, Zagreb, Croatia
E-mail: nikola.ljubesic@ffzg.hr

Nataša Logar
Faculty of Social Sciences, University of Ljubljana, Kardeljeva ploščad 5, Ljubljana
E-mail: natasa.logar@fdv.uni-lj.si

Keywords: language technologies, corpus linguistics, World Wide Web, Slovene language

Received: February 25, 2015

The availability of large collections of text (language corpora) is crucial for empirically supported linguistic investigations of various languages; however, such corpora are complicated and expensive to collect.
In recent years corpora made from texts on the World Wide Web have become an attractive alternative to traditional corpora, as they can be made automatically, contain varied text types of contemporary language, and are quite large. The paper describes version 2 of slWaC, a Web corpus of Slovene containing 1.2 billion tokens. The corpus extends the first version of slWaC with new materials and updates the corpus compilation pipeline. The paper describes the process of corpus compilation with a focus on near-duplicate removal, and presents the linguistic annotation, format and accessibility of the corpus via Web concordancers. It then investigates the content of the corpus using the method of frequency profiling, by comparing its lemma and part-of-speech annotations with three corpora: the first version of slWaC; Gigafida, the one-billion-word reference corpus of Slovene; and KRES, the hundred-million-word reference balanced corpus of Slovene. Povzetek: Dostopnost velikih zbirk besedil (jezikovnih korpusov) je nujna za empirično podprte jezikoslovne raziskave posameznih jezikov, vendar pa je izdelava takih korpusov draga in zamudna. Korpusi besedil, zajetih s spleta, so v zadnjem času postali popularen vir jezikovnih vsebin, saj jih lahko zgradimo avtomatsko, vsebujejo pester nabor sodobnih besedilnih zvrsti in so zelo veliki. Prispevek predstavlja drugo različico korpusa slWaC, spletnega korpusa slovenščine, ki vsebuje 1,2 milijardi pojavnic. Korpus dopolnjuje prvo različico slWaC z novimi besedili, pridobljenimi z izboljšanimi orodji za zajem. V prispevku opišemo izdelavo korpusa s poudarkom na odstranjevanju podobnih vsebin ter jezikoslovno označevanje, format korpusa in njegovo dostopnost prek konkordančnika.
Nato raziščemo vsebino korpusa z uporabo metode frekvenčnega profila, pri katerem leme in oblikoskladenjske oznake druge različice korpusa slWaC primerjamo s tremi korpusi: s prvo različico korpusa slWaC, z referenčnim korpusom Gigafida, ki vsebuje milijardo besed, in s stomilijonskim referenčnim uravnoteženim korpusom KRES.

1 Introduction

Large collections of digitally stored and uniformly encoded texts - language corpora - have for a number of years been the basic data resources that linguists, including lexicographers, have used for their investigations into language and for making dictionaries. However, the traditional way of compiling corpora, which involved acquiring texts from authors and publishers and handling many disparate formats, was very expensive in terms of time and labour. With the advent of the Web, a vast new source of linguistic information has emerged. The exploitation of this resource has especially gained momentum with the WaCky initiative [1], which has popularised the concept of "Web as Corpus". It has also made available tools for compiling such corpora and produced large WaC corpora for a number of major European languages. Now such corpora are also being built for the so-called smaller languages, such as Norwegian [8], Czech [18] and Serbian [11], moving the concept of a "large corpus" for smaller languages up to the 1 billion token frontier. As Web corpus acquisition is much less controlled than that for traditional corpora, the necessity of analysing their content gains in significance. The linguistic quality of the content is mostly explored through word lists and collocates [1], while the content itself is explored using unsupervised methods, such as clustering and topic modelling [17]. For Slovene, a Web corpus has already been built [12]. However, the first version of slWaC (hereafter slWaC1) was rather small, as it contained only 380 million words.
Furthermore, it contained domains from the Slovene top-level domain only, i.e. only URLs ending with ".si" were harvested. In the meantime, hrWaC, the Croatian Web corpus, had already moved to version 2, touching the 2 billion token mark, and Web corpora for Serbian and Bosnian were built as well [11], all of them passing the size of slWaC1, making it high time to move forward also with slWaC. This paper presents version 2 of slWaC (hereafter slWaC2), which tries to overcome the limitations of slWaC1: it extends it with a new crawl, which also includes well-known Slovene Web domains from other top-level domains, and introduces a new pipeline for corpus collection and cleaning, resulting in a corpus of 1.2 billion tokens with removed near-duplicate documents and flagged near-duplicate paragraphs. The rest of the paper is structured as follows: Section 2 presents the corpus construction pipeline, Section 3 introduces the linguistic annotation of the corpus, its format and its availability for on-line concordancing, Section 4 investigates the content of the corpus by comparing it to slWaC1, to the balanced corpus of Slovene KRES, and to the reference corpus of Slovene Gigafida, while Section 5 gives some conclusions and directions for future work.

2 Corpus construction

2.1 Crawling

For performing the new crawl we used the SpiderLing crawler1 with its associated tools for guessing the character encoding of a Web page, its content extraction (boilerplate removal), language identification and near-duplicate removal [19]. The SpiderLing crawler uses the notion of yield rate to optimize the crawling process regarding the amount of unique textual material retrieved given the overall amount of data retrieved. Yield rate is calculated for each Web domain as the ratio of the bytes of text contributed to the final corpus and the bytes retrieved from that domain.
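The yield-rate bookkeeping described above can be sketched as follows. This is our own illustrative reconstruction, not SpiderLing's actual code; the threshold value and the minimum-traffic guard are assumed parameters.

```python
# Sketch of the yield-rate heuristic: a domain's yield rate is the ratio of
# bytes of text it contributed to the corpus to the total bytes downloaded
# from it; domains with a low yield rate are blacklisted from further crawling.
# The 0.01 threshold and 1 MB minimum are illustrative assumptions.

def yield_rate(text_bytes: int, fetched_bytes: int) -> float:
    return text_bytes / fetched_bytes if fetched_bytes else 0.0

def update_blacklist(stats, threshold=0.01, min_fetched=1_000_000):
    """stats: {domain: (text_bytes, fetched_bytes)}.
    Returns the set of domains to stop crawling; domains with little traffic
    so far are left alone, since their yield rate is not yet reliable."""
    return {d for d, (t, f) in stats.items()
            if f >= min_fetched and yield_rate(t, f) < threshold}
```

Using a lower threshold, as the paper does for smaller languages, keeps more borderline domains in the crawl at the cost of downloading more low-yield data.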
Web domains with a yield rate under a predefined threshold are discarded from further crawling, thereby focusing the remaining crawl on the domains where more unique textual material is to be found. SpiderLing has two predefined yield rates that control when a low-yield-rate Web domain is blacklisted; we used the lower one, which is recommended for smaller languages. As seed URLs we used the home pages of Web domains obtained during the construction of slWaC1 and additionally 30 well-known Slovene Web domains which are outside the .si top-level domain. The crawl was run for 21 days, with 8 cores used for document processing, which includes guessing the text encoding, text extraction, language identification and physical duplicate removal, i.e. removing copies of identical pages which appear under different URLs. After the first 14 days there was a significant decrease in computational load, showing that most of the domains had already been harvested and that the process of exhaustively collecting textual data from the extended Slovene top-level domain was almost finished. After completing the crawling process, which already included document preprocessing, we merged the new crawl with slWaC1. We added the old dataset to the end of the new one, thereby giving priority to new data in the following process of near-duplicate removal. It should be noted that the corpus can, in cases when the content has changed, contain two texts with the same URL but with different crawl dates.

2.2 Near-duplicate removal

We performed near-duplicate identification both on the document and the paragraph level using the onion tool2 with its default settings, i.e. by calculating 5-gram overlap and using the 0.5 duplicate content threshold. We removed the document-level near-duplicates entirely from the corpus, while keeping paragraph-level near-duplicates, labelling them with a binary attribute on the paragraph element. This means that the corpus still contains the (near-)duplicate paragraphs, which is advantageous for showing contiguous text from Web pages, but if, say, language modelling for statistical machine translation were to be performed [10], near-duplicate paragraphs can easily be removed.
The resulting size of the corpus (in millions of tokens) after each of the three duplicate removal stages is given in Table 1. We compare those numbers to the ones obtained on the Croatian, Bosnian and Serbian domains [11], showing that the second versions of the corpora (hrWaC and slWaC), which merge two crawls obtained with different tools and were collected three years apart, show a smaller level of reduction (around 30%) at each step of near-duplicate removal, while the first versions of corpora (bsWaC and srWaC), obtained with SpiderLing only and in one crawl, suffer more data loss in this process (around 35-40%).
1http://nlp.fi.muni.cz/trac/spiderling
2https://code.google.com/p/onion/
PHY DND PND R1 R2
slWaC2 1,806 1,258 895 0.31 0.29
hrWaC2 2,686 1,910 1,340 0.29 0.30
bsWaC1 722 429 288 0.41 0.33
srWaC1 1,554 894 557 0.42 0.37
Table 1: Sizes of the Web corpora in millions of tokens after removing physical duplicates (PHY), document near-duplicates (DND) and paragraph near-duplicates (PND), with the reduction ratio (R1 and R2) after the DND and subsequent PND steps.
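The 5-gram overlap criterion used by onion can be illustrated with a naive sketch. This is our own simplification, not onion's implementation (which hashes n-grams and streams over the corpus); the 0.5 threshold matches the default setting mentioned above.

```python
# Naive sketch of onion-style near-duplicate removal: a document is dropped
# if more than half of its word 5-grams have already been seen in documents
# processed earlier. Documents shorter than five words are skipped entirely.

def five_grams(text):
    words = text.split()
    return {tuple(words[i:i + 5]) for i in range(len(words) - 4)}

def deduplicate(docs, threshold=0.5):
    seen, kept = set(), []
    for doc in docs:
        grams = five_grams(doc)
        if not grams:
            continue  # too short to judge
        overlap = len(grams & seen) / len(grams)
        if overlap <= threshold:  # enough novel content: keep the document
            kept.append(doc)
        seen |= grams
    return kept
```

Because older data is appended after the new crawl, processing documents in that order gives priority to the new data, as described in Section 2.1.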
2.3 Linguistic annotation
slWaC2 was tagged and lemmatised with ToTaLe [4] trained on JOS corpus data [5]. However, it should be noted that ToTaLe had been slightly updated, so in particular the tokenisation of slWaC1 and slWaC2 at times differs. The morphosyntactic descriptions (MSDs) that the words of the corpus are annotated with follow the JOS MSD specifications; however, these do not define a tag for punctuation. As practical experience has shown this to be a problem, we have introduced a punctuation category and MSD, named "Z" in English and "U" in Slovene.
3 Overview of the corpus 3.1 Size of the corpus
Table 2 gives the size of slWaC2, showing separately the amount of information from the 2011 crawl, from the 2014 crawl, and overall amount of information. For each of the counted elements we give the size of the corpus after removing document near-duplicates (DND from Table 1), and for the corpus which has also paragraph near-duplicates removed (PND).
Starting with the number of domains, it can be seen that the new crawl produced fewer domains than the first one, due to a large number of static domains (from the complete space of URLs) being removed in the physical deduplication stage (PHY). Nevertheless, the complete corpus has, in comparison to slWaC1, about 12,000 new domains. Observing the URLs, we note that the new crawl gave somewhat fewer URLs than the old one, and that there is little overlap between the two, i.e. about 1%: 28,315 URLs are the same in both crawls, which means that their content has changed in the last three years (and they are then distinguished in the corpus by having a different crawl date).
Regarding the number of paragraphs, we give both the numbers for DND and PND, with the reduction being very similar to the reduction on the token level already expressed in Table 1, i.e. 29%. For paragraphs, sentences, words and tokens, the complete corpus is simply the sum of the items for each of the two crawls. The most important numbers are the sizes of the complete corpus in tokens, i.e. 1.25 billion words for the DND and 900 million for PND,
which makes the corpus almost as large as Gigafida [13], the largest corpus of Slovene to date.
3.2 Corpus format
The annotated corpus is stored in the so called vertical format, used by many concordancing engines. This is an XML-like format in that it has opening and closing or empty (structural) XML tags, but the tokens themselves are written one per line, with the first (tab separated) column giving the token (word or punctuation) itself, the second (in our case) its lemma (or, for punctuation, again the token), the third its MSD in English and the fourth the MSD in Slovene, as illustrated by Figure 1.
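A minimal sketch of a serialiser for this vertical format is given below. The structural `<s>` tag and the example MSD values are illustrative assumptions rather than the corpus' exact inventory; the column order (token, lemma, English MSD, Slovene MSD) follows the description above.

```python
# Serialise annotated tokens into the vertical format: one token per line,
# tab-separated columns (token, lemma, MSD in English, MSD in Slovene),
# wrapped in structural XML-like tags. Tag and MSD values are illustrative.

def to_vertical(sentence):
    """sentence: list of (token, lemma, msd_en, msd_sl) tuples."""
    lines = ["<s>"]
    for token, lemma, msd_en, msd_sl in sentence:
        lines.append(f"{token}\t{lemma}\t{msd_en}\t{msd_sl}")
    lines.append("</s>")
    return "\n".join(lines)
```

Since the tokens are not XML content but one-per-line records, concordancing engines can index the columns directly while still using the structural tags to delimit paragraphs and sentences.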
is said to be the smallest frequent closed granule G_min if the intension I of G is the smallest frequent closed itemset. The set FG_min of the smallest frequent closed granules is defined as:

FG_min = {G = <I, φ(I)> | I is a smallest frequent closed itemset}.

(19) For (∀G = <I, φ(I)> ∈ FG_min) do begin
(20)   Write h(I) to FCI;
(21) End
(22) Answer FCI;
Steps (1) to (18) of the algorithm extract the set of the smallest frequent closed granules, and steps (19) to (21) generate all frequent closed itemsets.
4.3 Example and analysis
Here, we first provide an example for the algorithm, and then analyse the pruning strategies used in the algorithm.
No. Operation
1 | FG = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{e},{3,4,5}>} (pruning {d} by property 3.1 and definition 3.3)
2 | F = {{a},{b},{c},{e}}
3 | a = {a} → S_a = {{a}}
4 | a = {b} → S_a = {{b},{ab}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>}; FC_min = {{a},{b}} (pruning {ab} by property 3.1 and definition 3.3)
5 | a = {c} → S_a = {{c},{ac},{bc}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{ac},{1,5}>}; FC_min = {{a},{b},{c},{ac}} (pruning {bc} by property 3.1 and definition 3.3)
6 | a = {e} → S_a = {{e},{ae},{be},{ce},{ace}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{ac},{1,5}>, <{e},{3,4,5}>, <{ae},{3,5}>, <{be},{3,4}>}; FC_min = {{a},{b},{c},{ac},{e},{ae},{be}} (pruning {ce}, {ace} by property 3.1 and definition 3.3). Note: the search ends here, having discovered all the smallest frequent closed granules FC_min.
7 | h({a}) = u1 ∩ u3 ∩ u5 = {a}; h({b}) = u2 ∩ u3 ∩ u4 = {b}; h({c}) = u1 ∩ u2 ∩ u5 = {c}; h({ac}) = u1 ∩ u5 = {ac}; h({e}) = u3 ∩ u4 ∩ u5 = {e}; h({ae}) = u3 ∩ u5 = {ae}; h({be}) = u3 ∩ u4 = {be}. Note: based on the smallest frequent closed granules set FC_min, all frequent closed itemsets are obtained.
8 | Answer FCI = {{a},{b},{c},{ac},{e},{ae},{be}}

Table 1: Frequent closed itemsets mining for minsupport = 40%.
For a formal context D = (U, A, R), where A = {a, b, c, d, e}, U = {u1, u2, u3, u4, u5}, u1 = {a,c,d}, u2 = {b,c}, u3 = {a,b,e}, u4 = {b,e}, u5 = {a,c,e}, and minsupport = 40%, the course of discovering frequent closed itemsets is described in Table 1.
For mining frequent closed itemsets, the algorithm adopts the following pruning strategies: property 3.1, definitions 3.3 and 3.4, and theorem 3.3. They help the algorithm efficiently reduce the search space for mining frequent closed itemsets.
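This is not the paper's EMFCI implementation (which enumerates only the smallest frequent closed granules); the following brute-force Python sketch of the underlying Galois connection is enough to reproduce the answer of Table 1 on the formal context above:

```python
from itertools import combinations

# Formal context from section 4.3: five objects over attributes a..e.
U = {1: set("acd"), 2: set("bc"), 3: set("abe"), 4: set("be"), 5: set("ace")}
A = "abcde"
mincount = 2  # minsupport = 40% of 5 objects

def extent(itemset):
    """phi(I): the objects that contain every attribute of the itemset."""
    return [u for u, items in U.items() if itemset <= items]

def closure(objects):
    """h(I): the intersection of the objects in the extent."""
    out = set(A)
    for u in objects:
        out &= U[u]
    return out

FCI = set()
for k in range(1, len(A) + 1):
    for combo in combinations(A, k):
        I = set(combo)
        objs = extent(I)
        # frequent and closed: the closure of the extent is the itemset itself
        if len(objs) >= mincount and closure(objs) == I:
            FCI.add(frozenset(I))

# FCI now equals {{a},{b},{c},{ac},{e},{ae},{be}}, matching Table 1.
```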
5 Performance and scalability study
In this section, we design the following experiments on different datasets:
First, we compare the performance of the algorithm EMFCI with A-close and CLOSET on six different datasets.
Second, we report the relationships between some parameters of the datasets and the performance of the algorithm EMFCI for mining frequent closed itemsets.
Finally, to address the bottleneck of the algorithm EMFCI, we improve it into the algorithm IEMFCI, and report its performance on the extended high-dimension dataset to show the scalability of the approach.
There are two original datasets as follows:
The first is the Food Mart 2000 retail dataset, which comes with SQL Server 2000. It contains 164558 records from 1998. Treating the items purchased by the same customer at the same time as one basket, we extract the items from these records. Because the supports of the bottom-level items are small, we generalize the bottom-level items to the product department. Finally, we obtain 34015 transactions with time-stamps. It is a dataset with Boolean attributes.
The second is a Web log dataset, real data that expresses the browsing behaviour of students, where the attribute set consists of login time, duration, network flow, ID type, and sex. This dataset with discrete quantitative attributes has 296031 transactions.
Now, we generalize attributes, and replicate some attributes or transactions to create the following extended datasets described in Table 2, where each dataset can be defined as a formal mining context D = (U, A, R).
All the experiments are performed on an Intel(R) Core(TM)2 Duo CPU (T6570 @ 2.10 GHz) PC with 1.99 GB main memory, running Microsoft Windows XP Professional. All the programs are written in C# with Microsoft Visual Studio 2008. The algorithms A-close and CLOSET are implemented as described in [3] and [6].
Name       Description                                                  ||P(A)||           ||U||
Dataset 1  The first original dataset                                   2^22 − 1           34015
Dataset 2  Replicating three attributes of dataset 1                    2^25 − 1           34015
Dataset 3  Replicating dataset 1 four times                             2^22 − 1           5×34015
Dataset 4  The second original dataset                                  5·4·4·14·3 − 1     296031
Dataset 5  Replicating one attribute of dataset 4                       5·4·4·14·3·5 − 1   296031
Dataset 6  Replicating dataset 4 one time                               5·4·4·14·3 − 1     2×296031
Dataset 7  For the Food Mart 2000 data, we regard the same customer at the same time as a basket and generalize the bottom-level items to the product subcategory   2^102 − 1   34015
Table 2: The datasets used in the experiments.
5.1 Performance comparison experiments
In this section, for discovering frequent closed itemsets on these different datasets, we compare the algorithm EMFCI with the algorithms A-close and CLOSET from two aspects: comparing their performance as the minimum support varies, and comparing them as the number of frequent closed itemsets varies.
1. Testing on the original datasets
For the two original datasets, we first compare the algorithm EMFCI with A-close and CLOSET while varying the minimum support and the number of frequent closed itemsets. The experimental results are shown in Figures 1, 2, 3, and 4, respectively.
Figure 1: Performance comparison with the support on dataset 1.
Figure 2: Performance comparison with the number of frequent closed itemsets on dataset 1.
Figure 3: Performance comparison with the support on dataset 4.
Figure 5: Performance comparison with the support on dataset 2.
Figure 6: Performance comparison with the number of frequent closed itemsets on dataset 2.
Figure 7: Performance comparison with the support on dataset 3.
Figure 8: Performance comparison with the number of frequent closed itemsets on dataset 3.
Figure 4: Performance comparison with the number of frequent closed itemsets on dataset 4.
Based on the comparison results in Figures 1, 2, 3, and 4, we see that the performance of the algorithm EMFCI is better than that of A-close and CLOSET.
Obviously, the algorithm CLOSET is also superior to A-close. Hence, we do not compare EMFCI with A-close in the following experiments.
2. Testing on the extended datasets
We further report the performance of the algorithm EMFCI on the extended datasets. Varying the minimum support and the number of frequent closed itemsets, we compare EMFCI with CLOSET; the experimental results are shown in Figures 5 to 12.
Figure 9: Performance comparison with the support on dataset 5.
Figure 10: Performance comparison with the number of frequent closed itemsets on dataset 5.
Figure 11: Performance comparison with the support on dataset 6.
Figure 12: Performance comparison with the number of frequent closed itemsets on dataset 6.
Based on the comparison results in Figures 5 to 12, we see that the performance of the algorithm EMFCI is also better than that of CLOSET on the datasets with Boolean or quantitative attributes.
5.2 The relationships between these parameters and performances
In this part, we mainly discuss the relationships between the performance and the following parameters:
||U||: the number of objects in the formal mining context D = (U, A, R); in other words, the number of transactions in the mining database.
||P(I)||: the number of nonempty subsets of the attribute values, called the search space of the algorithm, where I is the smallest frequent closed itemset over the attribute set A and P(I) is defined as the power set of I (refer to section 4.1).
Here, the performance is represented by two kinds of parameters:
t(x): the runtime of algorithm x, from input to output, for mining frequent closed itemsets.
ρ: the improved ratio of the runtime between the algorithms EMFCI and CLOSET, denoted by the following equation: ρ = 1 − t(EMFCI) / t(CLOSET).
1. The relationships between the performance and the search space
(1) Reporting the relationships on the extended dataset of the first original dataset
For the first original dataset, namely dataset 1, we test the trend of the performance as the search space increases on dataset 2, which is the extended dataset obtained by replicating three attributes of the first dataset. As the search space varies, the trend of the runtime of the algorithm EMFCI is shown in Figure 13, and the trend of the improved ratio between the algorithms EMFCI and CLOSET is shown in Figure 14.
Figure 13: The trend of the runtime on dataset 2.
Figure 14: The trend of the improved ratio on dataset 2.
Based on Figure 13, we see that the runtime grows as the search space increases. Based on Figure 14, we find that the improved ratio decreases as the search space increases.
(2) Reporting the relationships on the extended dataset of the second original dataset
For the second original dataset, namely dataset 4, we extend one attribute to get dataset 5, and test the trend of the performance on that dataset. The experimental results are shown in Figures 15 and 16, respectively.
Figure 16: The trend of the improved ratio on dataset 5.
According to Figures 15 and 16, we get similar comparison results as above. Hence, we can draw the following conclusions:
The runtime of the algorithm EMFCI grows as the search space increases; on the contrary, the improved ratio decreases. Namely, if the search space increases, the performance of the algorithm EMFCI becomes worse and worse. In other words, the algorithm is not suitable for mining datasets with too many smallest frequent closed itemsets.
2. The relationships among the performance, the search space and the number of objects
(1) Reporting the relationships on the first original dataset and its extended dataset
For the first original dataset (dataset 1) and its extended dataset (dataset 3, obtained by replicating its objects four times), we test the trend of the performance as the search space increases on the two datasets. As the search space varies, the trend of the runtime of the algorithm EMFCI is shown in Figure 17, and the trend of the improved ratio between the algorithms EMFCI and CLOSET is shown in Figure 18.
Figure 17: The trend of the runtime on dataset 1 and 3.
Figure 15: The trend of the runtime on dataset 5.
Figure 18: The trend of the improved ratio on dataset 1 and 3.
Based on Figure 17, we see that the runtime of the algorithm grows as the search space or the number of objects increases.
Based on Figure 18, we find that the improved ratio of the algorithm decreases as the search space increases, but it becomes relatively stable as the number of objects increases.
(2) Reporting the relationships on the second original dataset and its extended dataset
For the second original dataset, namely dataset 4, we replicate its objects one time to get dataset 6, and test the trend of the performance on datasets 4 and 6. The experimental results are shown in Figures 19 and 20, respectively.
Figure 19: The trend of the runtime on dataset 4 and 6
In this section, we adopt a partitioning method to avoid the bottleneck. In other words, the overlarge search space is divided into several smaller search spaces. The theoretical basis can be described as follows:
Let I = {a^1, a^2, …, a^n} ⊆ A, and let V_{a^i} be the set of values of attribute a^i. Then we have ||P(I)|| = ∏_{i=1}^{n} (||V_{a^i}|| + 1) − 1, namely
||P(I)|| + 1 = ∏_{i=1}^{n} (||V_{a^i}|| + 1) = (||V_{a^1}|| + 1) · (||V_{a^2}|| + 1) · … · (||V_{a^n}|| + 1).
Figure 20: The trend of the improved ratio on dataset 4 and 6
According to Figures 19 and 20, we draw the same conclusions as follows:
The runtime of the algorithm EMFCI grows as the search space or the number of objects increases; the improved ratio of the algorithm decreases as the search space increases, but it becomes relatively stable as the number of objects grows. Namely, the performance of the algorithm EMFCI becomes relatively stable as the number of objects increases. Hence, it is suitable for mining dynamic transaction datasets.
According to all these experimental results, we can draw the following conclusions:
(1) The performance of the algorithm EMFCI is better than that of the traditional typical algorithms for mining frequent closed itemsets on datasets with Boolean or quantitative attributes.
(2) The runtime of the algorithm EMFCI grows with the search space. If the search space is too large, its performance becomes worse and worse. This is the bottleneck of the algorithm.
(3) The runtime of EMFCI also grows as the number of objects increases.
(4) Compared with the algorithm CLOSET, the improved ratio decreases as the search space grows, but it becomes relatively stable as the number of objects increases. Namely, the performance of EMFCI becomes relatively stable as the number of objects increases, so it is suitable for mining dynamic transaction datasets.
5.3 A further discussion for solving the bottleneck of the algorithm
Based on the conclusions in section 5.2, for the formal mining context D = (U, A, R), if the search space ||P(I)|| is overlarge, where I (I ⊆ A) is the smallest frequent closed itemset and P(I) is defined as the power set of I, the performance of EMFCI becomes worse and worse.
Obviously, we also have ||P(I)|| + 1 = (||P(I_1)|| + 1) · (||P(I_2)|| + 1) · … · (||P(I_k)|| + 1), where I_1 = {a^1, a^2, …, a^{m_1}}, I_2 = {a^{m_1+1}, a^{m_1+2}, …, a^{m_1+m_2}}, …, I_k = {a^{n−m_k+1}, …, a^n} (m_1 + m_2 + … + m_k = n).
In this paper, we let ||P(I_j)|| ≤ λ. If λ is too big, the method has the same bottleneck; if λ is too small, the cost of partitioning the search space is expensive. For these two cases, their performances are shown in Figure 23.
The partitioning method is embedded in the algorithm EMFCI; the resulting algorithm is called improved EMFCI, i.e., IEMFCI.
5.3.1 Example
For the example in section 4.3, we use the algorithm IEMFCI to discover frequent closed itemsets; the course is described as follows, where λ = 4.
(Note: λ = 4 is used in the example; λ = 2^n is used in the following experiments.)
Step 1. FG = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{e},{3,4,5}>}.
Step 2. F = {{a},{b},{c},{e}}, ||P(F)|| = 15 > λ = 4.
Step 3. Partition the search space into two search spaces F_1 = {{a},{b}} and F_2 = {{c},{e}}, where ||P(F_i)|| ≤ 4.
Step 4. For the first search space F_1 = {{a},{b}}, we have
① α = {a} → S_α = {{a}}; FG^1_min = {<{a},{1,3,5}>}; FC^1_min = {{a}};
② α = {b} → S_α = {{b},{ab}}; FG^1_min = {<{a},{1,3,5}>, <{b},{2,3,4}>}; FC^1_min = {{a},{b}}.
For the second search space F_2 = {{c},{e}}, we have
① α = {c} → S_α = {{c}}; FG^2_min = {<{c},{1,2,5}>}; FC^2_min = {{c}};
② α = {e} → S_α = {{e},{ce}}; FG^2_min = {<{c},{1,2,5}>, <{e},{3,4,5}>}; FC^2_min = {{c},{e}}.
Step 5. F = {FC^1_min, FC^2_min}; repeating step 2, ||P(F)|| = 15 > 4, but ||F|| = 2, so the partitioning operation ends; otherwise, the algorithm would continue to partition the search space.
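The partitioning step can be sketched in Python for the Boolean-attribute case; this is a hypothetical illustration of the splitting rule, not the paper's IEMFCI code, and the block-size formula assumes each attribute is binary:

```python
import math

def partition_search_space(frequent_items, lam):
    """Split the frequent 1-itemsets F into blocks F_i whose search space
    ||P(F_i)|| = 2^|F_i| - 1 does not exceed the partitioning parameter lam
    (Boolean attributes assumed)."""
    # largest block size b with 2^b - 1 <= lam
    block = max(1, int(math.log2(lam + 1)))
    return [frequent_items[i:i + block]
            for i in range(0, len(frequent_items), block)]

# The example above: F = {{a},{b},{c},{e}} and lambda = 4
parts = partition_search_space(["a", "b", "c", "e"], 4)
# parts == [["a", "b"], ["c", "e"]], i.e. F_1 and F_2 from Step 3
```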
Items     Time complexity        Space complexity
A-close   O(…)                   O(||C|| · ||A||)
CLOSET    O(…)                   O(||C||)
IEMFCI    O((L/k + 1) · ||C||)   O(||C||/k · ||A||)
Figure 21: Performance comparison with the lower support on dataset 7.
Combining the two partitions, S_α = {{c},{ac},{bc},{e},{ae},{be}}; FG_min = {<{a},{1,3,5}>, <{b},{2,3,4}>, <{c},{1,2,5}>, <{ac},{1,5}>, <{e},{3,4,5}>, <{ae},{3,5}>, <{be},{3,4}>}; FC_min = {{a},{b},{c},{ac},{e},{ae},{be}}. The rest of the steps are the same as in the example in section 4.3. The algorithm IEMFCI avoids checking the itemset {ace}, but adds the task of partitioning. As the number of transactions here is small, the example does not show its advantage; see the experiments in section 5.3.3. Here, the example only describes the execution course of IEMFCI.
5.3.2 Comparisons of the time and space complexity
For D = (U, A, R), let C be the set of frequent closed itemsets, let L be the average length of the frequent closed itemsets, and let k (k ≥ 2) be the parameter for partitioning the search space. The comparisons are shown in Table 3.
Figure 22: Performance comparison with the higher support on dataset 7.
Then, for the improved algorithm IEMFCI, we adopt different parameters λ to test the trend of its performance. The comparison result is shown in Figure 23, where IEMFCI (λ = p(2,n)) denotes the improved algorithm IEMFCI whose partitioning parameter is λ = p(2,n) = 2^n.
Table 3: Comparisons of the time and space complexity.
5.3.3 Test on the high dimension datasets
In this section, to show the scalability of the algorithm EMFCI, we first compare the improved algorithm IEMFCI with EMFCI, A-close and CLOSET on the high-dimension dataset (dataset 7 in Table 2), which is an extended dataset based on the first original dataset. The comparison results are shown in Figures 21 and 22, where the parameter p(2,n) = 2^n on the abscissa shows the search space ||P(I)|| for the given support.
Figure 23: The trend of performance with the different parameter on dataset 7.
Based on these comparisons, we draw the following conclusions:
First, the improved algorithm IEMFCI is better than the algorithms EMFCI, A-close and CLOSET.
Second, the improved algorithm IEMFCI gets rid of the bottleneck of the algorithm EMFCI; especially when the search space ||P(I)|| is overlarge, the advantage of IEMFCI is very distinct.
Finally, for the improved algorithm IEMFCI, the parameter for partitioning the search space should be neither too big nor too small.
6 Conclusion
In this paper, to address the shortcomings of typical algorithms for mining frequent closed itemsets, we propose an efficient algorithm based on Galois connections and granular computing. We present the notion of the smallest frequent closed granule to reduce the I/O cost of discovering frequent closed itemsets, and we propose a connection function for generating the smallest frequent closed itemsets in an enlarged frequent 1-item manner to reduce the CPU cost and the occupied main memory. However, when the number of the smallest frequent closed itemsets is too large, the performance of the algorithm degrades, so we further discuss how to remove this bottleneck, namely, we propose an improved algorithm for high-dimension datasets. The algorithm is also suitable for mining dynamic transaction datasets.
Acknowledgement
The authors would like to thank the anonymous reviewers for their constructive comments. This work was supported by the Chongqing Cutting-edge and Applied Foundation Research Program (Grant No. cstc2014jcyjA40035) and by the Scientific and Technological Research Program of Chongqing Three Gorges University (Grant No. 13ZD20).
References
[1] R. Agrawal, T. Imielinski, and A. Swami (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD Int'l Conference on Management of Data, Washington DC, USA, pp. 207-216.
[2] R. Agrawal and R. Srikant (1994). Fast algorithms for mining association rules. In Proceedings of the 20th Int'l Conference on Very Large Data Bases, Santiago, Chile, pp. 487-499.
[3] N. Pasquier, Y. Bastide and R. Taouil et al. (1999). Discovering frequent closed itemsets for association rules. In Proceedings of the 7th Int'l Conference on Database Theory, Jerusalem, Israel, January, pp. 398-416.
[4] Mohammed J. Zaki, Ching-Jui Hsiao (1999). Charm: An efficient algorithm for closed association rule mining. Technical Report 99-10, Computer Science, Rensselaer Polytechnic Institute.
[5] J. Han, J. Pei, and Y. Yin (2000). Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD Int'l Conference on Management of Data, New York, USA, pp. 1-12.
[6] J. Pei, J. Han, and R. Mao (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proceedings of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Dallas, Texas, USA, pp. 21-30.
[7] D. Burdick, M. Calimlim, and J. Gehrke (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th Int'l Conference on Data Engineering, Heidelberg, pp. 443-452.
[8] J. Y. Wang, J. Han, and J. Pei (2003). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings of the 9th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Washington, DC, pp. 236-245.
[9] C. Lucchese, S. Orlando, and R. Perego (2006). Fast and memory efficient mining of frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 21-36.
[10] R. Singh, T. Johnsten, and V. Raghavan et al. (2010). Efficient algorithm for discovering potential interesting patterns with closed itemsets. In Proceedings of the 2010 IEEE Int'l Conference on Granular Computing, pp. 414-419.
[11] F. Nori, M. Deypir, and M. Hadi et al. (2011). A new sliding window based algorithm for frequent closed itemset mining over data streams. In Proceedings of the 1st Int'l Conference on Computer and Knowledge Engineering, IEEE Press, pp. 249-253.
[12] Guang-Peng Chen, Yu-Bin Yang, and Yao Zhang (2012). MapReduce-based balanced mining for closed frequent itemset. In Proceedings of the 2012 IEEE 19th Int'l Conference on Web Services, IEEE Press, pp. 652-653.
[13] M. Sreedevi, Reddy L.S.S. (2013). Mining regular closed patterns in transactional databases. In Proceedings of the 2013 7th Int'l Conference on Intelligent Systems and Control, IEEE Press, pp. 380-383.
[14] Yu-quan Z. and Yu-qing S. (2007). Research on an Algorithm for Mining Frequent Closed Itemsets. Journal of Computer Research and Development, vol. 44, no. 7, pp. 1177-1183.
[15] Shengwei L., Lingsheng L., and Chong H. (2009). Mining closed frequent itemset based on FP-Tree. In Proceedings of the IEEE Int'l Conference on Granular Computing, IEEE Press, pp. 354 - 357.
[16] Wachiramethin J., Werapun J. (2009). BPA: A Bitmap-Prefix-tree Array data structure for frequent closed pattern mining. In Proceedings of the 2009 Int'l Conference on Machine Learning and Cybernetics, IEEE Press, vol. 1, pp. 154-160.
[17] Z. Pawlak (1982). Rough sets. International Journal of Computer and Information Sciences, vol. 11, pp. 341-356.
[18] R. Wille (1982). Restructuring lattice theory: an approach based on hierarchies of concepts. In: I. Rival (Ed.), Ordered Sets, Reidel, Dordrecht-Boston, pp. 445-470.
[19] B. Ganter, R. Wille (1999). Formal Concept Analysis, Mathematic Foundations. Springer, Berlin.
[20] J. Poelmans, D. I. Ignatov, and S. O. Kuznetsov et al (2013). Formal concept analysis in knowledge processing: A survey on applications. Expert Systems with Applications, vol.40, no. 16, pp. 6538-6560.
[21] M. W. Shao, Y. Leung (2014). Relations between granular reduct and dominance reduct in formal contexts. Knowledge-Based Systems, vol. 65, pp. 1-11.
[22] Wei-Zhi W., Yee Leung, Ju-Sheng M. (2009). Granular Computing and Knowledge Reduction in Formal Contexts. IEEE Transactions on Knowledge and Data Engineering, vol.21, no.10, pp. 1461-1474.
[23] R. Belohlavek, B. D. Baets, J. Konecny (2014). Granularity of attributes in formal concept analysis.
Information Sciences, vol.260, pp.149-170.
[24] Hobbs J. R. (1985). Granularity. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, San Francisco, USA, pp. 432-435.
[25] Giunchiglia F., Walsh T. (1992). A theory of abstraction. Artificial Intelligence, vol. 57, no. 2-3, pp. 323-389.
[26] Yao Y.Y. (2004). A partition model of granular computing. Lecture Notes in Computer Science Transactions on Rough Sets, vol. 3100, pp.232-253.
[27] T. R. Qiu, X. Q. Chen, and Q. Liu et al. (2010). Granular computing approach to finding association rules in relational database. International Journal of Intelligent Systems, no. 25, pp. 165-179.
[28] G. Fang, Y. Wu (2013). Frequent Spatiotemporal Association Patterns Mining Based on Granular Computing. Informatica, vol.37, no.4, pp.443-453.
[29] Pawlak Z. (1998). Granularity of knowledge, indiscernibility and rough sets. In Proceedings of the IEEE Int'l Conference on Fuzzy Systems, IEEE Press, Anchorage, AK, pp. 106-110.
[30] Zhang L., Zhang B. (2003). The quotient space theory of problem solving. Lecture Notes in Computer Science, vol. 2639, pp. 11-15.
An Approach for Context-based Reasoning in Ambient Intelligence
Hristijan Gjoreski
Department of Intelligent Systems, Jožef Stefan Institute Jamova 39, Ljubljana, Slovenia
Thesis Summary
Keywords: machine learning, context, reasoning, artificial intelligence, ambient intelligence, sensors Received: February 2, 2015
This paper presents a summary of the doctoral dissertation of the author, which addresses the task of context-based reasoning in ambient intelligence.
Povzetek: Prispevek predstavlja povzetek doktorske disertacije avtorja, ki obravnava kontekstno sklepanje v ambientalni inteligenci.
1 Introduction
Ambient intelligence (AmI) is a scientific field that refers to environments consisting of smart devices (sensors and actuators) that can sense and respond to the presence of people [1]. The availability of small, wearable, low-cost, power-efficient sensors, combined with advanced signal processing and information extraction, is driving a revolution in the AmI domain. This revolution has enabled novel approaches and technologies for accurate measurements in the areas of healthcare, enhanced sports and fitness training, and life-style monitoring.
Early AmI systems included a single type of sensor, which made it possible to develop the first proof-of-concept applications. As the field has matured, these systems have gained additional sensors, resulting in the development of advanced and more accurate multi-sensor techniques and applications. However, combining multiple sources of information from multiple sensors is a challenging task. The first issue is that each sensor has its own technical configuration (for example, the data sampling rate) and requires different data-processing techniques, in order to first align the different sensor data and later to extract useful information. The second issue is that even if the multi-source data is aligned, it can be challenging to find an intelligent way to combine this multi-source information in order to reason about the user or the environment. While several approaches for combining multiple sources of information and knowledge have been developed (such as Kalman filters, ensemble learning, and co-training), these approaches have not been specialized for AmI tasks.
The doctoral dissertation [2] addresses the problem of combining multiple sources of information extracted from sensor data by proposing a novel context-based approach called CoReAmI (Context-based Reasoning in Ambient Intelligence). In particular, CoReAmI creates a multi-view perspective, in which each source of information is used as a context separately.
2 The CoReAmI approach
The CoReAmI approach is shown in Figure 1. At the top are the sensors {s1, ..., sm}, which provide the raw data. The data from the multiple sensors are usually represented by multivariate time series with mixed sampling rates, which are the input to CoReAmI. CoReAmI consists of three phases: (A) context extraction, (B) context modeling and (C) context aggregation.
Figure 1. The CoReAmI Approach.
In the first phase (A), the raw sensor data are acquired and the multiple contexts {c1, ..., cn} are extracted using different types of techniques: data preprocessing, data synchronization, data segmentation, etc. In CoReAmI, a context represents information about the user which is extracted from the sensor data, e.g., the user's activity extracted from wearable accelerometer data. This phase is similar to the feature-extraction phase in machine learning (ML); the contexts in CoReAmI are features that represent context information. Therefore, each context has values (vc), which can be numerical or categorical (e.g., "sitting" for the "activity" context).
The second phase (B) contains the main logic of the CoReAmI approach. In this phase the context modeling of the problem (activity, fall, energy expenditure, etc.) is performed using the contexts defined in the previous phase. First, context-based partitioning of the dataset is performed, i.e., the dataset is partitioned according to each context and its values. Then, for each context value a reasoning model is constructed using its reasoning data, which is the subset of the whole dataset that has that particular context value. For example, the reasoning model for "sitting" will be constructed using the data instances that contain the value "sitting" for the activity context. This way, the approach considers multiple views on the data, using each of the features as a context.
In the final phase (C), for a given test data instance, the decisions from each individual context are aggregated and the final decision is provided. In this phase different aggregation techniques can be used, e.g., majority voting, plurality voting, averaging, choosing the median, and similar.
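The partition-then-aggregate idea of phases B and C can be sketched as follows. This is a hypothetical minimal illustration, not the dissertation's implementation: the per-context "reasoning model" is reduced to a majority-class lookup, and the toy activity/location data are invented for the example:

```python
from collections import Counter, defaultdict

def train_coreami(instances, labels, contexts):
    """Phase B sketch: for every context (feature index) and every value it
    takes, build a reasoning model from the subset of instances with that
    value -- here simply the majority label of the subset."""
    models = defaultdict(dict)
    for c in contexts:
        by_value = defaultdict(list)
        for x, y in zip(instances, labels):
            by_value[x[c]].append(y)
        for v, ys in by_value.items():
            models[c][v] = Counter(ys).most_common(1)[0][0]
    return models

def predict_coreami(models, x):
    """Phase C sketch: aggregate the per-context decisions by majority vote."""
    votes = [m[x[c]] for c, m in models.items() if x[c] in m]
    return Counter(votes).most_common(1)[0][0]

# Invented toy data: (activity, location) contexts, label = 'fall'/'no-fall'.
X = [("lying", "floor"), ("lying", "bed"), ("sitting", "chair"),
     ("lying", "floor"), ("sitting", "floor")]
y = ["fall", "no-fall", "no-fall", "fall", "no-fall"]
models = train_coreami(X, y, contexts=[0, 1])
```

Each context votes independently, so an instance like ("lying", "floor") is classified from the "lying" view and the "floor" view separately before the votes are combined.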
CoReAmI is a general approach for context-based reasoning and can be adapted to a range of tasks in AmI by adapting each of the phases to the particular task.
3 Case studies
The feasibility of the CoReAmI approach was shown in three AmI domains that have emerged as essential building blocks in AmI: activity recognition, energy-expenditure estimation, and fall detection.
The first problem domain, activity recognition (AR), can generally be defined as the process of recognizing activities through the analysis of sensor data. In recent years AR has gained a lot of research attention, because it provides some of the most basic information about a person monitored by an AmI system. In our study, we analyzed the state-of-the-art approaches in AR and observed that it is almost impossible to distinguish standing from sitting using a single accelerometer placed on the torso. However, by adapting and applying the CoReAmI approach, we managed to significantly improve the recognition of these two activities, achieving 86% accuracy, which is 24 percentage points better than a conventional ML approach [4].
Human energy-expenditure (EE) estimation is the process of calculating the amount of energy expended while performing everyday activities. It directly reflects the level of physical activity, which makes it important for sports training, weight control, management of metabolic disorders (e.g., diabetes), and other health goals. We adapted and applied the CoReAmI approach to estimate human energy expenditure using multiple sensor data (accelerometer, heart rate, breath rate, etc.). CoReAmI significantly improved the estimation performance compared to conventional ML approaches and to approaches based on a single context (such as the activity of the user). Additionally, CoReAmI provided better energy-expenditure estimates than the BodyMedia device, a state-of-the-art commercial device for energy-expenditure estimation [5].
The third problem domain to which we applied the CoReAmI approach is fall detection (FD). FD is an important application in AmI because falls are among the most critical health problems for the elderly. We adapted and applied the CoReAmI approach to detect human falls. CoReAmI significantly improved the detection performance compared to conventional approaches such as threshold-based approaches and approaches based only on ML [6].
4 Conclusion
This paper summarized the dissertation [2] and presented its main idea and findings. The proposed CoReAmI approach was adapted and tested on three problem domains. The results show that CoReAmI significantly outperforms the competing approaches in each of the domains. This is mainly because extracting multiple sources of information and combining them by using each source as a context creates a multi-view perspective, which leads to better performance than conventional approaches.
References
[1] Ducatel K, Bogdanowicz M, Scapolo F, Leijten J, Burgelman J. Scenarios for ambient intelligence in 2010. Technical report. Retrieved September, 2014 from: http://cordis.europa.eu/ist/istagreports.htm
[2] Gjoreski H. Context-based Reasoning in Ambient Intelligence, PhD Thesis, IPS Jožef Stefan, Ljubljana, Slovenia, January, 2015.
[3] Friedman SM, Munoz B, West SK, Rubin GS, Fried LP. Falls and Fear of Falling: Which Comes First? A Longitudinal Prediction Model Suggests Strategies for Primary and Secondary Prevention. Journal of the American Geriatrics Society, 2002; 1329-1335.
[4] Gjoreski H, Kozina S, Luštrek M, Gams M. Using multiple contexts to distinguish standing from sitting with a single accelerometer. European Conference on Artificial Intelligence (ECAI), 2014.
[5] Gjoreski H, Kaluža B, Gams M, Milic R, Luštrek M. Ensembles of multiple sensors for human energy expenditure estimation. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing (UbiComp), 2013; 359-362.
[6] Gjoreski H, Gams M, Luštrek M. Context-based fall detection and activity recognition using inertial and location sensors. Journal of Ambient Intelligence and Smart Environments (JAISE), 2014, 6(4); 419-433.
JOŽEF STEFAN INSTITUTE
Jožef Stefan (1835-1893) was one of the most prominent physicists of the 19th century. Born to Slovene parents, he obtained his Ph.D. at Vienna University, where he was later Director of the Physics Institute, Vice-President of the Vienna Academy of Sciences and a member of several scientific institutions in Europe. Stefan explored many areas in hydrodynamics, optics, acoustics, electricity, magnetism and the kinetic theory of gases. Among other things, he originated the law that the total radiation from a black body is proportional to the 4th power of its absolute temperature, known as the Stefan-Boltzmann law.
The Jožef Stefan Institute (JSI) is the leading independent scientific research institution in Slovenia, covering a broad spectrum of fundamental and applied research in the fields of physics, chemistry and biochemistry, electronics and information science, nuclear science and technology, energy research and environmental science.
The Jožef Stefan Institute (JSI) is a research organisation for pure and applied research in the natural sciences and technology. Both are closely interconnected in research departments composed of different task teams. Emphasis in basic research is given to the development and education of young scientists, while applied research and development serve for the transfer of advanced knowledge, contributing to the development of the national economy and society in general.
At present the Institute, with a total of about 900 staff, has 700 researchers, about 250 of whom are postgraduates, around 500 of whom have doctorates (Ph.D.), and around 200 of whom have permanent professorships or temporary teaching assignments at the Universities.
In view of its activities and status, the JSI plays the role of a national institute, complementing the role of the universities and bridging the gap between basic science and applications.
Research at the JSI includes the following major fields: physics; chemistry; electronics, informatics and computer sciences; biochemistry; ecology; reactor technology; applied mathematics. Most of the activities are more or less closely connected to information sciences, in particular computer sciences, artificial intelligence, language and speech technologies, computer-aided design, computer architectures, biocybernetics and robotics, computer automation and control, professional electronics, digital communications and networks, and applied mathematics.
The Institute is located in Ljubljana, the capital of the independent state of Slovenia. The capital today is considered a crossroads between East, West and Mediterranean Europe, offering excellent productive capabilities and solid business opportunities, with strong international connections. Ljubljana is connected to important centers such as Prague, Budapest, Vienna, Zagreb, Milan, Rome, Monaco, Nice, Bern and Munich, all within a radius of 600 km.
From the Jožef Stefan Institute, the Technology park "Ljubljana" has been proposed as part of the national strategy for technological development to foster synergies between research and industry, to promote joint ventures between university bodies, research institutes and innovative industry, to act as an incubator for high-tech initiatives and to accelerate the development cycle of innovative products.
Part of the Institute was reorganized into several high-tech units supported by and connected within the Technology park at the Jožef Stefan Institute, established as the beginning of a regional Technology park "Ljubljana". The project was developed at a particularly historic moment, characterized by the process of state reorganisation, privatisation and private initiative. The national Technology Park is a shareholding company hosting an independent venture-capital institution.
The promoters and operational entities of the project are the Republic of Slovenia, Ministry of Higher Education, Science and Technology and the Jožef Stefan Institute. The framework of the operation also includes the University of Ljubljana, the National Institute of Chemistry, the Institute for Electronics and Vacuum Technology and the Institute for Materials and Construction Research among others. In addition, the project is supported by the Ministry of the Economy, the National Chamber of Economy and the City of Ljubljana.
Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Tel.: +386 1 4773 900, Fax: +386 1 251 93 85
WWW: http://www.ijs.si
E-mail: matjaz.gams@ijs.si
Public relations: Polona Strnad
INFORMATICA
AN INTERNATIONAL JOURNAL OF COMPUTING AND INFORMATICS
INVITATION, COOPERATION
Submissions and Refereeing
Please submit a manuscript to: http://www.informatica.si/Editors/PaperUpload.asp. At least two referees outside the author's country will examine it, and they are invited to make as many remarks as possible, from typing errors to global philosophical disagreements. The chosen editor will send the author the obtained reviews. If the paper is accepted, the editor will also send an email to the managing editor. The executive board will inform the author that the paper has been accepted, and the author will send the paper to the managing editor. The paper will be published within one year of receipt of email with the text in Informatica MS Word format or Informatica LaTeX format and figures in .eps format. Style and examples of papers can be obtained from http://www.informatica.si. Opinions, news, calls for conferences, calls for papers, etc. should be sent directly to the managing editor.
QUESTIONNAIRE
Send Informatica free of charge
Yes, we subscribe
Please, complete the order form and send it to Dr. Drago Torkar, Informatica, Institut Jožef Stefan, Jamova 39, 1000 Ljubljana, Slovenia. E-mail: drago.torkar@ijs.si
Since 1977, Informatica has been a major Slovenian scientific journal of computing and informatics, including telecommunications, automation and other related areas. In its 16th year (more than twenty-one years ago) it became truly international, although it still remains connected to Central Europe. The basic aim of Informatica is to impose intellectual values (science, engineering) in a distributed organisation.
Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations.
Editing and refereeing are distributed. Each editor can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the Refereeing Board.
Informatica is free of charge for major scientific, educational and governmental institutions. Others should subscribe (see the last page of Informatica).
ORDER FORM - INFORMATICA
Name: ...............................
Title and Profession (optional): .........
Home Address and Telephone (optional):
Office Address and Telephone (optional):
E-mail Address (optional): .............
Signature and Date: ...................
Informatica WWW: http://www.informatica.si/
Referees from 2008 on:
A. Abraham, S. Abraham, R. Accornero, A. Adhikari, R. Ahmad, G. Alvarez, N. Anciaux, R. Arora, I. Awan, J. Azimi, C. Badica, Z. Balogh, S. Banerjee, G. Barbier, A. Baruzzo, B. Batagelj, T. Beaubouef, N. Beaulieu, M. ter Beek, P. Bellavista, K. Bilal, S. Bishop, J. Bodlaj, M. Bohanec, D. Bolme, Z. Bonikowski, B. Boškovic, M. Botta, P. Brazdil, J. Brest, J. Brichau, A. Brodnik, D. Brown, I. Bruha, M. Bruynooghe, W. Buntine, D.D. Burdescu, J. Buys, X. Cai, Y. Cai, J.C. Cano, T. Cao, J.-V. Capella-Hernändez, N. Carver, M. Cavazza, R. Ceylan, A. Chebotko, I. Chekalov, J. Chen, L.-M. Cheng, G. Chiola, Y.-C. Chiou, I. Chorbev, S.R. Choudhary, S.S.M. Chow, K.R. Chowdhury, V. Christlein, W. Chu, L. Chung, M. Ciglaric, J.-N. Colin, V. Cortellessa, J. Cui, P. Cui, Z. Cui, D. Cutting, A. Cuzzocrea, V. Cvjetkovic, J. Cypryjanski, L. CCehovin, D. CCerepnalkoski, I. CCosic, G. Daniele, G. Danoy, M. Dash, S. Datt, A. Datta, M.-Y. Day, F. Debili, C.J. Debono, J. Dedic, P. Degano, A. Dekdouk, H. Demirel, B. Demoen, S. Dendamrongvit, T. Deng, A. Derezinska, J. Dezert, G. Dias, I. Dimitrovski, S. Dobrišek, Q. Dou, J. Doumen, E. Dovgan, B. Dragovich, D. Drajic, O. Drbohlav, M. Drole, J. Dujmovic, O. Ebers, J. Eder, S. Elaluf-Calderwood, E. Engström, U. riza Erturk, A. Farago, C. Fei, L. Feng, Y.X. Feng, B. Filipic, I. Fister, I. Fister Jr., D. Fišer, A. Flores, V.A. Fomichov, S. Forli, A. Freitas, J. Fridrich, S. Friedman, C. Fu, X. Fu, T. Fujimoto, G. Fung, S. Gabrielli, D. Galindo, A. Gambarara, M. Gams, M. Ganzha, J. Garbajosa, R. Gennari, G. Georgeson, N. Gligoric, S. Goel, G.H. Gonnet, D.S. Goodsell, S. Gordillo, J. Gore, M. Grcar, M. Grgurovic, D. Grosse, Z.-H. Guan, D. Gubiani, M. Guid, C. Guo, B. Gupta, M. Gusev, M. Hahsler, Z. Haiping, A. Hameed, C. Hamzagebi, Q.-L. Han, H. Hanping, T. Härder, J.N. Hatzopoulos, S. Hazelhurst, K. Hempstalk, J.M.G. Hidalgo, J. Hodgson, M. Holbl, M.P. Hong, G. Howells, M. Hu, J. Hyvärinen, D. Ienco, B. Ionescu, R. Irfan, N. Jaisankar, D. 
Jakobovic, K. Jassem, I. Jawhar, Y. Jia, T. Jin, I. Jureta, D. Juricic, S. K, S. Kalajdziski, Y. Kalantidis, B. Kaluža, D. Kanellopoulos, R. Kapoor, D. Karapetyan, A. Kassler, D.S. Katz, A. Kaveh, S.U. Khan, M. Khattak, V. Khomenko, E.S. Khorasani, I. Kitanovski, D. Kocev, J. Kocijan, J. Kollär, A. Kontostathis, P. Korošec, A. Koschmider, D. Košir, J. Kovac, A. Krajnc, M. Krevs, J. Krogstie, P. Krsek, M. Kubat, M. Kukar, A. Kulis, A.P.S. Kumar, H. Kwasnicka, W.K. Lai, C.-S. Laih, K.-Y. Lam, N. Landwehr, J. Lanir, A. Lavrov, M. Layouni, G. Leban, A. Lee, Y.-C. Lee, U. Legat, A. Leonardis, G. Li, G.-Z. Li, J. Li, X. Li, X. Li, Y. Li, Y. Li, S. Lian, L. Liao, C. Lim, J.-C. Lin, H. Liu, J. Liu, P. Liu, X. Liu, X. Liu, F. Logist, S. Loskovska, H. Lu, Z. Lu, X. Luo, M. Luštrek, I.V. Lyustig, S.A. Madani, M. Mahoney, S.U.R. Malik, Y. Marinakis, D. Marincic, J. Marques-Silva, A. Martin, D. Marwede, M. Matijaševic, T. Matsui, L. McMillan, A. McPherson, A. McPherson, Z. Meng, M.C. Mihaescu, V. Milea, N. Min-Allah, E. Minisci, V. Mišic, A.-H. Mogos, P. Mohapatra, D.D. Monica, A. Montanari, A. Moroni, J. Mosegaard, M. Moškon, L. de M. Mourelle, H. Moustafa, M. Možina, M. Mrak, Y. Mu, J. Mula, D. Nagamalai, M. Di Natale, A. Navarra, P. Navrat, N. Nedjah, R. Nejabati, W. Ng, Z. Ni, E.S. Nielsen, O. Nouali, F. Novak, B. Novikov, P. Nurmi, D. Obrul, B. Oliboni, X. Pan, M. Pancur, W. Pang, G. Papa, M. Paprzycki, M. Paralic, B.-K. Park, P. Patel, T.B. Pedersen, Z. Peng, R.G. Pensa, J. Perš, D. Petcu, B. Petelin, M. Petkovšek, D. Pevec, M. Piculin, R. Piltaver, E. Pirogova, V. Podpecan, M. Polo, V. Pomponiu, E. Popescu, D. Poshyvanyk, B. Potocnik, R.J. Povinelli, S.R.M. Prasanna, K. Pripužic, G. Puppis, H. Qian, Y. Qian, L. Qiao, C. Qin, J. Que, J.-J. Quisquater, C. Rafe, S. Rahimi, V. Rajkovic, D. Rakovic, J. Ramaekers, J. Ramon, R. Ravnik, Y. Reddy, W. Reimche, H. Rezankova, D. Rispoli, B. Ristevski, B. Robic, J.A. Rodriguez-Aguilar, P. Rohatgi, W. Rossak, I. Rožanc, J. 
Rupnik, S.B. Sadkhan, K. Saeed, M. Saeki, K.S.M. Sahari, C. Sakharwade, E. Sakkopoulos, P. Sala, M.H. Samadzadeh, J.S. Sandhu, P. Scaglioso, V. Schau, W. Schempp, J. Seberry, A. Senanayake, M. Senobari, T.C. Seong, S. Shamala, c. shi, Z. Shi, L. Shiguo, N. Shilov, Z.-E.H. Slimane, F. Smith, H. Sneed, P. Sokolowski, T. Song, A. Soppera, A. Sorniotti, M. Stajdohar, L. Stanescu, D. Strnad, X. Sun, L. Šajn, R. Šenkerfk, M.R. Šikonja, J. Šilc, I. Škrjanc, T. Štajner, B. Šter, V. Štruc, H. Takizawa, C. Talcott, N. Tomasev, D. Torkar, S. Torrente, M. Trampuš, C. Tranoris, K. Trojacanec, M. Tschierschke, F. De Turck, J. Twycross, N. Tziritas, W. Vanhoof, P. Vateekul, L.A. Vese, A. Visconti, B. Vlaovic, V. Vojisavljevic, M. Vozalis, P. Vracar, V. Vranic, C.-H. Wang, H. Wang, H. Wang, H. Wang, S. Wang, X.-F. Wang, X. Wang, Y. Wang, A. Wasilewska, S. Wenzel, V. Wickramasinghe, J. Wong, S. Wrobel, K. Wrona, B. Wu, L. Xiang, Y. Xiang, D. Xiao, F. Xie, L. Xie, Z. Xing, H. Yang, X. Yang, N.Y. Yen, C. Yong-Sheng, J.J. You, G. Yu, X. Zabulis, A. Zainal, A. Zamuda, M. Zand, Z. Zhang, Z. Zhao, D. Zheng, J. Zheng, X. Zheng, Z.-H. Zhou, F. Zhuang, A. Zimmermann, M.J. Zuo, B. Zupan, M. Zuqiang, B. Žalik, J. Žižka,
Informatica
An International Journal of Computing and Informatics
Web edition of Informatica may be accessed at: http://www.informatica.si.
Subscription Information Informatica (ISSN 0350-5596) is published four times a year, in Spring, Summer, Autumn, and Winter, by the Slovene Society Informatika, Litostrojska cesta 54, 1000 Ljubljana, Slovenia.
The subscription rate for 2015 (Volume 39) is
- 60 EUR for institutions,
- 30 EUR for individuals, and
- 15 EUR for students
Claims for missing issues will be honored free of charge within six months after the publication date of the issue.
Typesetting: Borut Žnidar.
Printing: ABO grafika d.o.o., Ob železnici 16, 1000 Ljubljana.
Orders may be placed by email (drago.torkar@ijs.si), telephone (+386 1 477 3900) or fax (+386 1 251 93 85). The payment should be made to our bank account no.: 02083-0013014662 at NLB d.d., 1520 Ljubljana, Trg republike 2, Slovenija, IBAN no.: SI56020830013014662, SWIFT Code: LJBASI2X.
Informatica is published by Slovene Society Informatika (president Niko Schlamberger) in cooperation with the
following societies (and contact persons):
Robotics Society of Slovenia (Jadran Lenarčič)
Slovene Society for Pattern Recognition (Janez Perš)
Slovenian Artificial Intelligence Society (Dunja Mladenić)
Cognitive Science Society (Urban Kordeš)
Slovenian Society of Mathematicians, Physicists and Astronomers (Andrej Likar)
Automatic Control Society of Slovenia (Sašo Blažič)
Slovenian Association of Technical and Natural Sciences / Engineering Academy of Slovenia (Vojteh Leskovšek)
ACM Slovenia (Andrej Brodnik)
Informatica is financially supported by the Slovenian research agency from the Call for co-financing of scientific periodical publications.
Informatica is surveyed by: ACM Digital Library, Citeseer, COBISS, Compendex, Computer & Information Systems Abstracts, Computer Database, Computer Science Index, Current Mathematical Publications, DBLP Computer Science Bibliography, Directory of Open Access Journals, InfoTrac OneFile, Inspec, Linguistic and Language Behaviour Abstracts, Mathematical Reviews, MatSciNet, MatSci on SilverPlatter, Scopus, Zentralblatt Math
Volume 39 Number 1 March 2015
ISSN 0350-5596
Informatica
An International Journal of Computing and Informatics
Discriminating Between Closely Related Languages on Twitter
N. Ljubešić, D. Kranjčić ... 1
Call Routing Based on a Combination of the Construction-Integration Model and Latent Semantic Analysis: A Full System
G. Jorge-Botana, R. Olmos, A. Barroso ... 9
An Exact Analytical Grossing-Up Algorithm for Tax-Benefit Models
M. Verbič, M. Čok, T. Turk ... 23
The slWaC Corpus of the Slovene Web
T. Erjavec, N. Ljubešić, N. Logar ... 35
A Rule-Based System for Automatic De-identification of Medical Narrative Texts
J. Jaćimović, C. Krstev, D. Jelovac ... 43
Probabilistic 2D Point Interpolation and Extrapolation via Data Modeling
D.J. Jakóbczak ... 53
Swarm Intelligence and its Application in Abnormal Data Detection
B. Liu, M. Cai, J. Yu ... 63
Experimental Comparisons of Multi-class Classifiers
L. Li, Y. Wu, M. Ye ... 71
An Efficient Algorithm for Mining Frequent Closed Itemsets
G. Fang, Y. Wu, M. Li, J. Chen ... 87
An Approach for Context-based Reasoning in Ambient Intelligence
H. Gjoreski ... 99
Informatica 39 (2015) Number 1, pp. 1-101