Volume 38 Number 1 March 2014 ISSN 0350-5596 Informatica An International Journal of Computing and Informatics Special Issue: Advances in Semantic Information Retrieval Guest Editors: Vitaly Klyuev Maxim Mozgovoy Editorial Boards, Publishing Council Informatica is a journal primarily covering intelligent systems in the European computer science, informatics and cognitive community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international refereeing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations. Editing and refereeing are distributed. Each editor from the Editorial Board can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the list of referees. Each paper bears the name of the editor who appointed the referees. Each editor can propose new members for the Editorial Board or referees. Editors and referees inactive for a longer period can be automatically replaced. Changes in the Editorial Board are confirmed by the Executive Editors. The coordination necessary is made through the Executive Editors who examine the reviews, sort the accepted articles and maintain appropriate international distribution. The Executive Board is appointed by the Society Informatika. Informatica is partially supported by the Slovenian Ministry of Higher Education, Science and Technology. Each author is guaranteed to receive the reviews of his article. When accepted, publication in Informatica is guaranteed in less than one year after the Executive Editors receive the corrected version of the article. Executive Editor - Editor in Chief Anton P. Železnikar Volariceva 8, Ljubljana, Slovenia s51em@lea.hamradio.si http://lea.hamradio.si/~s51em/ Executive Associate Editor - Managing Editor Matjaž Gams, Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia Phone: +386 1 4773 900, Fax: +386 1 251 93 85 matjaz.gams@ijs.si http://dis.ijs.si/mezi/matjaz.html Executive Associate Editor - Deputy Managing Editor Mitja Luštrek, Jožef Stefan Institute mitja.lustrek@ijs.si Executive Associate Editor - Technical Editor Drago Torkar, Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia Phone: +386 1 4773 900, Fax: +386 1 251 93 85 drago.torkar@ijs.si Contact Associate Editors Europe, Africa: Matjaz Gams N. and S. America: Shahram Rahimi Asia, Australia: Ling Feng Overview papers: Maria Ganzha Editorial Board Juan Carlos Augusto (Argentina) Costin Badica (Romania) Vladimir Batagelj (Slovenia) Francesco Bergadano (Italy) Marco Botta (Italy) Pavel Brazdil (Portugal) Andrej Brodnik (Slovenia) Ivan Bruha (Canada) Wray Buntine (Finland) Zhihua Cui (China) Ondrej Drbohlav (Czech Republic) Hubert L. Dreyfus (USA) Jozo Dujmovic (USA) Johann Eder (Austria) Ling Feng (China) Vladimir A. Fomichov (Russia) Maria Ganzha (Poland) Sumit Goyal (India) Marjan Gušev (Macedonia) N. Jaisankar (India) Dimitris Kanellopoulos (Greece) Samee Ullah Khan (USA) Hiroaki Kitano (Japan) Igor Kononenko (Slovenia) Miroslav Kubat (USA) Ante Lauc (Croatia) Jadran Lenarcic (Slovenia) Shiguo Lian (China) Suzana Loskovska (Macedonia) Ramon L. de Mantaras (Spain) Natividad Martinez Madrid (Germany) Angelo Montanari (Italy) Pavol Nävrat (Slovakia) Jerzy R. Nawrocki (Poland) Nadia Nedjah (Brasil) Franc Novak (Slovenia) Marcin Paprzycki (USA/Poland) Ivana Podnar Žarko (Croatia) Karl H. Pribram (USA) Luc De Raedt (Belgium) Shahram Rahimi (USA) Dejan Rakovic (Serbia) Jean Ramaekers (Belgium) Wilhelm Rossak (Germany) Ivan Rozman (Slovenia) Sugata Sanyal (India) Walter Schempp (Germany) Johannes Schwinn (Germany) Zhongzhi Shi (China) Oliviero Stock (Italy) Robert Trappl (Austria) Terry Winograd (USA) Stefan Wrobel (Germany) Konrad Wrona (France) Xindong Wu (USA) Yudong Zhang (China) Editors's Introduction to the Special Issue on "Advances in Semantic Information Retrieval" Semantic technologies and information retrieval hold a firm place among topical research directions of modern computer science. Advances in this field define the ways we use computers in the age of Internet and mobile technologies. We see high level of interest to semantics and information retrieval at our annual International Workshop on Advances in Semantic Information Retrieval (ASIR) workshop that continues to attract researchers from different parts of the world. In this special issue of Informatica journal we introduce four revised and extended papers, presented at the workshop. The first paper entitled Automatic Detection of Antisocial Behaviour in Texts by Myriam Munezero, Maxim Mozgovoy, Tuomo Kakkonen, Vitaly Klyuev, and Erkki Sutinen is devoted to detection of antisocial behavior (ASB) manifestations in written documents. The authors search for linguistic features that pertain to ASB in order to use those features for the automatic identification of ASB in text. They used a collected ASB text corpus as a machine learning resource and approach the detection of ASB in text as a binary classification problem. The results from the experiments show that by exploiting the emotional information together with Bag-of-Words (BoW) the accuracy of over 90% in the classification of ASB in text is reached. These findings will have positive implications in the early detection of potentially harmful behavior. The next paper entitled Leveraging User Experience through Input Style Transformation to Improve Access to Music Search Services by Marina Purgina, Andrey Kuznetsov, and Evgeny Pyshkin addresses the problems of music searching and main tasks the developers face in the domain of music information retrieval. The authors introduce the architecture of the software and the data model for integrated access to existing music search web services. The authors illustrate their approach by developing a mobile accessed software prototype that allows the users of Android-running touch screen devices to access several music search engines including Musipedia, Music Ngram Viewer, and FolkTuneFinder. The designed application supports various styles of music input query. The authors pay special attention to input style transformation aimed to fit well the requirements of the supported search services. The third paper entitled User Annotations as a Context for Related Document Search on the Web and Digital Libraries by Jakub Ševcech, Robert Moro, Michal Holub, and Maria Bielikova proposes a method for query construction enabling search for other documents related to the currently studied one using not only the document's content, but also user created annotations as indicators of user's interests. In the proposed approach, annotations are used to activate nodes in a graph created from the document's content employing spreading activation algorithm. The authors evaluate the proposed method in Annota — a service for bookmarking and collaborative annotation of Web pages and PDF documents displayed in a web browser. Along with its main purpose, Annota is designed to support scenarios useful for a novice researcher working together with his or her mentor. Based on Annota usage data the authors also analyzed properties of various types of annotations. Discovered annotation properties served as a basis for simulation performed to determine optimal parameters of the query construction. The authors compared the proposed method to the commonly used tf-idf based method that was outperformed with the method introduced in the paper. Therefore, annotations proved to be a viable source of information for user's i nterest detection. The fourth paper entitled SOAROAD: an Ontology of Architectural Decisions Supporting Assessment of Service Oriented Architectures by Piotr Szwed, Pawel Skrzynski, Grzegorz Rogus, and Jan Werewka describes SOAROAD (SOA Related Ontology of Architectural Decisions) developed to support the evaluation of architectures of information systems based on the Service-Oriented Architecture (SOA) approach. The main goal of the ontology is to provide constructs for documenting architecture. However, it is designed to support future reasoning about architecture quality and fulfilling the nonfunctional system requirements such as scalability, ease of maintenance, reuse of software components, etc. Another important reason is building a common knowledgebase. When building the ontology, the Architecture Tradeoff Analysis Method (ATAM) was adopted which was chosen as a reference methodology of architecture evaluation. As ASIR chairs, we are strongly committed to our basic aim: to create an atmosphere of friendship and cooperation for everyone, interested in computational P linguistics and information retrieval. The workshop is firmly established as an event within Federated conference on computer science and information systems (FedCSIS), annually organized by the System Research Institute of the Polish Academy of Sciences and the Polish Information Processing Society, and sponsored by the IEEE. In its turn, ASIR is supported by the University of Aizu (Japan), known as Japan's first university solely dedicated to computer science engineering. The University of Aizu is a major center of international education and the home of several conferences, sponsored by the ACM and the IEEE. We would wish to acknowledge selfless efforts of our committee members and FedCSIS conference organizers, who ensured high quality of publications and flawless arrangement of the forum. We would like to specially mention professors Marcin Paprzycki, Maria Ganzha, and Halina Kwasnicka, responsible for FedCSIS. We had a great support from our international team of reviewers, consisting of: Grzegorz J. Nalepa, Ryszard Tadeusiewicz (AGH University of Science and Technology, Poland); Eloisa Vargiu (Barcelona Digital Technology Centre, Spain); Larisa Soldatova (Brunel University, United Kingdom); Shih-Hung Wu (Chaoyang University of Technology, Taiwan); Cristian Lai (CRS4, Italy); Ahsan Morshed (CSIRO ICT Centre, Commonwealth Scientific and Industrial Research Organisation, Australia); Roman Shtykh (CyberAgent Inc., Japan); Krzysztof Goczyla (Gdansk University of Technology, Poland); Yannis Haralambous (Institut Telecom - Telecom Bretagne, France); Katarzyna Budzynska (Institute of Philosophy and Sociology of the Polish Academy of Sciences, Poland); Piotr Kulicki, Robert Trypuz (John Paul II Catholic University of Lublin, Poland); Stefano Borgo (Laboratory for Applied Ontology, Italy); Janusz Kaczmarek (Lodz University, Poland); Simone Ludwig (North Dakota State University, United States); Mari Carmen Suarez de Figueroa Baonza (Ontology Engineering Group, Scool of Computer Science at Universidad Politecnica de Madrid, Spain); Raul Palma (Poznan Supercomputing and Networking Center, Poland); Jolanta Cybulka, Jacek Martinek, Agnieszka Lawrynowicz (Poznan University of Technology, Poland); Evgeny Pyshkin (Saint Petersburg State Polytechnical University, Russia); Vladimir Dobrynin (Saint Petersburg State University, Russia); Haofen Wang (Shanghai Jiao Tong University, China); Slawomir Zadrozny (Systems Research Institute, Poland); Massimiliano Carrara (Universita di Padova, Italy); Nikolay Mirenkov, Alexander Vazhenin (University of Aizu, Japan); Marek Reformat (University of Alberta, Canada); Tuomo Kakkonen (University of Eastern Finland, Finland); Miroslav Vacura (University of Economics, Czech Republic); Sabina Leonelli (University of Exeter, United Kingdom); Wladyslaw Homenda (Warsaw University of Technology, Poland); Qun Jin (Waseda University, Japan); Maciej Piasecki (Wroclaw University of Technology, Poland). We also thank Professor Matjaz Gams (managing editor of Informatica), who supported the publication of this special issue. In 2014, we are organizing ASIR workshop within FedCSIS in Warsaw, Poland. We will continue to maintain high standards of quality and organization, set by the first workshops. We welcome all the researchers, interested in semantics and information retrieval, to join our event. Vitally Klyuev Maxim Mozgovoy Editors of the special issue Automatic Detection of Antisocial Behaviour in Texts Myriam Munezero, Calkin Suero Montero, Tuomo Kakkonen and Erkki Sutinen School of Computing, University of Eastern Finland P.O.Box 111, FI-80101, Joensuu, Finland E-mail: {firstname.lastname}@uef.fi Maxim Mozgovoy and Vitaly Klyuev The University of Aizu, Tsuruga, Ikki-machi Aizu-Wakamatsu, Fukushima, 965-8580 Japan E-mail: {mozgovoy, vkluev}@u-aizu.ac.jp Keywords: antisocial behaviour, machine learning, emotion detection, language analysis Received: November 23, 2013 A considerable amount of effort has been made to reduce the physical manifestation of antisocial behaviour (ASB) in communities. However, the key to the early detection of ASB is, in many cases, in observing its manifestations in written language, which has not been studied in detail. In this work, we search for linguistic features that pertain to ASB in order to use those features for the automatic identification of ASB in texts. We use an ASB text corpus we have collected as a machine learning resource and approach the detection of ASB in texts as a binary classification problem where discriminating features are taken from the linguistic representation of texts in the form bag-of-words and ontology-based emotion descriptors. Results from preliminary experiments show that by exploiting the emotional information together with Bag-of-Words (BoW) over 90% accuracy in the classification of ASB in texts is reached. Our findings have positive implications in the early detection of potentially harmful behaviour. Povzetek: Pri analizi asocialnih besedil v omrežjih dosežejo napredek v kvaliteti prepoznavanja z uporabo ontologij čustev. 1 Introduction Text mining allows for the automatic assessment of individuals involved in ASB have disclosed in advance linguistic features in texts. Based on the analysis results, their feelings and plans through oral or written it is possible to analyse, for instance the topics that the language [30]. The Internet has been used as the outlet texts deal with, as well as linguistic styles used in the for the expression of such emotional states and / or plans texts. Language syntax and semantics are tools that are of violent acts through the use of blogs or video sites [9]. used to express thoughts, opinions, beliefs and emotions Moreover, online communication is often used as a way through words. The words used can reveal important of shouting out people's intentions before engaging in aspects of someone's social and psychological worlds their acts of violence [21, 2]. [33]. Of interest to us, are words and linguistic features The growth of the volume of harmful material on the that express thoughts or feelings of harming another Web has resulted in increased research for its automatic member of the community. In this paper, we analyse and detection [8]. Being able to automatically detect negative discover the linguistic features that pertain to ASB based material is beneficial, for instance, to managers of on machine learning (ML) and the antisocial behaviour websites that allow users to post content or as part of an (ASB) corpus we introduced in Munezero et al. [27]. early warning system to authorities on possible threats to Identifying these features will allow us to detect new public safety. The automatic detection of ASB could also instances of ASB give rise to self-awareness systems for the individuals ASB is broadly defined as any unconsidered action that are expressing thoughts or emotions related to ASB. taken against individuals or groups of individuals that This paper investigates the linguistic features used in may cause harm or distress to society [5]. Often texts that relate to ASB. By employing ML algorithms __we explore the linguistic features that can be used to This paper is based on: M. Munezero , M. Mozgovoy, T. reliably classify texts , conta^ng ASB. For our,i"itia; Kakkonen, V Klyuev and E. Sutinen, Antisocial experiments, we explore the impact that BoW and Behavior Corpus ^or Harmful Language Detection, emotions as linguistic features have on the classification published in the Proceedings of the 3rd International o . Workshop on Advances in Semantic Information Retrieval (part of the FedCSIS' 2013 conference). 2 Related work Much of the research work on ASB has been performed in the realm of social sciences and psychiatry. There have also been efforts towards detecting and preventing physical manifestations of ASB (such as violence) in communities (e.g. the Home Office in United Kingdom1). As such, this problem has not been particularly tackled from the perspective of computational linguistic analysis for early detection and intervention. As no previous general models for detecting ASB from text exist, we provide an overview of the work done in the context of detecting cyberbullying, terrorism and criminal behaviour which all can be considered as specific forms of ASB. Perhaps the most notable related work has been carried out in a research project entitled "Intelligent information system supporting observation, searching and detection for security of citizens in urban environment" [41]. The project aimed at automatic detection of terrorist threats and recognition of serious criminal behaviour or violence based on multi-media content. Within the context of INDECT, criminal behaviour is defined as "behaviour related to terrorist acts, serious criminal activities or criminal activities in the Internet". Our work differs from the one done in the INDECT project in the focus of the research. While INDECT aims at using the analysis of images, video, and text, our focus is on the analysis of text data. In their cyberbullying study, Dinakar et al., [12] made use of YouTube comments that involved sensitive topics related to race and culture, sexuality and intelligence. Moreover, Yin et al. [45] made use of online forums for detecting online harassment. The cyber-pedophilia research by Bogdanova et al. [3] made use of perverted online journal texts based on which to learn models to discriminated pedophiles from non-pedophiles. While the corpora used in the studies reported above contain some forms of negative behaviours, their focus is more than ours. We make use of a broader ASB corpus that contains text related to ASB ranging from suicide notes to terrorism and online threats. 2.1 Language expressivity in ASB Fitzgerald [15] describes the language of ASB as being "deeply value laden, implying purposeful negative action and or behaviour harmful to others". In addition, some researchers have suggested that certain emotions are closely associated with ASB. Some of these emotions include anger, frustration, arrogance, shame, anxiety, depression, sadness, low levels of fear, and lack of guilt [7]. Based on these descriptions, it is reasonable to expect some distinguishing linguistic features in the ASB corpora that may include the use of words that are 1 http://www.homeoffice.gov.uk/crime/anti-social-behaviour/ deemed threatening, harmful or related to violence and emotions that are perceived as overly negative. 2.2 Detecting emotions in texts Emotions have long been investigated in several studies ranging from social psychology to computational linguistics [19]. Lists of primary or "basic" emotions have been put forward in the psychological field prominently by Frijda [16], Ekman [13] and Plutchik [34] among others. The basic emotion categories used in these lists include: anger, sadness, joy, love, surprise, happiness, fear, and disgust (see [28] [37]), for a detailed compilation of primary emotion lists). Within the Natural Language Processing (NLP) research community, more often than not researchers use Ekman [13] six basic emotion categories: anger, disgust, fear, happiness, surprise and sadness [1] [39]. Performing emotion analysis on various types of text can help us understand and measure the emotions expressed in them. Broadly speaking, two main methods exist for the analysis of emotions within the NLP community: word lists-based and ML-based. Word list based methods use lexical resources such as lists of emotion-bearing words, lexicons or affective dictionaries [29] [14], and databases of commonsense knowledge [20], The General Inquirer (GI) [38], the Affective Norms for English Words (ANEW) [4], the WordNet-Affect [40] [42], and more recently the NRC wordemotion association lexicon [25] [24], are all well-known lexical resources. Whereas ML-based methods cast the problem as a multi-class classification problem, for instance, the automatic emotion classification of news headlines into emotion categories [10]. A significant amount of annotated data is required that represents each of the emotions that are used as the classes. In this work, we use ML to classify texts as containing ASB or not. Our aim is to investigate which features are the best for indentifying instances of ASB in texts. 3 Experimental design For an exploratory purpose, we conducted four experiments using the ASB corpus. We approached the classification task as a binary classification task, that is, a document is classified as either containing or not containing ASB. We compared the positive ASB texts first with each of the three negative sets of examples (Sect 3.1) and then all the corpora together. We approached it in this manner firstly because the corpora are written in different styles and we wanted to observe whether ASB texts show some distinct characteristics allowing for successful classification from each of the three negative sets, secondly because between the sets there was a balance in terms of the number of documents and average size in characters. We experimented with three supervised ML classifiers for the classification task (Sect 3.2) using three sets of features (Sect 3.3). Furthermore, with each experimental corpus, we used ten-cross validation, that is, the entire corpus was first partitioned into a training set and test set of 90% and 10% respectively, this process was performed ten times. The average results of the 10-cross validation are reported in Section 4. 3.1 Corpora The following subsections describe each of the four text corpora used in the experimental study. As we are firstly concerned with the binary classification analysis, we compared both positive (those with ASB) (Sect 3.1.1) and negative (Sect 3.1.2 - 3.1.4) examples (non-ASB texts). In order to obtain the negative examples, we used two popular sentiment corpora, movie reviews [31], the emotion annotated corpus (ISEAR) [36] and factual Wikipedia article extracts [44]. Table 1 summarises the documents in the four corpora. 3.1.1 ASB corpus The ASB corpus is a collection of aggressive, violent, and hostile texts. The texts were collected from various blog posts and news-websites which Munezero et al. [27] could conclusively identify as being ASB. In total 148 documents were identified as ASB. The collection is all English texts, having topics such as: serial killer manifestos, antisocial texts, terrorism, violence-based texts, and suicide notes. Important to us, the messages in these documents are reflective of the author's thoughts and emotions. The corpus was collected specifically for the purpose of detecting ASB, conflict, crime and violence behaviour from text documents. The collection is based on the research on ASB that has shown that aggression, violence, hostility, and lack of empathy are among the traits that are most directly associated with ASB [6] [32]. 3.1.2 International Survey on Emotion Antecedents and Reactions The ISEAR corpus is a collection of student reports on situations in which the respondents felt any of the seven major emotions: joy, fear, anger, sadness, disgust, shame, and guilt. The responses include descriptions of how they appraised the situation and how they reacted [36]. 3.1.3 Movie reviews This collection consists of 2000 movie reviews. They are labelled in respect to their polarity: negative and positive. The corpus was first used in [31], and now is often applied in sentiment analysis and opinion mining research as a standard development and test set. 3.1.4 Wikipedia text extracts We searched and collected Wikipedia articles by using similar concepts such as those we found to be characteristic ASB: killing, terror, violence, aggression, and frustration. The aim of including these texts was to observe how well our classification algorithms could distinguish between ASB texts and informative texts containing similar keywords. 3.2 Classifiers For classifying the documents into the two classes, we experimented with three supervised ML classifiers: Multinomial Naive Bayes, SMO for the implementation of Support Vector Machines, and J48 for Decision Trees. The three selected algorithms have shown to be effective in various text classification studies. We made use of the WEKA tool [17] to implement the classifiers used in our study. Multinomial Näive Bayes (MNB). The NB classifier is a probabilistic model that assumes independence of the attributes used in the classification. The classifier has shown good performance even when the sample size is small [11]. We used the MNB classifier implemented in WEKA, which uses a multinomial distribution for each of the features. Support Vector Machine (SVM) is based on the maximum margin hyperplane rather than probabilities as the Naive Bayes [23]. The SVM classifier aims to find a hyperplane, represented by a vector that maximally separates the document vectors in one class from those in the other [31]. J48 Decision Tree (J48) is an implementation of the C4.5 decision tree in WEKA. Decision trees are predictive models that are used for classification tasks by starting at the root of tree and moving through it until a leaf is encountered [35]. The decision tree is built from the input training data using the property of information gain or entropy to build and divide nodes of the decision tree in a manner that best represents the training data and the feature vector [12]. 3.3 Classification features 3.3.1 Bag-of-Words As a first experiment with the ASB corpus, we used the Vector Space Model approach so as to consider the words as independent entities. The model makes an implicit assumption that the order of words in document does not matter, which is also referred to as the Bag-of-Words (BoW) assumption. The approach is sufficient for many classification tasks, as the collection of words appearing in the document (in any order) is usually sufficient to differentiate between semantic concepts [23]. Each document in the corpora was represented as a feature vector composed of binary attributes for each word that occurs in the file. Let {fi,^, fm} be a predefined set of m features that can appear in a document. Let ni(d) be the number of times fi occurs in a document d. Then each document d is represented by the document vector d:=(ni(d), n2(d),^,nm(d)) [31]. If a word appears in a given document, its corresponding attribute is set to 1; otherwise it is set to 0. Generally, the BoW approach works well for text classification. However, it does not take into consideration any semantic and contextual information. Moreover, in order to reduce the number of words in the BOW representation we used the LovinsStemmer [22] in order to replace each word by its stem. Table 1: Corpora description with source, number of documents and average document size. Corpus Source No. of Documents Avg. Document Size (characters) ASB [27] 148 680 ISEAR [36] 265 110 Movie reviews [31] 178 390 Wikipedia extracts [44] 212 680 Total 803 1860 3.3.2 Emotions Emotions reveal connections of individuals to values in the social world and hence, are the triggers of many social psychological phenomena, such as altruism, antisocial behavior and aggression [32]. In our experiments, we analyse in particular, emotions that might be present in the ASB corpus and analyse whether they are reliable classification features. To identify the emotions presented in the corpora, we made use of an emotion ontology introduced in [26]. It is an ontology of emotion categories whereby each category contains a set of emotion classes and emotion words. Figure 1, demonstrates the negative emotion with samples from the disliking class. The emotion ontology is based on WordNetAffect and it contains 85 classes and 1,499 words. On average, the ontology contains 17.6 words per emotion class which gives a relatively wide coverage of emotion classes and emotion words. This together with the fact that the ontology was not fitted on to any particular dataset or text corpus makes it suitable to be used in our experiments as a basis for ASB classification. For the classification, we made use of two types of emotion-based features: ontology-dependent and ontology-independent emotion features. The ontology-dependent features are collected through a tagging process using the emotion ontology. Through the tagging process, we collected tags such as the sum of all the relative frequencies of the emotion classes that belong to the emotion categories represented in the ontology. While the ontology-independent emotion features were obtained by using the SentiStrength system [42] to calculate the emotion strength of a text. Figure1: Sample from the negative-emotion category section of the ontology. 4 Results For an exploratory purpose, we conducted four experiments using the ASB corpus and three corpora as negative examples of ASB (Subsection 3.1.2 - 3.2.4). We explored the impact that BoW and emotions as classification features have on the detection of ASB texts. In the first experiment, binary classifiers using the three classifiers described in Subsection 3.2 were trained on ASB+ISEAR, in the second on ASB+Movie reviews, and in the third on ASB+Wikipedia extracts. Finally, all the corpora were combined into a single data set. SVM J4a MNB lASB-ISEAR "A SB-Movie Reviews ■ ASB-Wikipedia Figure 2: Accuracy results of SVM, J48 and MNB classifier with emotions+BoW as classification features (%). Table 2: Results from SVM classifier (%). Corpora Features Accuracy Precision Recall F-measure ASB + ISEAR Emotions 84.9 85.8 84.9 84.9 BoW 86.2 88.7 86.2 86.0 Emotions+BoW 86.7 89.1 86.8 86.7 ASB + MovieReview Emotions 81.1 82.5 81.2 79.1 BoW 95.8 95.8 95.8 95.7 Emotions+BoW 95.4 95.5 95.4 95.3 ASB + Wikipedia Emotions 69.7 70.3 69.7 69.0 BoW 79.6 80.2 79.6 79.6 Emotions+BoW 80.2 80.3 80.3 80.3 Table 3: Results from J48 classifier (%). Corpora Features Accuracy Precision Recall F-measure ASB + ISEAR Emotions 84.9 84.9 84.9 84.9 BoW 79.2 79.5 79.2 79.2 Emotions+BoW 83.0 83.2 83.0 83.0 ASB + MovieReview Emotions 84.6 84.6 84.6 84.6 BoW 98.1 98.1 98.1 98.1 Emotions+BoW 96.5 96.6 96.5 96.5 ASB + Wikipedia Emotions 62.5 62.3 62.5 62.3 BoW 82.9 83.6 82.9 82.9 Emotions+BoW 81.6 82.1 81.6 81.6 Table 4: Results from MNB classifier (%). Corpora Features Accuracy Precision Recall F-measure ASB + ISEAR Emotions 71.7 72.0 71.7 71.5 BoW 88.7 90.0 88.7 88.6 Emotions+BoW 88.7 89.6 88.7 88.6 ASB + MovieReview Emotions 76.9 77.0 76.9 74.2 BoW 98.4 98.4 98.4 98.4 Emotions+BoW 99.6 99.6 99.6 99.6 ASB + Wikipedia Emotions 58.5 59.4 58.6 58.5 BoW 89.5 90.4 89.5 89.3 Emotions+BoW 90.1 90.8 90.1 90.1 The performances of the classifiers were then compared in terms of accuracy, precision, recall and F-measure. We made use of ten-fold cross validation whereby samples of data are randomly drawn for analysis and the classification algorithm then computes predicted values [23]. Table 2, 3 and 4 show the average of the ten-fold cross validation results on the corpora for each of the ML classifiers with a) BoW, b) emotions, and c) emotions + BoW, as features. Using SVM classifier, the emotion + BoW features performed better in two of the experiments (ASB+ISEAR and ASB+wikipedia). With J48, the emotions were the better discriminator for the ASB+ISEAR. However the BoW model performed better for the other sets. In looking at the MNB classifier, the emotions + BoW feature set performed better in all the three sets. Hence, in majority of the cases, the addition of emotions to BoW provided better accuracy results. We note that our experiments are preliminary, especially as there is no standard ASB corpus available and the number of documents in the ASB corpora is relatively small. Figure 2 summarizes the accuracy results of the three classifiers using both the emotions + BoW as features. From Figure 2, we see that accuracy results in all three classifiers are over 85% which indicates a relatively high accuracy. The best accuracy (99,6%) was reached on the ASB + Movie reviews set with the MNB classifier. We further noted that when all the four corpora (ASB + ISEAR + MovieReviews + Wikipedia) were combined, the classifiers were not learning. This was due to the imbalance in the class distribution, i.e. the majority of the texts were from the negative (not ASB) class, which then causes ML algorithms to perform poorly on the minority class [18]. However, with the MNB classifier, we observed that it was able to learn in spite of the imbalance. A closer look at the most predictive features revealed emotion classes such as 'general-dislike', 'hate', 'anxiety', and 'sadness' as expected based on the known connection between ASB and negative emotions. Surprisingly, however the emotion class 'affection' also appeared as a contributing attribute. 5 Conclusion and future work In this paper, we applied text classification techniques for the analysis and detection of ASB. We reported on experiments where the linguistic features, BoW and emotions were used for the classification of ASB. Our experimental results illustrated that linguistic features such as BoW and emotions can be used successfully to classify ASB in text. We found that the performance of MNB was consistently better than that of J48 and SVM when using the emotions + BoW features. In comparison, when using emotion features alone, the J48 and SVM had the highest accuracy on the ASB+ISEAR (84,9%) and with the BoW features alone, J48 had the highest accuracy with the ASB+MovieReview (98,1%). Thus both features are essentially for achieving high classification accuracy. Deeper analysis of the features further revealed subsets of emotion features that most contributed to the classification accuracy. ASB is a growing concern to the society, and in some instances to the government and law enforcement agencies around the world. In line with creating a safer community, identifying the individuals who pose a danger to a community involves analysing the information they put forward. Thus future work involves exploiting the identified linguistic features to build a model to classify new instances of ASB in text as part of an early detection system. Using the features, we would also like to explore the categorizations between different types of ASB, for example physical manifestations of ASB such as violence to other individuals, and non-physical acts such as cyberbullying. Additionally, with the identified features, we would like to extend the corpus. A larger corpus would allow us to have a larger training set for ML algorithms allowing for learning of new features for building a classification model. In this case, fewer than 200 records were used that could be confidently identified as ASB, and due to this amount, we observed that the SVM and J48 classification models were not learning due to the imbalance in the data. These experiments were our first attempt at automatically detecting ASB in texts. The results we demonstrated are promising, but experiments on large-scale date are necessary to confirm the robustness of our approach. Moreover, in this paper, we investigate BoW and emotions as features, but in future we plan to include semantic analysis which could additionally reveal features for ASB identification. Regardless, in our work, we have found that NLP techniques have potential for the early detection of ASB while the harmful behaviour might still be at its planning stage. Our results have direct applications for national and local security. Acknowledgement This work was supported partially by the grant for "Detecting and visualizing their changes in Text" project No. 14166, funded by the Academy of Finland and JSPS KAKENHI Grant Number 25330410. References [1] Alm, C. O., & Sproat, R. (2005). Emotional sequencing and development in fairy tales. Springer, (pp. 668-674). [2] Böckler, N., Seeger, T., Sitzer, P., & Heitmeyer, W. (2013). School Shootings: International Research, Case Studies, and Concepts for Prevention. New York, USA: Springer. [3] Bogdanova, D., Rosso, P., & Solorio, T. (2012). Modelling fixated discourse in chats with cyberpedophiles. Proceedings of the Workshop on Computational Approaches to Deception Detection (pp. 86-90). Association for Computational Linguistics. [4] Bradley, M. M., & Lang, P. J. (1999). Affective norms for English words (ANEW). Instruction Manual and Affective Ratings. Technical report, University of Florida, The Center for Research in Psychophysiology. [5] Card, R., & Ward, R. (1998). The Crime and Disorder Act. Retrieved 3 28, 2013, from legislation.gov.uk: http://www.legislation.gov.uk/ukpga/1998/37/conte nts [6] Clarke, D. (Abingdon, UK). Pro-Social and AntiSocial Behaviour. 2003: Taylor & Francis. [7] Cohen, L. J. (2005). Neurobiology of Antisociality. In C. Stough, Neurobiology of Exceptionality (pp. 107-124). New York, USA: Kluver Academic/Plenum Publishers. [8] Correa, D., & Sureka, A. (2013). Solutions to Detect and Analyze Online Radicalization: A Survey. Delhi, India: Indraprastha Institute of Information Technology. [9] Crowley, S. (2007, 11 7). Finland Shocked at Fatal Shooting. Retrieved 3 28, 2013, from BBC News: http://news.bbc.co.uk/1/hi/world/europe/7084045.st m [10] Danisman, T., & Alpkocak, A. (2008). Feeler: Emotion Classification of Text Using Vector Space Model. AISB 2008 Convention, Communication, Interaction and Social Intelligence. 2, pp. 53-59. Aberdeen, UK. Affective Language in Human and Machine. [11] De Ferrari, L., & Struart, A. (2006). Mining housekeeping genes with a Naive Bayes classifier. BMC Genomics, 7(277). [12] Dinakar, K., Reichart, R., & Lieberman, H. (2011). Modeling the detection of textual cyberbullying. International Conference on Weblog and Social Media-Social Mobile Web Workshop. [13] Ekman, P. (1993). Facial Expression and Emotion. American Psychologist, 8(4), 376-379. [14] Elliot, C. (1992). The affective reasoner: A process model of emotions in a multi-agent system. Ph.D. thesis,, Northwestern University, Institute for the Learning Sciences. [15] Fitzgerald, M. (2011). Submission to the Department of Human Services on behalf of Public Housing Tenants in relation to Human Rights concerns raised by the Anti-Social Behavior Pilot. Fitzroy Legal Service. [16] Frijda, N. H. (1986). Emotional Behavior. In The Emotions. Studies in Emotion and Social Interaction. Cambridge University Press. [17] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA Data Mining Software: An Update;. SIGKDD Explorations, 11(1). [18] Hulse, J. v., Khoshgoftaar, T. M., & Napolitano, A. (2007). Experimental Perspectives on Learning from Imbalanced Data. Proceedings of the 24th International Conference on Machine Learning. Corvallis, OR. [19] Liu, B. (2010). Sentiment analysis and subjectivity. In N. Indurkhya, & F. J. Damerau (Eds.), Handbook of Natural Language Processing, (2nd ed.). Boca Raton, Florida, USA: CRC Press, Taylor and Francis Group. [20] Liu, H., Lieberman, H., & Selker, T. (2003). A Model of Textual Affect Sensing using Real-World Knowledge. Proceedings of the 2003 IUI, (pp. 125132). [21] Logan, M. (2012, July). Case Study: No More Bagpipes. Retrieved March 15, 2013, from The Threat of the Psychopath: http://www.fbi.gov/stats-services/publications/law-enforcement-bulletin/july-2012/case-study [22] Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics., 11, 22-31. [23] Miner, G., Elder, J., Hill, T., Nisbet, R., Delen, D., & Fast, A. (2012). Practical text mining and statistical analysis for non-structured text data applications (1st ed). Waltham, MA: Academic Press. [24] Mohammad, S. M. (2012). Portable Features for Classifying Emotional Text. Proceedings of the 2012 NAACL HLT, (pp. 587-591). [25] Mohammad, S. M., & Turney, P. D. (2010). Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, (pp. 26-34). [26] Montero, C. S., Kakkonen, T., & Munezero, M. (2014). Investigating the Role of Emotion-based Features in Author Gender Classification of Informal Text. A. Gelbukh (Ed.): CICLing 2014, Part II, Lecture Notes in Computer Science 8404, (pp. 98-114), Springer-Verlag Berlin Heidelberg 2014. [27] Munezero, M., Mozgovoy, M., Kakkonen, T., Klyuev, V., & Sutinen, E. (2013). Antisocial behavior corpus for harmful language detection. Federate Conference in Computer Science. Krakow, Poland. [28] Ortony, A., Clore, G. L., & Collins, A. (1994). The Structure of the Theory. In Chapter 2 in The Cognitive Structure of Emotions (pp. 15-33). Cambridge University Press. [29] Ortony, A., Clore, G. L., & Foss, M. A. (1987). The Referential Structure of the Affective Lexicon. Cognitive Science, 11, 341-364. [30] O'Toole, M. E. (2000). School Shooter: A Threat Assessment Perspective. National Center for the Analysis of Violent Crime. Quantico, Virginia, USA.: Federal Bureau of Investigation. [31] Pang, B., Lee, L., & Vaithyanatha, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10: (pp. 79-86). Association for Computational Linguistics. [32] Parrot, G. W. (2001). Emotions in Social Psychology. Philadelphia, Pennsylvania, USA: Taylor & Francis. [33] Pennebaker, J. W., Mehl, M. R., & Niederhoffer, K. G. (2003). Psychological Aspects of Natural Language Use: Our Words, Our Selves. Annual Review of Psychology, 54, 547-577. [34] Plutchik, R. (2001). The Nature of Emotions. American Scientist, 89(4), 344-350. [35] Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, Calif: Morgan Kaufmann Publishers. [36] Scherer, K. R., & Wallbott, H. G. (1994). Evidence for universality and cultural variation of differential emotion response patternin. Journal of personality and social psychology, 66, 310. [37] Shaver, P., Schwartz, J., Kirson, D., & O'Connor, C. (1987). Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology, 52, 1061-1086. [38] Stone, P. J., Dunphy, D. C., Smith, M. S., & Ogilvie, D. M. (1966). The General Inquirer: A Computer Approach to Content Analysis. Cambridge, Massachusetts, USA: The MIT Press. [39] Strapparava, C., & Mihalcea, R. (2008). Learning to Identify Emotions in Text. Proceedings of the ACM SAC'08, (pp. 1556-1560). [40] Strapparava, C., & Valitutti, A. (2004). WordNet-Affect: an Affective Extension of WordNet. Proceedings of the 4th LRE, (pp. 1083-1086). [41] The INDECT Consortium. (n.d.). XML Data Corpus: Report on Methodology for Collection, Cleaning and Unified Representation of Large Textual Data from Various Sources: News Reports Weblogs Chat. Retrieved 10 10, 2010, from http ://www. indect-pro - ject.eu/files/deliverables/public/INDECT_Deliverab le_4.1_v20090630a.pdf (2010, Dec. 10). [42] Thelwall, M., Bucley, K., Paltoglou, G., & Cai, D. (2010). Sentiment Strength Detection in Short Informal Text. Journal Of The American Society for Information Science And Technology., 61(12), 2544-2558. [43] Valitutti, A., Strapparava, C., & Stock, O. (2004). Developing Affective Lexical Resources. PsychNology, 2(1), 61-83. [44] Wikimedia Foundation. (2013, May 08). Wikipedia: The Free Enceclopedia. [45] Yin, D., Xue, Z., Hong, L., Davison, B. D., Kontostathis, A., & Edwards, L. (2009). Detection of harassment on web 2.0. Proceedings of the Content Analysis in the WEB, 2. Leveraging User Experience through Input Style Transformation to Improve Access to Music Search Services Marina Purgina, Andrey Kuznetsov and Evgeny Pyshkin St. Petersburg State Polytechnical University Institute of Computing and Control Polytechnicheskaya ul. 21, St. Petersburg 194021, Russia E-mail: mapurgina@gmail.com, andrei.n.kuznetsov@gmail.com, pyshkin@icc.spbstu.ru http://kspt.ftk.spbstu.ru/info/staff/pyshkin/en Keywords: information retrieval, music search, mobile application, query transformation, input style transformation Received: November 23, 2013 We analyze problems of music searching and main tasks the developers face in the domain of music information retrieval. We introduce the architecture of the software and the data model for integrated access to existing music searching web services. We illustrate our approach by developing a mobile accessed software prototype which allows users of Android running touch screen devices accessing several music searchers including Musipedia, Music Ngram Viewer, and FolkTuneFinder. The application supports various styles of music input query. We pay special attention to input style transformation aimed to fit well the requirements of the supported search services. Povzetek: Opisana je metoda iskanja glasbenih posnetkov na androidnih napravah. 1 Introduction A variety of multimedia resources constitutes considerable part of the present-day Web information content. Numerous search services usually provide special features to deal with different types of media such as books, maps, images, audio and video recordings, software, etc. In addition to general-purpose searching systems there are solutions using specialized domain sensitive interfaces. Truly, quality of a search service depends both on the efficiency of algorithms it relies on, and on user interface facilities. Such interfaces may include special syntax forms, user query visualization facilities, interactive assisting tools, components for non-textual query input, interactive and "clickable" concept clouds, and so on [1]. Depending on the searching tasks, specialized user interfaces may support different kinds of input like mathematical equations or chemical changes, geographic maps, XML-based resource descriptions, fragments of software source code, editable graphs, etc. In text searching such aspects as morphological, synonymic and grammar variations, malapropisms, and spelling errors condition particular difficulties of a searching process. In the music searching domain there are specific complications like tonality changes, omitted or incorrectly played notes or intervals, time and rhythmic errors. Thus, although there are eventual similarities between This paper is based on M. Purgina, A. Kuznetsov and E. Pyshkin An Approach for Developing a Mobile Accessed Music Search Integration Platform published in the proceedings of the 3rd International Workshop on Advances in Semantic Information Retrieval (part of the FedC-SIS'2013 conference). text and music information retrieval, they differs significantly [2]. Human ability to recognize music is strongly interrelated to listener's experience which may be considered itself to be a product of music intelligent perception [3, 4]. Recently (see [5]) we also analyzed internal models of music representation (with most attention to function-based representation) being the foundation of various algorithms for melody extraction, main voice recognition, authorship attribution, etc. Music processing algorithms use the previous user experience implicitly. As examples, we could cite the Skyline melody extraction algorithm [6] based on the empirical principle that the melody is often in the upper voice, or Melody Lines algorithms based on the idea of grouping notes with closer pitches [7]. The remaining text of the article is organized as follows. In section 2 we review music searching systems and approaches of the day. We also introduce our experience in the domain of human centric computing and refer to some recent related works. In section 3 we describe music query input styles and analyze possible transformations of music input forms so as to fit the requirements of search services. Section 4 contains the description of the developed Android application architecture. We show how it works and make an attempt to analyze the searching output from the point of a musicologist. 2 Background and related works In general, we are able to search music either by metadata description, or by music content. Searching by metadata seems to be very similar to textual searching. Metadata is not necessarily to be restricted by bibliographical data like author, title, artist, conductor, editor, date of publication, etc. They may also include information about performance itself like time signature, tempo, musical instruments, tonality, lyrics and so on. In some situations searching by metadata and searching by content are not so different. Let us consider "A Dictionary Of Musical Themes" [8] which includes short snippets (transposed to C-dur tonality) of musical themes of a composition. These snippets can be used to seek unknown composition by its theme. On one hand, these themes extracted from the original composition constitute metadata, on the other hand, they enable searching a composition by its content. In fact, user can use an ordinary text-based search engine to retrieve compositions represented in such a dictionary. Nowadays exactly the same technique are being used in indexing algorithms and fingerprint algorithms with only few differences: a) metadata are extracted automatically, and b) being sort of pure mathematical abstractions (e.g. hashcodes, fingerprint vectors, etc.) metadata may have no sense for humans. Since the time when the first dictionaries of musical themes were created, the world dramatically changed. People developed new multimedia carriers requiring more complicated search scenarios: 1. Searching music information by existing audio fragment considered as an input. 2. Searching compositions by human remembrance represented in a form of singed, hummed, tapped or anyhow else defined melody or rhythm fragment. 3. Searching music by lyrics. 4. Searching music by bibliographical data (e.g. title, author etc.). 5. Searching music by keywords (e.g. "scary Haloween music"). Searching by given audio fragments is supported by many specialized search engines like Audiotag, Tunatic or Shazam[9]. As a rule, it is implemented on the basis of so called audio fingerprinting technique. The idea of such an approach is to convert an audio fragment of fixed length to a low-dimensional vector by extracting certain spectral features from the input signal. Then this vector (being a kind of audio spectral fingerprint) is compared to fingerprints stored in some database [10, 11]. In the case of Searching by human remembrance scenario we can distinguish two situations. In the first one the search engine deals a monophonic user query representing a main voice, a rhythm, a melodic or rhythmic contour. In the second case the user query represents polyphonic music fragment (e.g. while searching by note score). Errors in user queries condition the main problems of Searching by human remembrance. Thus, music fragments comparison algorithms have to be robust in regards to the most popular user errors like expansion, compression, omission or repetition [12, 13]. The another complication is that it is impossible to search directly within the audio resources' binary contents since we usually have no exact faithful audio fragment2. Searching by lyrics is not fully automated. An end user is able to use general purpose text search machine (like google, yahoo or yandex) for this task, but text based tools search in existing textual data which is usually published either by author or by music lovers. In theory it's possible to use speech recognition engine for the purpose of lyrics extraction [14], but in practice the recognition quality is not good enough for automatic lyrics transcription. We experimented with Google Voice service and it showed good results for lyrics recognition if a user is singing in silent environment with no background music (it successfully recognized 17 of 20 songs). But if we try to play a broadcasted recording of popular artists, the Google Voice simply ignored the input just like it was nothing sang at all. Indeed, the Google Voice was designed as a speech recognition service, not lyrics transcription service. The problem of automatic recognition of lyrics in singing exists, but this topic is actually out of scope of this research. The first three cited scenarios are related to the so called cover song identification task. However the final goal of such a kind of searching process is not always a song itself (which may be an object of copyright restrictions). The user may be satisfied with obtaining music bibliographical metadata that can be used to look for the song in an online music store. It is exactly what the fourth scenario Searching music by bibliographical data means. Finally, Searching by keywords is often implemented through tags annotations. In this case a search query is being parsed to find keywords that could be considered to be tags. Then these tags (i.e. words from a predefined vocabulary of genres, moods, instruments etc.) are used for searching through an annotated database. For example, such technique is used by Last.fm or Pandora online radios. The search scenarios we explained here cover only the most common tasks. Besides such kinds of usual tasks, there are many other music information retrieval problems including searching compositions by their emotional properties (this task appears in recommendation systems), identifying exact particular performance instead of cover song retrieval, locating a position inside a song (used in score following, for instance), and many others3. Similar to other kinds of IR systems, a MIR system usually contains frontend and backend components. In this paper we pay attention mostly to a frontend part communicating with existing searchers, considering a backend system as an Application Program Interface (API). 2For the user provided sequence "A4 B4 C4" as an input it may happened that a melody that the user is actually looking for does not contain such notes at all. 3The overview of the most popular MIR tasks together with descriptions of the recent algorithms can be found at Music Information Retrieval Evaluation eXchange (MIREX) home page (see http://www. music-ir.org/mirex). The extensive description of input styles used by music search engines may be found in [15]. Presently there are many searching web services like Midomi, Musipedia, Ritmoteka, Songtapper, Music Ngram Viewer, and Folk-TuneFinder where customers use one of several possible styles to input a music query. Table 1 represents possible ways to access different music web search services and pays attention to the following facilities: - Voice Using voice recorded from microphone - Rhythm Using tapping/clicking with keyboard, mouse or other input device - Tags Support for keywords or tags - Exmpl Using uploaded audio fragment as an example - Lyr Searching by lyrics - Notes Music score or pitch notation - VKB Virtual keyboard generating note sequence with rhythm - URD Parsons code - API External API (SOAP, REST, etc.) Nowadays many services are accessible via browsers since they support Web interface features. Another important issue is the possibility to access some services from inside the software applications by using open protocols. It gives the way to create tools which allow users not to be limited by only one service at a time. Particularly, Musipedia service uses SOAP protocol described in [16]. FolkTuneFinder and Music Ngram Viewer (both are also used in our work as target search services) are based on the REST architectural style and their responses are wrapped in JSON format. Detailed description of the API usage rules and examples for Music Ngram Viewer service may be found in [17]. With respect to music inputs styles, existing tools support the following opportunities to define a music fragment: - To sing or to hum the theme and to transfer the recording to the music search engine. - To write notes by using one of known music notations directly (e.g. music score, Helmholtz or American pitch notation, MIDI notation, etc.). - To tap the rhythm. - To play the melody with the use of a virtual keyboard. - To use MIDI-compatible instrument or it's software model. - To enter Parsons code, or to set the melody contour by using "U, R, D" instructions 4 as shown in Figure 1. - To define keywords or enter text query. - To record a piece of original composition (e.g. record a composition played on a radio with use of microphone) and to transfer the recording to the music search engine. Figure 1: Beethoven's "Ode to Joe" fragment represented in Parsons code. 3 Music query input styles As shown in the above section, we may define the music query by using different input styles. For a searching framework, the important issue is not only featuring different input interfaces but transforming one query form to the another depending on search service availability and it's communication schema. Different input styles are useful since the user music qualification differs. Melody definition by using a virtual or real keyboard is one of the most exact ways to represent the query, since it accumulates most melody components. However it is not common that users are skilled enough to use the piano keyboard as well as to write adequate note score. Contrariwise, tapping a rhythm seems to be relatively simple way to define music searching query. The problem is that the number of possible rhythm patterns is evidently less than the number of compositions. It means that even if we succeed to tap the rhythm correctly, we may apparently have a list of thousands titles in return [5]. If a user didn't record a fragment while the composition was playing, then the only choice is to sing a melody by voice. Recording a voice (so called query-by-humming or query-by-singing) requires both user's singing skills and support for such a facility from the search system. It is important to note that query-by-example and query-by-humming are quite different tasks of MIR5). 4Each pair of consecutive notes is coded as U ("sound goes Up") if the second note is higher than the first note, R ("Repeat") if the consecutive pitches are equal, and D ("Down") otherwise. Some systems use S ("the Same") instead of R to designate pitch repetition. Rhythm is completely ignored. 5http://www.music-ir.org/mirex/wiki/2013: Main_Page Table 1: Accessing Music Searching Web Services Name Access Voice Notes VKB URD Rhythm Tags API Exmpl Lyr Audiotaga - - - - - - - + - Tunaticb - - - - - - - + - Shazamc - - - - - + - + - Midomid + - - - - - - - - Musipediae + + + + + + SOAP - - Ritmotekaf - - - - + - - - - Songtapperg - - - - + - - - - Music Ngram Viewerh - - + - - + REST - - FolkTuneFinder® - - + + + + REST - - Google, Yandex, Yahoo! - + - - - + + - + ahttp://www.audiotag.info bhttp://www.wildbits.com/tunatic/ chttp://www.shazam.com dhttp://www.midomi.com ehttp://www.musipedia.org fhttp://www.ritmoteka.ru ghttp://www.bored.com/songtapper hhttp://www.peachnote.com ihttp://www.folktunefinder.com Mobile devices with touch screens affect strongly usage aspects of music searching interfaces. Such devices make possible simulating many kinds of music instruments, although the virtual piano-style keyboard remains the most popular interface. 3.1 Transformation of input styles We represent relationships between selected music query input styles in form of an oriented graph. In Figure 2 blue nodes correspond to query representation, red nodes correspond to input methods. Notes are used to represent both a query and an input method, and denoted by using grey color. Every transition arc denoted by latin letters (a to r) shows the possible transformation from one input style to another, namely: a) synthesis & automatic notes transcription; b,c,g) equivalent symbolic transformation; d) pitch estimation; e) pitch sequence with fixed rhythm pattern; f) rhythm with fixed pitch pattern; h) keep only pitch values; i) keep only time intervals; j) calculate pitch intervals6; k) calculate inter offset intervals7; l) compare pitches; m) compare time intervals; n) pitch sequence with fixed pitch interval pattern; o) rhythm with fixed IOI pattern; p) compare pitch intervals; q) compare IOI; r) onset time estimation. Since the virtual keyboard based query implicitly includes such note attributes as it's start time, duration and 6Strictly, this is one way transformation (back transformation produces many transpositions, but as we said before, a comparison algorithm should be robust against transpositions) 7Strictly, this is one way transformation (back transformation produces many different tempos, but as we said before, a comparison algorithm should be robust against tempo fluctuation) pitch value, there is no much difficulty to transform the keyboard input into the rhythm or pitch notation. The same information (notes) could be extracted from the voice input. Query-by-humming searching machines usually don't need such kind of transformations and use hummed or singed input "as is". However it's still possible to transform singed input into symbolic form in order to create queries for search machines that don't support query-by-humming method. Most recent comparative study of pitch extraction algorithms can be found in [18], and most recent results for multiple fundamental frequency estimation task can be found on MIREX page8. Regardless of how we get the notes, we can easily transform them into a rhythm or a pitch notation. This transformation is not lossless. When we transform notes into rhythm we lose information about pitch values, and when we transforming to pitch sequence we lose information about rhythm. Next we can reduce absolute values of pitches and time intervals, and we get sequence of pitch-intervals or sequence of Inter Offset Intervals (lOI). Then we can reduce interval values and get melodic or rhythmic contour. Again this is one-way transformation because we lose an information about the value of an interval and leave only sign of the value encoded with letters 'U', 'R' and'D'. Clear that walking through the graph from left to right we reduce the user query, and therefore it seems we couldn't expect better searching results. However such transformations may have sense for at least two reasons: - We attempt to emphasize the meaning of special 8http://www.music-ir.org/mirex/wiki/2013: Multiple_Fundamental_Frequency_Estimation_\%2 6_ Tracking_Results Input methods Query representation Pitch Cntr^ (URD) J > ambiguous transformation ^ unambiguous transformation Figure 2: Graph of music query transformations. melody attributes. - We would like to try to connect a search service which probably uses quite different music database (e.g. specialized on some music genre9) although it supports only restricted input methods (e.g. rhythm or pitch notation). Regarding to the user interface issues, the ability to move from one input style to another renders possible to switch easily between different searching systems within the framework of one mobile or web application without re-entering the query. 3.2 Models to represent queries Usually user queries are relatively short (and it is true not only for the case of music [19]), so we use sequence of Note objects to represent the searching query. The attributes of a Note object are the following: - Note name according to the American pitch notation - Its Octave number - Its Onset time - Its End time This representation is equivalent to a subset of MIDI subset of MusicXML format. It means that we can get query in MusicXML format directly from our representation in order to upload it to statistics server. XML based standard is always a good choice for future compatibility. An ability to upload user queries to a server is extremely 9We turn our attention to the example of such a case in the following section of this paper. important feature for the domain of MIR research, because this is the cheapest way to receive information for further investigations 10. For instance, collected information can be used for modeling user errors. This representation is equivalent to notes representation so we can easily transform it into the desired form like a sequence of pitches or a rhythm pattern. 4 Introducing mobile application for accessing music search services Nowadays, people are happy to use their mobile devices to access different search services at any time from any place. They use different types of such devices which may have different input mechanisms like phone keys, qwerty-keyboards, touch screens, voice recognition, and so on. The variety of devices running on Android operating system is rapidly increasing during last years, so we decided to use Android platform for our music searching application. For our implementation we selected some music searchers which may be accessed programmatically, particularly: Musipedia, Music Ngram Viewer, and Folk-TuneFinder. For three searching systems we implemented four user query input styles: - Note score editor supporting one voice definition - Parsons code - Rhythm tapping 10This approach is very similar to a Game With A Purpose (GWAP) that is widely used to collect annotations for data. As an example we can cite "Major Miner - music labeling game" for audio data or "Google Image Labeler" for pictures. - Piano style virtual keyboard with additional representation of American pitch notation 4.1 Application architecture General application construction ideas are common for various operating platforms. Despite the fact that eventually we developed an application for Android operating system, here we describe common application architecture in platform independent way, but keeping in mind that target device is a mobile device. The Android application can serve as a model for implementing flexible human centric interface which is oriented to present-day style of using hardware and software facilities of various mobile devices. Figure 4 represents main components of our music searching application. There are following UI and non UI components shown in Figure 4: - UI views (boxes with blue background) - Query transformer - Search system adapters (one adapter for each search system, boxes with red background) - Serializers (either JSON or XML) - Connectors (only one connector supported so far: HTTP) This is a sort of scalable architecture. We can easily add new query input methods, new search machines or new communication protocols (like FTP or even SMTP if required). The theory of operations is the following. The main view InputStyleSelection provides the interface for input style choice. According to the selected input style the respective view (MelodyContour, MusicScore, VirtualPi-ano, or RhythmTapper) opens and provides the corresponding input interface. Depending on the input style the user query could be either a sequence of Note objects (Music Score and Virtual Piano produce this output), Rhythm (produced by RhythmTapper) or Melody contour. Then the application iterates through a collection of available adapters for search engines. Depending on the adapter capabilities the input query may be transformed to the acceptable representation. Then the adapter performs a request to a search service. The request is serialized with a serializer (JSON or XML) and performed through one of available connectors. The response is parsed by the appropriate adapter and added to a list of search results to be displayed by the SearchResults view. With respect to search services' application interfaces mentioned in section 2, the web information exchange protocol adapters have been implemented. The SOAP protocol is not recommended for mobile devices since it uses verbose XML format and may be considerably slower in comparison with other middleware technologies. Unfortunately it is the only way to communicate with the Musipedia system. In our case, the mentioned SOAP disadvantages shouldn't case concern since the exchange occurs relatively rarely, only when the respective button is pressed by a user, and there is small amount of information being transferred. We use org.ksoap2 Java package [20] containing classes required for handling SOAP envelopes and literal XML content. To implement interaction with other search services (based on the REST architecture and wrapping their responses in JSON format which is typically more compact in comparison with XML) we use Google Gson Java library [21]. It allows converting Java objects into their JSON representation as well as backward converting JSON strings to equivalent Java objects. 4.2 Usage example The application starts with a welcome screen for the preferred input style selection (Figure 4). Figure 4: Main activity: input style selection. Then a user selects an input style. For example if a user selects the virtual keyboard interface, the virtual piano is displayed on the screen. The melody is stored in form of a note sequence with respect to the following related data: a pitch represented in the American pitch notation (note name and octave number), onset time, end time. Other properties may be computed depending on the requirements of a music searcher. Let us illustrate this by the input represented in form of a simplified timing chart (with respect to the note names rather than sound frequencies). The chart in Figure 5 represents some first notes of the well known Russian folk song "Birch Tree". For the reason that Musipedia searcher requires a sequence of triplets containing an onset time, a MIDI pitch and its duration, the user input shown in Figure 5 is converted to the following query data: Figure 3: Mobile music searching application architecture. 4 E5E5E5E5 D5 C5C5 B4 A4 onset time l0,0 0,66 1,21 1,72 2,41 3,51 3,89 4,62 5,57 end time 0,54 1,13 1,64 2,22 3,39 3,78 4,41 5,43 Figure 5: Test melody: note score representation and timing chart. 0.0, 76, 0.54; 0.66, 76, 0.47; 1.21, 76, 1.43; 1.72, 76, 0.50; 2.41, 74, 0.98; 3.57, 72, 0.27; 3.89, 72, 0.52; 4.62, 71, 0.81; 5.57, 69, 0.75; After this the respective information is included to the SOAP request which is subsequently sent to the Musipedia server. As a result, the searching system returns the list of retrieved compositions as shown in Figure 6(a) 11. We see the confirmation of the known fact that this melody was ^^We selected Tchaikovsky's work, but as you can see, the similar theme may also be recognized in some other known compositions. used by Piotr Tchaikovsky in the 4th movement of his Symphony No.4 in F-moll (compare with the fragment of the symphony note score shown in Figure 7). X ml 9:50 si III ' 3:27 MelodyRetTieved 1 MelodyRetrieved Romance d'amour Pfaff und Ziegeldecl Function -functions -caller ♦ -components \ / sec -calee -interface Component Interface -property InterfaceProperty -property ComponentProperty ApplicationComponent InfrastructureComponent Figure 2: Foundational model of software architecture and its properties T •QualityAttribute t •compability ; #Co-Exlstence - # Interoperability # Efficiency = PerformanceEfficiency • EfficiencyCompiiance • ResourceUtilisation #T1meB«haviour ▼ @Functionaiity = Functionai Suitability ^Accuracy = FunctionalCDrrectnes& • FunctionalityCompliance • interoperability • suitability = FunctionalAppropriateness T ©FunctionalSultabllity = Functionality • FunctionalAppropriateness = Suitability • FunctionalCompletene&s • FunctionalCorrectness = Accuracy T #Maintainabillty •Analyzability • MaintainabilltyCompllance I •Modifiablility • Modularity • Testability y •PerfomianceEfliciency = Etficiency ^ •capacity t •Portability • Adaptability • Co-Existence • in&tallabllity • PortabilityCDmpllance • Replaceabillty t •Reliability • Availability • FaultTolerance • Maturity • RecDverabillty • ReliabilityCompliance • security • Accountibllity •Authenticity •confidentiality • integrity • Non-repudiation • Usability •Accessibility • AppropriatenessRecognizability = Understandabllity •Attractiveness = UserlntertaceAesthetics • Learn ability • Operability • Understandabllity = AppropriatenessRecognizability • usabilityCompliance • UserErrorProtection • UserlntertaceAesthetics = Attractiveness Figure 5: The tree of quality attributes (according to ISO/IEC 9126 and ISO/IEC 25010) Table 1: Component properties Property (values) Description PlatformTechnology (CORBA, EJB, JINI, RMI) Set of technologies used on the platform. ComponentLogic (flexible, fixed, rulebased) Specifies an approach the component logic implementation. Platform Defines the component platform. Has several subclasses: ApplicationServer, Hardware, Operat-ingSystem and VirtualServer ProgrammingLanguage (Cpp, Java, Ruby, PHP, Erlang, Python, C, C_sharp ) Define programming language used to implement a component. StatePersistence (Stateless, Statefull) Specifies whether a component saves internal data during and in between calls of operations on the client's behalf. The ontology provides a taxonomy of quality attributes. Quality attribute is a nonfunctional characteristic of a component or a system. It represents the degree to which software possesses a desired combination of properties, which are defined by means of externally observable features of software systems. Some of the attributes are related to the overall system design, while others are specific to run-time or design time. Quality attributes can be categorized into two broad groups: attributes that can be directly measured (e.g. performance) and attributes that can be indirectly measured (e.g., usability or maintainability). In the latter category, attributes are divided into subcharacteristics. SOAROAD ontology defines 30 quality attributes including both terms defined in software quality model by the ISO/IEC 9126-1 norm [14] and those arising directly from requirements to architectures formulated in the SOA man- Table 2: Properties describing platform (subclasses of Platform). Property (values) Description ApplicationServer Subclass of Platform. Defines an application server on which a component is deployed, can have such attributes, as: version (string), vendor (string) JEECompliantAS (TomEE, Glassfish, JBoss, Interstage, JOnAS, Geronimo, SAPNeatWeaver, WebSphere, Resin, ColdFusion, WebLogic ) Subclass of ApplicationServer dedicated to JEE compliant components. DotNetCompliant-AS (AppFabric, IIS, TNAPS, Base4, Mono) Subclass of ApplicationServer; its individuals define products for .NET enviroment JavaAS (Jetty, Enhydra, iPlanet) Application servers for Java environment Hardware Subclass of Platform. Used to specify a hardware configuration on which the component is deployed. Attributes: memory (double), processor (string), number_of_cores (int) OperatingSystem (Windows, Unix, Linux, iOS, Android, Bada, Blackberry ) Subclass of Platform. Defines types of operating systems on which a component is executed. Attributes: version (string), vendor (string), product (string) VirtualServer (no, yes) Subclass of Platform. Specifies whether a component is deployed on a virtual server ifesto . Examples of classes belonging to the first group (see Fig. 5) are: Functionality, Reliability, Usability, Efficiency, Maintainability and Portability. The example of classes originating from SOA manifesto are ServiceAuton-omy, PlatformIndependency, LooseCoupling, Modularity, OpenStandardAdoptation, BusinessAgility etc. When designing an applications to meet quality requirements, it is necessary to consider a potential impact of design properties on various quality attributes. SOAROAD ontology defines influences object property to this kind of relation. A design pattern can be seen as a structure build of components of particular types, defining their roles and relations among them together with a set of restrictions on their usage. Design patterns do not change the functionalities of a system but only the organization or structure of those functionalities. One of the most important benefits of using design patterns is that they constitute standardized software building blocks with a well defined influence on quality attributes. In SOAROAD ontology the class De-signPattern has 56 subclasses representing patterns dedicated to SOA architecture. The examples of subclasses are: EnterpriseServiceBus, EventDrivenMessaging, Orchestration. The relation is_described_by links a particular Com-positionProperty to one of the defined design patterns. 6 Example We illustrate the proposed approach on an example of a small system aimed at publishing and browsing of free of charge announces. The diagram in Fig. 6 gives the system architecture specified in ArchiMate language. As it can be noticed, two layers: application and technology are http://www.soa-manifesto.org/ presented. In the application layer several system components are distinguished: Data Base with the SQL interface, three Java beans: Announcement JPA (Java Persistence API), Announcement Business Logic and a Facade providing Announcement WS - web service based interface. The last component of the application layer visible on the diagram is Announcement JSF Presenter being responsible for presentation and interaction with end users. It plays here the role of web service consumer. The components are packaged as three artifacts: ANN_DB (Post-greSQL), ann.ear and ann_pres.war and deployed at three separate servers (technology layer nodes) linked with two connections: JDBC and WS Presenter. The above specification is an input for ArchiMate Import tool (indicated in Fig. 1) that transforms it into Architecture Views ABox. Fig. 7 gives and excerpt of this ontology (node marked with boldlines). We focus on three elements application layer: Announcement Facade, WS interface and accessing it JSFpresenter. Following the foundational model that encompasses connections and their properties, the WS Presenter Connection was also included. The remaining elements of of ArchiMate specification are converted into properties. The tool supporting architecture description allows to assign various properties (architectural decisions) to ontology individuals corresponding to components and connections of the software architecture. In the presented example: - Announcement Facade is deployed on Intel Xeon 2.13 GHz machine running Ubuntu 10.4 system and Glass-Fish application server. - Announcement WS is a SOAP web service with low query granularity and exception handling based on soap faults. - WS Presenter Connection is asynchronous, uses SSL based protection mechanism and 10Gb network. - Announcement JSF Presnter is deployed on JBoss application server and is stateless. The resulting graph of interconnected elements with assigned properties presented in a user-friendly browseable form can be input to ATAM analysis performed in the standard manner. The SOAROAD ontology specifies additional relations (Fig. 8) that can be used in architecture assessment. The supports relation indicates that particular elements can be used together, e.g. JBoss (ApplicationServer) supports Document.Literal (SOAP web service style). The supports property has two subproperties: sup-ports_fully and supports_partially, that can be used to indicate possible incompatibility issues. Another way to define potentially conflicting architectural decisions is to use Conflict objects (reified multirole properties) that indicate sets of properties, which should not be used together, provide specification of conflict levels (e.g. partially_compatible, Figure 6: Announcement system expressed in Archimate language. ExcptH.soapFault -number_of_cores -processor Xeon 2.13GHz 4GB StatePers.stateless -interface ^^ ^ ^^ ^^ ' -caller ^^ ^_,ZAnnouncement\^__y'ws PresenterN_^ Ann. JSF ^ Figure 7: ABox describing the architecture of the announcement system. Elements of an architectural view (marked with boldlines) are assigned with design decisions (individuals of classes defined in the ontology) 8 Figure 8: Relations between properties incompatible, error_prone) and textual description (rationales). The required relation can be used to specify that one element requires another. Such assertions can be explored, while reasoning about implicit decisions, i.e. resulting from earlier assignments. The SOAROAD ontology is formalized in the OWL language. In consequence, it should follow the Open World Assumption (OWA) to be compatible with OWL reasoners, e.g. Pellet, Fact+ or Racer. According to OWA, the following approach was adopted: - A lack of the assertion on property of a particular type, means that nothing is known about the assignment. For example in Fig. 7 no information is provided about the hardware or operating system for Annotations JSF Presenter. - A lack of decision is represented explicitly by an individual (constant) of a particular type, e.g. and individual OperatingSystem.not_decided can be assigned to Annotations JSF Presenter. 7 Conclusion This paper describes the SOAROAD ontology and a concept of a tool that supports the documenting architectures of SOA-based systems. The proposed approach addresses the problem that can be encountered during architecture assessment: to be reliable, a reasoning about architecture qualities, must have solid foundations in a knowledge related to a particular domain: architectural styles, design patterns, used technologies and products. The idea behind SOAROAD ontology is to gather experts knowledge to enable even inexperienced users performing ATAM-based architecture evaluation. An advantage of the presented approach is that its result is a joint representation of architecture views and properties attributed to design elements formalized in OWL language. From a software engineering perspective, such centralized information resource may represent a valuable artifact, which, if maintained during the software lifecycle, can provide reference to design decisions that can be examined later in the integration, testing and deployment phases. On the other hand, the machine interpretable representation, constituting a graph of interconnected objects (individuals), can be processed automatically to check consistency, detect potential flaws and calculate metrics. An extensive list of metrics related to architectural design was defined in [35]. We plan to adapt them to match the structural relations in the SOAROAD ontology, as well to develop new ones. Another direction that is at present researched is an application of fuzzy reasoning to evaluate quality attributes. We use fuzzy Mamdani rules encoded in SWRL language defining influence of selected design decisions on quality. The approach taken follows the idea presented in [31]. Further plans are related to the extensions of the currently developed tool. At present its functionality is limited to building the architecture description. Our intention is to fully integrate it with ATAM process allowing specifying scenarios, describing sensitivity points, tradeoffs and risks. Conflicting decisions of the same type can be attributed to a component, e.g. Annotations JSF Presenter can be attributed with Windows and Linux properties. Such conflicts reflect, that in a certain step an alternative is envisaged. During an evaluation process (possibly supported by reasoning with the use of a separately developed set of SWRL rules) such an assertion can be indicated as non valid. Negative assertions about properties are represented by a special ban relation, whose object can be an anonymous individual of a selected type. For example an assertion (Annotations JSF Presenter, ban, IOS.anonymous) can be made, where IOS.anonymous belongs to the class IOS (operating system). References [1] Jena - a semantic web framework for java. [2] Archi, archimate modelling tool, 2011. [Online; accessed 23-June-2012]. [3] P. Bianco, R. Kotermanski, and P. Merson. Evaluating a service-oriented architecture. Technical Report CMU/SEI-2007-TR-015, Carnegie Mellon, September 2007. [4] N. Bouck'e, D. Weyns, K. Schelfthout, and T. Holvoet. Applying the ATAM to an Architecture for Decentralized Control of a Transportation System, volume 4214, pages 180-198. Springer, 2006. [5] R. Capilla, F. Nava, S. Pčrez, and J. C. Duenas. A web-based tool for managing architectural design decisions. ACM SIGSOFT Software Engineering Notes, 31(5), 2006. [6] Y.-C. Chang, P. Mazzoleni, G. A. Mihaila, and D. Cohn. Solving the service composition puzzle. IEEE SCC, 2:387-394, 2008. [7] P. Clements, R. Kazman, and M. Klein. Evaluating Software Architectures: Methods and Case Studies. Addison-Wesley Professional, 2001. [8] R. C. de Boer, P. Lago, A. Telea, and H. van Vliet. Ontology-driven visualization of architectural design decisions. In WICSA/ECSA, pages 51-60. IEEE, 2009. [9] A. Erfanian and F. S. Aliee. An ontology-driven software architecture evaluation method. In Proceedings of the 3rd international workshop on Sharing and reusing architectural knowledge, SHARK '08, pages 79-86, New York, NY, USA, 2008. ACM. [10] S. Ferber, P. Heidl, and P. Lutz. Reviewing product line architectures: Experience report ofATAM in an automotive context, volume 2290, pages 364-382. Springer, 2001. [11] M. Fernandez-Lopez, A. Gomez-Perez, and N. Ju-risto. Methontology: from ontological art towards on-tological engineering. In Proceedings of the AAAI97 Spring Symposium, pages 33-40, Stanford, USA, March 1997. [12] M. Gruninger and M. S. Fox. Methodology for the design and evaluation of ontologies. In International Joint Conference on Artificial Inteligence (IJCAI95), Workshop on Basic Ontological Issues in Knowledge Sharing, 1995. [13] IEEE. IEEE standard 1471-2000, ieee recommended practice for architectural description of softwareintensive systems, 2000. [14] ISO/IEC. Software engineering - product quality, ISO/IEC 9126-1. Technical report, International Organization for Standardization, 2001. [15] ISO/IEC. ISO/IEC cd 25010-3: Systems and software engineering - software product quality requirements and evaluation (SQuaRE) - software product quality and system quality in use models. Technical report, International Organization for Standardization, 2009. [16] M. Javanbakht, M. Pourkamali, and F. M. Derakhshi. A new method for enterprise architecture assessment and decision-making about improvement or redesign. Proceedings of the Fourth International Multi-Conference on Computing in the Global Information Technology, pages 69-76, 2009. [17] L. G. Jones and A. J. Lattanze. Using the architecture tradeoff analysis method to evaluate a wargame simulation system: A case study. Technical Report CMUSEI2001TN022 Software Engineering Institute Carnegie Mellon University Pittsburgh PA, (Decem-ber):33, 2001. [18] Kazman. Atam:method for architecture evaluation. CMUSEI2000TR004, 2000. [19] R. Kazman, M. Barbacci, M. Klein, J. Carriere, and S. G. Woods. Experience with performing architecture tradeoff analysis. Proceedings of the 21st international conference on Software engineering ICSE 99, pages 54-63, 1999. [20] R. Kazman, L. Bass, G. Abowd, and M. Webb. SAAM: a method for analyzing the properties of software architectures, volume 16pp, pages 81-90. IEEE Comput. Soc. Press, 1994. [21] R. Kazman, L. Bass, and M. Klein. The essential components of software architecture design and analysis. Journal of Systems and Software, 79(8):1207-1216, 2006. [22] P. Kruchten. An ontology of architectural design decisions in software intensive systems, pages 54-61. Citeseer, 2004. [23] M. Lange and M. Jan. An experts' perspective on enterprise architecture goals, framework adoption and benefit assessment. Proceedings of the 15th IEEE International Enterprise Distributed Object Computing Conference Workshops, pages 304-313, 2011. [24] J. Lee, S. Kang, H. Chun, B. Park, and C. Lim. Analysis of VAN-core system architecture- a case study of applying the ATAM. In Proceedings of the 2009 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, SNPD '09, pages 358-363, Washington, DC, USA, 2009. IEEE Computer Society. [25] L. Lee and P. Kruchten. Visualizing Software Architectural Design Decisions, volume 5292, pages 359362. Springer-Verlag, 2008. [26] B. Roy and T. C. N. Graham. Methods for evaluating software architecture : A survey. Computing, 545(2008-545):82, 2008. [27] J. Rumbaugh, I. Jacobson, and G. Booch. Unified Modeling Language Reference Manual, The (2nd Edition). Pearson Higher Education, 2004. [28] M. Shaw and D. Garlan. Software Architecture: Perspectives on an Emerging Discipline, volume 123. Prentice Hall, 1996. [29] J. Sliwa, K. Gleba, W. Chmiel, P. Szwed, and A. Glowacz. IOEM - ontology engineering methodology for large systems. In P. Jedrzejowicz, N. T. Nguyen, and K. Hoang, editors, ICCCI (1), volume 6922 of Lecture Notes in Computer Science, pages 602-611. Springer, 2011. [30] H. Song and Y.-T. Song. Enterprise architecture institutionalization and assessment. Proceedings of the 9th IEEE/ACIS International Conference on Computer and Information Science, pages 870-875, 2010. [31] P. Szwed. Application of fuzzy ontological reasoning in an implementation of medical guidelines. In Human System Interaction (HSI), 2013 The 6th International Conference on, pages 342-349, 2013. [32] P. Szwed, I. Wojnicki, S. Ernst, and A. Glowacz. Application of new ATAM tools to evaluation of the dynamic map architecture. In A. Dziech and A. Czyzewski, editors, Multimedia Communications, Services and Security, volume 368 of Communications in Computer and Information Science, pages 248-261. Springer Berlin Heidelberg, 2013. [33] The Open Group. Archimate 1.0 specificattion, 2009. [34] H. Van Den Berg, H. Bosma, G. Dijk, H. Van Drunen, J. Van Gijsen, F. Langeveld, J. Luijpers, T. Nguyen, R. Oosting, Gerand Slagter, and et al. ArchiMate made practical. Work, 2007. [35] A. Vasconcelos, P. Sousa, and J. Tribolet. Information system architecture metrics: an enterprise engineering evaluation approach. The Electronic Journal Information Systems Evaluation, 10(1):91-122, 2007. [36] P. Wallin, J. Froberg, and J. Axelsson. Making decisions in integration of automotive software and electronics: A method based on ATAM and AHP. Fourth International Workshop on Software Engineering for Automotive Systems SEAS 07, pages 5-5, 2007. [37] N. Zhou and L.-J. Zhang. Analytic architecture assessment in soa solution design and its engineering application. Proceedings of the IEEE International Conference on Web Services, pages 807-814, 2009. Artificial Immune Based Cryptography Optimization Algorithm Xuanwu Zhou1, 2, Kaihua Liu1, Zhigang Jin 1, Shourong Tian3, Yan Fu13 and Lianmin Qin3 1 School of Electronics and Information Engineering, Tianjin University, Tianjin 300072, China 2 Command College of the Chinese Armed Police Forces, Tianjin 300250, China 3 Administrative Centre of Yantai Tax-free Port, Yantai 265400, China E-mail: schwoodchow@163.com Keywords: artificial immune, cryptography optimization, blind signcryption, optimization coefficient, ECDSA Received: November 8, 2012 In the paper, an improved clone selection algorithm for cryptography optimization is proposed, the algorithm integrates genetic algorithm with immune computing and makes use of reproduction and mutation operator to maintain the diversity and optimization of candidate objects. As an experiment of the clone algorithm, a blind signcryption scheme with immune optimized parameter is proposed. In the signcryption scheme, parameters generated with clone selection have relatively higher level of fitness and thus avoids the arbitrary selection of essential parameters. Then we analyze the efficiency and feasibility of immune optimization algorithms with experiment data from the signcryption scheme. The reproduction operator in the algorithm can greatly improve the fitness level of candidate group, while the mutation operator effectively maintains the diversity of candidate individuals. In the experiment, the optimization coefficient (OC) reaches 0.9301 when the clone algorithm is executed just once. Lastly, we make detailed comparison between the optimized signcryption scheme and other typical schemes, including the blind signature of D.Chaum and the ECDSA signature. The data from the experiment and comparison show that the optimization algorithm can effectively improve the efficiency and accuracy of parameter optimization in cryptography systems. Povzetek: Predstavljen je izviren algoritem za kriptografsko optimizacijo, ki temelji na genetskih imunskih sistemih. 1 Introduction Artificial immune system is an important branch of computation intelligence; it simulates the architecture and operating pattern of biological immune system and makes full use of the superior bionic mechanisms. In terms of computing ability, biological immune system is a self-adaptive and self-organized system with highly distributed and parallel architecture, and it has prominent capability in learning, recognition, memorizing and property extracting. Artificial immune system is an application-orientated model of biological immune system based on the bionic mechanisms; it also has superb capability in data processing and problem solving. Presently, artificial immune system has been widely applied in pattern recognition, intelligent optimizing, machine learning, data mining and information security, etc [1,2,3]. In traditional cryptography schemes, system parameters are simply generated with pseudo-random generator or the selection process is just overlooked. The arbitrary selection of system parameters makes the cryptography system more vulnerable to malicious attack. In order to reinforce the stability and security of cryptography algorithms, the random parameters can be generated by intelligent optimization algorithm with random selection. In this paper, we propose an improved clone selection algorithm which integrates genetic algorithm with immune optimization algorithm. Then a signcryption scheme with immune optimized parameter is proposed as an experiment of the clone selection algorithm. Then the optimized signcryption scheme is compared with other typical schemes, including blind signature of D. Chaum and the ECDSA signature scheme. The signcryption scheme and the experiment show that the improved clone algorithm can effectively improve the efficiency and accuracy of parameter optimization in cryptography systems. 2 Artificial immune system and its algorithms Artificial immune system (AIS) is a series of algorithms and systems based on the superior architecture and operating mechanism in biological immune system. Artificial immune system has a wide application in pattern recognition, intelligent optimizing, machine learning, data mining and information security, etc. Biological immune system can recognize and clear invading pathogens, toxin, tumour cells from genetic mutation and prostrate cells to achieve immune defending effect and organism homeostasis. One of two immune responses is innate immune response taking rapid defending measures at first, which is fulfilled by skin, mucous membrane, phagocyte cells, natural killer, compliments etc. The other is adaptive immune response that is mainly executed by T lymphocyte cells and B lymphocyte cells. The hierarchical defence structure of biological immune system is demonstrated in Figure 1[4,5,6]. Skin Physiological environment Innate immunity Acquired immunity Phagocyte cells Lymphocyte cells Figure 1: Hierarchical defence structure of biological immune system. Biological immune system has superior ability to learn, memorize and recognize information, and its operating mechanism is characterized with self-organizing, distribution and diversity. Therefore many researchers have been applying the superior bionic mechanisms of biological immune system to develop corresponding models and algorithms in artificial immune system. The basic bionic mechanisms of biological immune system can be categorized as: immune learning, immune memorizing, immune recognition, clone selection, diversity, distribution, self-adapting and immune network [7, 8]. Simulating the architecture and operating pattern of biological immune system, artificial immune system has three types of immune algorithms: basic immune algorithm, negative selection algorithm and clone selection algorithm. Basic immune algorithm: generally, basic immune algorithms have similar searching strategy to genetic algorithms, and they also apply selecting and mutating in optimization. Negative selection algorithm: this algorithm is based on the principles of negative selection in biological immune system. Negative selection provides protection against mistaken immune response toward normal organisms. Clone selection algorithm: this algorithm is based on the principles of clone selection in biological immune system. In clone selection algorithms, individual objects will also undergo a process of clone reproduction with the stimulus from corresponding evaluation function (antigen). In the process of clone selection, the objects with higher suitability will be selected for reproduction and the suitability (affinity) and scale of these objects will also be gradually improved [9, 10, 11]. 3 Immune optimization in cryptography schemes In cryptography schemes, the proper selection of certain parameters is essential to the security and feasibility of the whole system. In many schemes, such parameters are simply generated with pseudo-random generator or the selection process is just overlooked. The arbitrary selection of system parameters makes the cryptography system more vulnerable to malicious attack. In the scheme, we introduce clone selection optimization into the design and analyzing of cryptography schemes, and put forward improved cryptography schemes with artificial immune optimization. 3.1 Parameter optimization algorithm An optimized random parameter is first generated with clone selection algorithm. (1) Encoding .The candidate parameters should first be encoded as a number string. In our example, we set the length of string as 4 bits. (2) Initial group generating. The initial group is selected with random. And the number of individuals in the group is set as 5 for the convenience of computing. In our example, they are: X1= (0, 0, 0, 1), X2= (0, 1, 1, 0), X3= (1, 1, 1, 0), X4= (1, 0, 1, 0), X5= (1, 0, 0, 1). (3) Computing and evaluating of fitness. To evaluate the fitness of parameters, we use an objective function to decide the difference between selected strings. In our example, we use a linear function to compute the maximum function value as the standard of selection. The objective function is set as: f(x) =-x2+2x+1. (1) Then we compute the function value of different strings. f(x1) = ./(0001) =2, f,X2) = ./(0110) =-23, ./(X3) = ./(1110) =-167, f,X4) = ./(1010) =-79, f,X5) = ./(1001) =-62. (4) Reproduction of individuals. Mimicking clone selection of immune system, a certain number of individuals with high level of fitness should be selected for reproduction. In our example, two individuals with the highest level of fitness will be selected for reproduction, and the scale of reproduction is also directly proportional to its level of fitness. The function value of string X1, X2 is the highest, so the two strings should be selected for reproduction to increase their perception in the whole group. In proportion to the level of fitness, string X1will be reproduced twice, and string X2 once. Now, the temporary individuals in the group are: X1= (0, 0, 0, 1), X1= (0, 0, 0, 1), X1= (0, 0, 0, 1), X2= (0, 1, 1, 0), X2= (0, 1, 1, 0), X3= (1, 1, 1, 0), X4= (1, 0, 1, 0), X5= (1, 0, 0, 1). (5) Mutation. There are two mutation operations in the clone selection algorithm: crossover and self-mutation. In crossover mutation, two individuals are selected from the temporary group to develop into two new stings by exchanging some value of the string. The probability of crossover is connected with the level of fitness in inverse proportion. In our example, string X3, X4and X5have the lowest level of fitness, sox3and X5 are selected to exchange the latter two bits. And two new strings are generated [12, 13, 14]. X6= (1, 1, 0, 1) , X7= (1, 0, 1, 0) = X4. In self-mutation operation, the algorithm will change some bits in some selected strings, that is 0 to 1 or 1 to 0. The probability of mutation is also connected with the level of fitness in inverse proportion. In the example, X4= (1, 0, 1, 0) has relatively lower level of fitness, and will be selected to change some bits of itself. And the new string is X8= (0, 0, 1, 0). (6) Repeating of algorithm. Then we should also compute the function value of the new strings, and make the decision of reproduction, mutation and discarding. f(X1) = ./(0001) =2, J(X2) = ./(0110) =-23, J(X3) = ./(1110) =-167, J(X4) = ./(1010) =-79, J(X5) = ./(1001) =-62, J(X6) = ./(1101) =-142, J(X8) = ./(0010) =1. Then the above operating algorithm will be repeated for certain times until the requirement is satisfied. In our example, the algorithm will be executed only once, and the final temporary strings in the group are: X1= (0, 0, 0, 1), X1= (0, 0, 0, 1), X1= (0, 0, 0, 1), X2= (0, 1, 1, 0), X2= (0, 1, 1, 0), X3= (1, 1, 1, 0), X4= (1, 0, 1, 0), X5= (1, 0, 0, 1), X6= (1, 1, 0, 1), X8= (0, 0, 1, 0). After comparing the function values of different strings, the strings with the lowest fitness level will be excluded from the group. In this example, five strings with the lowest level of fitness should be excluded from the group to keep the stability of group scale, they are X3= (1, 1, 1, 0), X6= (1, 1, 0, 1), X4= (1, 0, 1, 0), X5= (1, 0, 0, 1) , X2= (0, 1, 1, 0). And the final optimized strings of the group are: X1= (0, 0, 0, 1), X1= (0, 0, 0, 1), X1= (0, 0, 0, 1), X2= (0, 1, 1, 0), X8= (0, 0, 1, 0). 3.2 Immune optimized blind signcryption In our scheme, the random parameter is generated with the above optimizing algorithm in advance, and other secret parameters of the scheme can also be generated with clone selection algorithm in advance. Definition 3.2.1 (Elliptic Curve) an elliptic curve E{F) over finite field F is a sextuple: T = (q , a , b , P , l, h ), where P =( ^ , ^ ) is the base point of E (F) , prime l is the order of P .As to t G Z*, Q and G e E(F ), Q = tG denotes multiple double additions on elliptic curve. O is the point at infinity, satisfying IP = O and G + O = G for any point G G E( Fq ) [15,16,17]. Definition 3.2.2 (ECDLP, Elliptic Curve Discrete Logarithm Problem). ECDLP is the following computation X ^ ECDLP(Q, P) ( P is a base point and Q ^P , X e Z* , Q = xP). In the scheme, user A entrusts signcryption generator B to generate a signcryption for message m e Z* without disclosing any information about it. O = (GC, GK, BSC, USC) Common parameters generation: GC (1^ ) ="On input (1^ ) : (T, H,(E, D)) ^ GC (1") T = (q , a , b , P , l, h )where P =( Xp , ^ ) is the base point of E(F), ord(P) = l is a prime, O is the point at infinity. H : {0,1}* ^ Z*, (E , D) is secure symmetric encryption/decryption algorithm. Key pair generation: GK () ="On input (): s^ Z*, P^ = s^P ^ O, ( skA , PK A ) ^ GK (B,1k) ="On input (B,1k): s^ Z*, P^ = s^P ^ O, (skB, PKb ) ^ Signcryption generating: BSC(s^, P^, m) ="On input (s^, P^ , C): r ^ Z*, R = rP ^ O, A ^Q. (u, v, w) ^ Z*, U = uP^ ^ O, k = (U) xmod( E (•)!), C = E, (m), h ^ H(mWlDQ ), F = (h + w)R - vP, e = (h + w)modl, A Q t = (s^ + er)modl,i ^ Z*,I = iP ^ O. A —— Q s = u-^(t - v - h) mod l, A — Q . h' ^ H(c\|(I)x) , s' = (i - skAh') mod l, A ——— Q. s'P + h'P^ = iP = I,h' ? = H(c\\(IX), C C, h,h', s, s', F)." Unsigncryption algorithm: USC (s^, P^ , C )="On input (s^, P^ , C): If i Z* orP^ P > return 1, Parse C into (c, h , s, F , h', s'), If s, si Z* or c i Sp or F i< P > return 1, else coefficient(OC) y can be defined as the following formula. 4 s-'s^ (P^ + F - hP) = U, k = (U^mod(EC-)!) , m = D (c), h ? = H(m\\IDQ) , If the equation holds return m , else return 1 Analysis of the optimization scheme Artificial immune optimization is the simulation of biological immune system and it is also an improved genetic algorithm with biological inheritance and natural selection mechanism. Clone selection algorithm in artificial immune system is an iteration algorithm. While searching for optimized group, clone selection generates a new improved individual from the original one; and from the improved one to another further improved one. Therefore, clone selection algorithm has much superiority in efficiency and stability compared with other optimization algorithm. In our scheme, the optimization of random parameters is executed only once, but the fitness level of the strings has been greatly improved. The comparison can be made in the following table. Initial group Fitness level Temporary group Fitness level Optimized group Fitness level X1= (0, 0, 0, 1) 2 X1= (0, 0, 0, 1) 2 X1= (0, 0, 0, 1) 2 X1= (0, 0, 0, 1) 2 X2= (0, 1, 1, 0) -23 X1= (0, 0, 0, 1) 2 X1= (0, 0, 0, 1) 2 X2= (0, 1, 1, 0) -23 X3= (1, 1, 1, 0) -167 X2= (0, 1, 1, 0) -23 X1= (0, 0, 0, 1) 2 X3= (1, 1, 1, 0) -167 X4= (1, 0, 1, 0) -79 X4= (1, 0, 1, 0) -79 X2= (0, 1, 1, 0) -23 X5= (1, 0, 0, 1) -62 X5= (1, 0, 0, 1) -62 X6= (1, 1, 0, 1) -142 X8= (0, 0, 1, 0) 1 X8= (0, 0, 1, 0) 1 Sum of fitness -229 Sum of fitness -489 Sum of fitness -16 Average level -45.8 Average level -48.9 Average level -3.2 Table 1: Comparison of fitness level. Definition: Let a is the average fitness level of the initial group, and ß is the average fitness level of the temporary group or the optimized group, ö = ß-a is the difference between a and ß , then optimization ö ß-a a a (2) According to the definition of optimization coefficient, the smaller the value of y, the weaker the optimization effect of clone selection algorithm on initial group. The larger the value of y, the stronger the optimization effect of clone selection algorithm on initial group. When y> 0 , the algorithm has positive optimization effect on the group, when y < 0 , the algorithm has negative optimization effect on the group, When y = 0, the algorithm has no optimization effect on the average level of the group. In the above table, the average fitness level of the initial group is -45.8, after clone selection operation, the average fitness level of the optimized group is -3.2. The optimization coefficient y between the initial group and the optimized group is 0.930131, the average fitness level of the initial group has been greatly improved by93.01%, and thus the optimization effect of clone selection algorithm proves to be remarkable. Different immune operations render different optimization effect on the group. In reproduction operation, individuals with higher level of fitness will be reproduced to obtain their majority in the group, and thus the scale of the whole group will be improved. The average level of fitness will also be improved with the increase of ideal individuals. In mutation operation, new individuals can not necessarily be those with relatively higher level of fitness, therefore, the average level of fitness can not necessarily be improved. On the contrary, the fitness level will most probably be reduced. Yet, mutation operation in the immune optimization algorithm maintains the diversity of the candidate group. The comparison of different clone operations can be made in the following table. Initial group Temporary group with reproduction Temporary group with mutation Optimized group Sum of fitness -229 -348 -489 -16 Avera ge level -45.8 -43.5 -48.9 -3.2 OC y 0.0502 -0.1241 0.9346 Table 2: Comparison of different optimization effect. In the above table, the average fitness level of the initial group is -45.8, after reproduction operation, the fitness level is -43.5, and the optimization coefficient y is 0.0502, the fitness level is improved by 5.02%. Yet, after mutation operation, the average fitness level is -48.9, the optimization coefficient y is -0.1241, the average fitness level is reduced by 12.41 %.With the discarding process, the scale of the group keeps stable, and the fitness level is also improved with the discarding of improper individuals with low level of fitness. The average fitness level increases from -48.9 to -3.2 with a prominent optimization coefficient y 0.9346, and the average fitness level is improved by 93.46%. 5 Comparison with other typical schemes In this section, the proposed artificial immune based optimization algorithm and the optimized signcryption scheme will be compared with other typical schemes, including the famous blind signature put forward by D.Chaum and the ECDSA signature algorithm, which has been accepted as standard elliptic curve algorithm in many international standardization organizations, such as ISO14888-3, ANSI X9.62, IEEE1363-2000, etc. 5.1 Comparison with blind signature of D.Chaum The signature algorithm for comparison in our scheme is based the original scheme put forward by D.Chaum and the security of the blind signature is based on elliptic curves cryptosystem. (1)System parameter F is a finite field ( q is a prime number of n bits, n > 190), an elliptic curve on this finite field is defined as the following. E : _y ^ ^ X ^ + ax + b (a, b e F, + 27b^(mod q) ^ 0). (3) P e E(F) is a base point whose order is a large prime number l . # E(Fq) denotes the order of the elliptic curve which has a factor of large prime number larger than 160 bits [18, 19, 20]. (P) is a function which makes the conversion from a point P = (X, y) on elliptic curve to x . In the blind signature scheme, user A requires B to generate a blind signature of his message m e Z* for him. ( K^ = k^P , k^ ), ( K^ = k^P , k^ ) are the public/private key pairs of A and B. In our scheme, the Hash function in signing algorithm is eliminated for simplicity, which can be easily added without loss of generality. (2)Message blinding Before generating signatures, the original user should blind the secret message with blinding parameters. Step1: As to message m e Z* , User A randomly selects parameter v e Z* and computes m'= vm(mod l) (4) V = v'P Then he sends m' and V to B. Step2: The blind signature generator B randomly selects r e Z^ and then computes R = rV ^ 0 (6) t = m'(R)X (mod l) (7) 5 = r - kj^t (mod l) (8) Then he sends (t, 5) to user A. (3)Signature generating After getting the partial signature ( t , 5 ), user A computes the following to get the blind signature. 5 ' = v '5(modl) (9) t = v't (mod l) (10) Then ( 5 ', t ' ) is the blind signature for message m e Z* generated by entrusted signer B. (4)Blind signature verifying t t After getting blind signature (5 , t ), the signature verifier can testify the signature with the public key of the entrusted signer B. R = 5 P +1K t'? = m(RX(modl) (11) (12) If the formula holds, the verifier will accept ( 5 , t ) as a valid blind signature of message m e Z* [21, 22]. Remark 1. As a comparison, in the traditional schemes with random parameter selection, the parameters are selected without any optimization, such as in the step of message blind protocol (4) - (8). In these steps, parameters r and v are generated randomly without any optimization or selection standards. Many insecure parameters or weak keys will be selected to insure the security of the scheme, which will make the cryptography system more vulnerable to malicious attack. While with the proposed signcryption optimized algorithm, many insecure parameters or weak keys will be discarded or undergo the mutation process because of their low level of fitness. 5.2 Comparison with ECDSA signature ECDSA signature scheme is as the following: (1)System parameter System parameters are the same as the above scheme, F e Z* is the private key, KA = kAP is the corresponding public key, H: {0,1}* ^ Z* is a secure one-way hash function. (2) Signing algorithm As to message m e Z*, the signer randomly selects parameter u e Z* and computes U = uP (13) e = H (m ), (14) (15) 5 - u 'ie + k(UX)(mod l) . a = (U , 5) is the signature text. (3)Verifying algorithm After getting signature a = (U , 5 ), the verifier can testify the signature with the public key of the signer. (16) W - 5 U - ew(mod I), (17) U - (UX w(mod l), (18) (UiP + U2K)^ ?= (U) , . (19) If the above formula is correct, the signature verifier will accept a = (U , 5) as a valid signature of message m e Z* [23, 24, 25]. Remark 2. Although ECDSA signature has been accepted as standard signature algorithm in elliptic curves, parameter u e Z*^ in signature generating is still generated randomly without any optimization or selection to avoid weak keys and insecure parameters. Compared with the proposed scheme with immune optimization in the paper, ECDSA is more vulnerable to malicious attack, such as signature forgery and attack on the secret key for signing. 5.3 Comparison of performance In this section, we will make a performance comparison between our immune optimized signcryption scheme and other traditional techniques, including the blind signature of D.Chaum and the ECDSA signature. To fulfil both the functions of encryption and signature as the proposed immune based blind signcryption, the above signature schemes must be improved with a secure symmetric encryption/decryption algorithm, for which the typical ElGamal encryption algorithm is selected with its simplicity and security. ElGamal public key encryption algorithm is as follows. (1)System parameter p is a large prime with binary length no less than 1024 such that p -1 has a large prime factor. G — Z* is a cyclic group under multiplication modulo p in which the discrete exponentiation function is conjectured to be one-way (meaning the discrete logarithm function is computationally hard) . g is the generator of group G ,meaning G = {g°■ , where I - G is the order (size) of G . Then, as to any x e Z* , the computation of y - gx via x and g is called discrete exponentiation function, which is computationally feasible; but the computation of x via y and g is called discrete logarithm problem (DLP), which is computationally infeasible. k e Z* is the private key, and K - gk is the public key. (2)Encryption algorithm As to message m e Z*p , the sender randomly selects r e Z*, and computes Ci - g' (mod p) , (20) ^ - mKr (mod p) . (21) Then (q, ^) is the cipher text. (3)Decryption algorithm C2 (cki)-1 - mKr (grk)-1 - mgrk (grk ) 1 = m (mod p). (22) In these schemes, such computing as modular exponential, modular inverse and elliptic curve addition ,elliptic curve scalar multiplication should be taken into comparison for computing complexity, while computing cost of modular addition, modular multiplication, hash, symmetric encryption/decryption are negligible. To ensure the security of basic cryptographic primitives, the minimum security parameters recommended for current practice are as follows: for DLP, |p|=1024bits, |^|=160bits. For RSA, |^|=1024bits; for ECC, |?|=131bits (79, 109 may also be chosen), |/|=160bits. The block length of the block cipher is 64bits. The length of secure hash function is 128bits. Scheme GC+ GK Sign VF Sum cost IO Length ofC Blind signature 1kP 2kP +3I 2kP 5kP +3I / 2 |l| ECDSA 1kP 1kP +1I 2kP+ 1I 4kP+2I / |l|+ \q\ GC+ GK EC DC Elgamal encryption 1E 2E 1E+1 I 4E+1I / 2 \p\ GC+ GK Sign and EC VF and DC Compound scheme 1 1kP+ 1E 2kP +2E+ 3I 2kP+ 1E+1 I 5kP+4 E+4I / 2 |l|+ 2 p| Compound scheme 2 1kP+ 1E 1kP +2E+ 1I 2kP+ 1E+2 I 4kP +4E+ 3I / |l|+ |q|+ 2 \P| GC+ GK SC USC Immune based blind signcryption 2kP 4kP +1I 1kP+ 1 I 7kP +2I N |E (■) |+2|h|+ 2|l| Table 3: Comparison of computing and communication cost. Notes of notations: 1. GC+GK denotes the common parameters and key generation algorithms; Sign/VF denotes the signature/verification algorithms; lO denotes immune optimization algorithm; EC/DC denotes encryption/decryption algorithm; SC denotes the signcryption algorithm; USC denotes the unsigncryption algorithm; Length of C denotes the length of signcryption text /cipher-text/signature. Compound scheme 1 is the scheme of blind signature+ Elgamal encryption; Compound scheme 2 is the scheme of ECDSA+ Elgamal encryption. 2. E denotes modular exponential; I denotes modular inverse; kP denotes scalar multiplication on elliptic curves. / denotes there is no relevant computation. 3. |E () |denotes the block length of block cipher. 4. N denotes negligible. In the above ECC and Elgamal based schemes, elliptic curve scalar multiplication kP and modular exponential ^ mod p are the most complex computations, so we will compare these two typical computations with the currently recommended security parameters: (1) Elliptic curve scalar multiplication kP, where P e E{ F,) , E is a non-supersingular curve, l» 160, k is a random160-bit integer. (2) Modular exponential ^ mod p , where p is a 1024-bit prime and k is a random160-bit integer. A field multiplication in F takes l2 (q= ) bit operations, then a modular multiplication in (2) takes (1024/160)2« 41times longer than a field multiplication in (1). Computation of kP by repeated doubling and adding on the average requires 160 elliptic curve doublings and 80 elliptic curve additions. From the addition formula for non-supersingular elliptic curves, an elliptic curve addition or doubling requires 1 field inversion and 2 field multiplications. The time to perform a field inversion is equivalent to that of 3 field multiplications. Hence, computing kP requires the equivalent of 1200 field multiplications, or 1024/41« 29 1024-bit modular multiplications. On the other hand, computing ^ mod p by repeated squaring and multiplying requires an average of 240 1024-bit modular multiplications. Thus, the operation in (1) can be expected to be about 240/29 « 8 times faster than the operation in (2) [26]. In the following table, the computation costs of the schemes are compared by the equivalence of kP, ak mod p and field inversion to field multiplication in F ( q= 2l ,|q|« 160bits). Scheme GC+GK Sign VF Sum cost IO Length ofC Blind signature 1200 2409 2400 6009 / 320bits ECDSA 1200 1203 2403 4806 / 291bits GC+GK EC DC Elgamal encryption 9840 19680 9881 39401 / 2048bits GC+GK Sign and EC VF and DC Compound scheme 1 11040 22089 12281 45410 / 2368bits Compound scheme 2 11040 20883 12284 44207 / 2339bits GC+GK SC USC Immune based blind signcryption 2400 4803 2403 9606 N 640bits Table 4: Comparison of computing and communication data. Remark 1. (Comparison with compound scheme 1). Based on the result of Koblitz and Menezes [26], the computing cost in parameter and key generation in our scheme is 2400/11040 ^ 1/5of that in compound scheme1; signcryption operation in ours is about 4803/22089 « 1/5 of that in scheme1, and unsigncryption is about 2403/12281« 1/5 of that in scheme1. To sum up, our scheme reduces about 1-9606/45410 « 78.9% commutating cost compared with compound scheme1. Remark 2. (Comparison with compound scheme 2). As per the result of [26], the computing cost in parameter and key generation in our scheme is 2400/11040 « 1/5of that in compound scheme2; signcryption operation in ours is about 4803/20883 « 1/5 of that in scheme2, and unsigncryption is about 2403/12284 « 1/5 of that in scheme2. To sum up, our scheme reduces about 19606/44207 ~ 78.3% commutating cost compared with compound scheme2. Remark 3. (Comparison of communication efficiency). The length of signcryption text in our scheme is 640/2368 « 1/4of that in compound scheme1 and 640/2339 « 1/4of that in compound scheme2; our scheme reduces about 1-640/2368 « 73% communication cost compared with compound scheme1and reduces about 1640/2339 « 72.6% communication cost compared with compound scheme2. Remark 4. Furthermore, the immune based optimization algorithm in our blind signcryption scheme is an algorithm of polynomial time complexity which can be neglected in the comparison of computation and communication efficiency. For specific application systems, the optimization algorithm can be executed in advance without any influence to the efficiency and designed as a separate computing unite which provide optimization service to other function units, such as encryption, signature, authentication, etc. Therefore, the proposed cryptography optimization algorithm and the blind signcryption scheme prove to be more efficient and applicable to many security schemes in resource-restricted environment. 6 Conclusions This paper studies the unique properties of biological immune system and optimization application in cryptography system. In the scheme, we introduce clone selection optimization into the design and analyzing of cryptography schemes, and put forward an improved signcryption scheme with artificial immune optimization. In the scheme, parameters with high level of security and fitness are selected as candidate individuals, and those with security problem or low level of fitness are rejected. On this basis, the final selection of parameters can be made with random mode. Thus the scheme avoids the security problems of other cryptography scheme and reinforces its stability, adaptability and robustness. Acknowledgement The authors should thank the anonymous reviewers for their constructive advice and comments to the paper, with which we can improve our work clerically and academically. References [1] Han K H, Park K H. Parallel Quantum-inspired Genetic Algorithm for Combinatorial Optimization Problems, Proceedings of the CEC. Piscataway: IEEE Press, 2001:1442-1429. [2] Xuanwu Zhou, Ping Wei,etc. Study on Proxy Signature Schemes with Bionic Optimization[C]. Proceedings of FITME'2009, IEEE Press. 2009, (Vol.3)365-368. [3] D.W. Matolak, and B. Wang. Efficient Statistical Parallel Interference Cancellation for DS-CDMA in Rayleigh Fading Channels. IEEE Transactions On Wireless Communications, vol. 6, no. 2, pp.566-574, February 2007. [4] Alexandra Boldyreva,Adriana Palacio,Bogdan Warinschi. Secure Proxy Signature Schemes for Delegation of Signing Rights [J]. Journal of Cryptology. 2012, 25(1): 57-115. [5] Yong Yu,Yi Mu,Willy Susilo,ect. Provably secure proxy signature scheme from factorization[J]. Mathematical and Computer Modelling. 2012, 55(3-4): 1160-1168. [6] Emura Keita,Miyaji Atsuko,Rahman Mohammad Shahriar. Dynamic attribute-based signcryption without random oracles[J]. International Journal of Applied Cryptography. 2012, 2(32): 199-211. [7] Seung Hyun Seo;Kyu Young Choi;Jung Yeon Hwang;Seungjoo Kim. Efficient certificateless proxy signature scheme with provable security [J]. Information Sciences. 2012, 188: 322-337. [8] Degabriele Paul,Paterson Kenny,Watson Gaven. Provable Security in the Real World[J]. IEEE Security & Privacy. 2011, 9(3): 33-41. [9] Harendra Singh,Girraj Kumar Verma. ID-based proxy signature scheme with message recovery[J].Journal of Systems and Software. 2012, 85(1): 209-214. [10] Xuanwu Zhou, Zhigang Jin, etc. Short Signcryption Scheme for the Internet of Things [J]. Informatica. Vol.35 (4) 521-530, 2011. [11] Han Yu Lin;Chien Lung Hsu;Shih Kun Huang. Improved convertible authenticated encryption scheme with provable security[J]. Information Processing Letters. 2011, 111(13): 661-666. [12] Tzong-Sun Wu;Han-Yu Lin;Pei-Yih Ting. A publicly verifiable PCAE scheme for confidential applications with proxydelegation[J].European Transactions on Telecommunications. 2012, 23(2): 172-185. [13] Zhang Chuanrong. Zhang Yuqing . Li Fageng and Xiao Hong.: New Signcryption Algorithm for Secure Communication of ad hoc Networks. Journal of Communications, 2010, 31(3): 19-24. [14] Han K H,Kim J H.Quantum-inspired Evolutionary Algorithms with a New Termination Criterion, He Gate, and two-phase Scheme. IEEE Transactions on Evolutionary Computation, 2004, 8(2):156-169. [15] Gu Jingiing,Chen Songcan,Zhuang Yi. Wireless Sensor Networks-Based Topology Structure for the Internet of Things Location [J].Chinese Journal of Computer . 2010, 33(9): 1548-1556. [16] Zhu Hongbo,Yang Longxiang,Yu Quan.Investigation of Technical Thought and Application Strategy for the Internet of Things [J]. Journal of Communication. 2010,31(11):2-9. [17] Haipeng Zhang, Mitsuo Gen. Effective Genetic Approach for Optimizing Advanced Planning and Scheduling in Flexible Manufacturing System.GECCO'06, July 8-12, 2006, Seattle, Washington, USA. [18] Z Luo, M Zhao, S Liu,ect. Generalized Parallel Interference Cancellation With Near-Optimal Detection Performance[J].IEEE Transactions On Signal Processing. 2008, 56(1): 304-312. [19] Xuanwu Zhou. Elliptic Curves Cryptosystem Based Electronic Cash Scheme with Parameter Optimization [C]. Proceedings of KESE'2009, IEEE Press. 2009, 182-185. [20] S Manohar, V Tikiya, R Annavajjala,etc. BER Optimal Linear Parallel Interference Cancellation for Multicarrier DSCDMA in Rayleigh Fading [J]. IEEE Transactions On Communications. 2007, 55(6): 1253-1265. [21] Blundo C, Desantis A. Perfectly Secure Key Distribution for Dynamic Conferences. Advances in Cryptology-Crypto'92. New York: Springer-Verlag, 1993, 471-486. [22] Keita Emura,Atsuko Miyaji,Mohammad Shahriar Rahman. Dynamic Attribute-based Signcryption without Random Oracles[J].International Journal of Applied Cryptography,2012,2(3):199-211. [23] Xu Peng,Cui Guohua,Lei Fengfu, Tang Xueming, Chen Jing. An Efficient and Provably Secure IBE Scheme Under the Standard Model[J].Chinese Journal of Computer . 2010, 33(2): 335-1556. [24] Xuanwu Zhou. Elliptic Curves Cryptosystem Based Electronic Cash Scheme with Parameter Optimization[C]. Proceedings of KESE'2009, IEEE Press. 2009, 182-185. [25] Kim Y K,Park K,Ko J.A symbiotic evolutionary algorithm for the integration of process planning and job shop scheduling. Computers and Operations Research.2003, 30:1151- 1171. [26] Koblitz N, Menezes A and Vanstone S. The State of Elliptic Curve Cryptography [J]. Designs, Codes and Cryptography, 2000, 30(19): 173-193. Bilinear Grid Search Strategy Based Support Vector Machines Learning Method Li Lin, Zhang Xiaolong, Zhang Kai and Liu Jun School of Computer Science and Technology, Wuhan University of Science and Technology, China Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, China E-mail: lilin@wust.edu.cn Keywords: support vector machines, model selection, parameters optimization, protein structure prediction Received: October 21, 2013 Support Vector Machines (SVM) learning can be used to construct classification models of high accuracy. However, the performance of SVM learning should be improved. This paper proposes a bilinear grid search method to achieve higher computation efficiency in choosing kernel parameters (C, Y) of SVM with RBF kernel. Experiments show that the proposed method retains the advantages of a small number of training SVMs of bilinear search and the high prediction accuracy of grid search. It has been proved that bilinear grid search method (BGSM) is an effective way to train SVM with RBF kernel. With the application of BGSM, the protein secondary structure prediction can obtain a better learning accuracy compared with other related algorithms. Povzetek: Razvita je nova metoda iskanja parametrov za metodo SVM. 1 Introduction Support Vector Machines (SVM) is a new machine learning method based on statistical learning theory and structural risk minimization [1-3]. The core function of SVM identifies the maximal margin hyperplane and a set of linearly separable data, classifies data correctly, so as to maximize the minimum distance between data and the hyperplane. A number of recent studies on SVM attempt to explore simple and efficient methods to solve the problem of maximal margin hyperplane [4-6]. Many of these works study the performance of SVM learning [7-9]. Several kernel functions can be used in SVM, such as linear function, polynomial function, RBF function, Gaussian function, MLPs with one hidden layer and spline. SVM is used to construct accurate classification models and has been widely applied, such as in handwritten character recognition, web page/text automatic classification, gene analysis and so on [10]. However, there is still no widely accepted way of selecting kernel function and its parameters in SVM learning. The selection of parameters for SVM algorithms usually depends on large-scale search. SVM learning is a kind of quadratic programming (QP) problem. Despite its advantages, there are a number of drawbacks in selecting hyperparameters in the size of matrix involved in the QP problem. Therefore, this paper proposes a bilinear grid search method to compute the penalty parameter and the kernel parameter (C, y) of SVM using RBF kernel. This method is efficient in reducing the training space in QP. Bilinear grid search algorithm has the advantages of both bilinear search and grid search. The proposed algorithm expands the search range of (C, y) so that it can perform SVM learning with a small size of training samples to construct classification models with high accuracy. The rest of the paper is structured as follows: Section 2 introduces SVM learning and relevant search strategies; Section 3 proposes bilinear grid search method in SVM learning with RBF kernel; In section 4, we conduct experiments to test the efficiency and applicability of the proposed algorithm; Finally, Section 5 is devoted to concluding remarks and future research recommendations. 2 Search strategy for SVM learning SVM classification can be described as: Given: - A training set of instance-label pairs (xi, yi), i = 1,...,l, where x, e R" andy e{l,-l}' . Find: - The solution to the minimum value of 1 T ' — wT w + C T^i 2 i;:ii, where y (w^Z + b) > 1 , > 0, i = Here, the training vector xi are mapped into a higher - (probably infinite-) dimensional space using function ^ as Zi =^(xi) ; C(C >0) is the penalty parameter of the error term. Usually, the formation (1) can be considered as the following dual problem: • Minimize F = ^ a^Qa- e^c 0 0 . The RBF kernel nonlinearly maps the training data into a higher dimensional space, so it can handle non linear relation between the class labels and the attributes. Keerthi and Lin [10] prove that a linear kernel with a penalty parameter C has the same performance as the RBF kernel with (C,/) (C is the penalty parameter, / is the kernel parameter). In addition, the application of sigmoid kernel in SVM learning and the similar parameters to RBF kernel are given by [9]. It is known that the number of hyperparameters influences the complexity of model selection. In RBF kernel, 0 < K ^ 1 . However, for polynomial kernels, 1 means its value is pC' xj +r> 1 is the opposite. The authors of there are two cases: infinite; o the best parameter {C,r) = (1.0,0.03125), and its classification accuracy is 70.8123%. sequence to process; multiple alignment : SH3 ■■■F Y D N L Q Q Y L N" al i gnl ...p Y D N L Q Q Y L N" align2 -Y F S A L R H Y I N" ali gnS Y T S L R H Y L N" al isn4 -Y A A E L R R Y I N" 1 V 0 0 0 0 0 0 0 0 0 0 2 L 0 0 0 0 100 0 0 0 60 0 3 I 0 0 0 n 0 0 0 0 40 0 4 m 0 0 0 0 0 0 0 0 0 0 5 F 40 20 0 0 0 0 0 0 0 0 6 W 0 0 0 0 0 0 0 0 0 0 T Y 60 60 0 0 0 0 0 100 0 0 8 G 0 0 0 0 0 0 0 0 0 0 9 A 0 20 20 20 0 0 0 0 0 0 10 P 0 0 0 0 0 0 0 0 0 0 11 S 0 0 20 20 0 0 0 0 0 0 12 T 0 0 20 0 0 0 0 0 0 0 13 C 0 0 0 0 0 0 0 0 0 0 14 H 0 0 0 0 0 0 40 0 0 0 15 R 0 0 0 0 0 60 20 0 0 0 16 K 0 0 0 0 0 0 0 0 0 0 17 Q 0 0 0 0 0 40 40 0 0 0 18 E 0 0 0 20 0 0 0 0 0 0 19 N 0 0 0 40 0 0 0 0 0 100 20 D 0 0 40 0 0 0 0 0 0 0 a output ■ 2 S=0. 2 E=0. 2 N=0. 4 Figure 1: An example of using evolutionary information for coding secondary structure. F^" C:\Py tli on2: tho n. e xe HQ [local] -1 .0 -6 .5 69 .2418 (best c =0 .707106781187, g =0 .03125, rate =70 3408) [local] -1 .0 -4 .0 65 .6866 (best c =0 .707106781187, g =0 .03125, rate =70 3408) [local] -2 .0 -5 .5 68 .9589 (best C =0 .707106781187, g =0 .03125, rate =70 3408) [local] -3 .0 -5 .5 67 .2243 (best C =0 .707106781187, g =0 .03125, rate =70 3408) [local] -0 .5 -5 .5 70 .1562 (best C =0 .707106781187, 3 =0 .03125, rate =70 3408) [local] -3 .5 -5 .5 65 .1823 (best C =0 .707106781187, g =0 .03125, rate =70 3408) [local] -1 .0 -5 .5 69 .8487 (best c =0 .707106781187, g =0 .03125, rate =70 3408) [local] -2 .5 -5 .0 67 .3597 (best C =0 .707106781187, g =0 .03125, rate =70 3408) [local] -2 .5 -6 .0 68 .4914 (best c =0 .707106781187, g =0 .03125, rate =70 3408) [local] -2 .5 -3 .5 51 .2691 (best C =0 .707106781187, g =0 .03125, rate =70 3408) [local] -2 .5 -6 .5 68 .2577 (best c =0 .707106781187, g =0 .03125, rate =70 3408) [local] -2 .5 -4 .0 58 .2482 (best c =0 .707106781187, g =0 .03125, rate =70 3408) [local] -2 .5 -5 .5 68 .3069 (best c =0 .707106781187, g =0 .03125, rate =70, .3408) [local] -2 .0 -3 .0 48 .1609 (best c =0 .707106781187, g =0 .03125, rate =70, .3408) [local] -3 .0 -3 .0 46 .9594 (best c =0 .707106781187, g =0 .03125, rate =70 .3408) [local] -0 .5 -3 .0 54 .2912 (best c =0 .707106781187, g =0 .03125, rate =70, .3408) [local] -3 .5 -3 .0 46 .5904 (best c =0 .707106781187, g =0 .03125, rate =70 .3408) [local] -1 .0 -3 .0 51 .3101 (best c =0 .707106781187, g =0 .03125, rate =70, .3408) [local] -2 .5 -3 .0 47 .4146 (best c =0 .707106781187, g =0 .03125, rate =70 .3408) [local] 0.1 a - -5. 0 70. 8123 (best c = 1.0, 9=0.03125, rate =70.8123) [local] 0.1 a - -6. B 70. 0947 (best c = 1. a, g=0.03125, rate =70.8123) [local] 0.1 a - -3. 5 65. 5677 (best i: 1.0, 9=0.03125, rate =70.8123) [local] 0.1 a - -6. 5 69. 5822 (best c = 1. a, g=0.03125, rate =70.8123) mxm i: Figure 2: the result chart of command-line. Table 6 lists the accuracy of different methods on RS130 data set. The average accuracy for bilinear grid search method is 70.8%, which is competitive compared with those methods proposed by Rost, Sander and Jung-Ying Wang. The average accuracy for the method of Rost and Sander [17] is 68.2% , which employs neural networks for encoding. Other techniques must be incorporated in order to increase accuracy to 70.0%. Jung-Ying Wang utilizes basic SVM to [18] obtain 70.5% of the accuracy . The experiment used the same data set (including the type of alignment profiles) and secondary structure Figure 3: The result chart of contour. definition (reduction from eight to three secondary structures) as those employed by Rost and Sander [17], and Jung-Ying Wang [18]. The same accuracy assessment of the prior ones is used as well, so as to ensure the fairness of comparison. With the application of BGSM, the protein secondary structure prediction also obtains better learning accuracy compared with other algorithms. Different methods Secondary structure prediction accuracy % Neural networks 68.2 Neural networks incorporated With other techniques 70.0 SVM 70.5 SVM with bilinear grid Search method 70.8 Table 6: Comparison of different methods' accuracy on RS130 data set. 6 Conclusion In this paper, we demonstrate an approach of optimization in SVM learning. The proposed bilinear grid search method can effectively improve learning performance and enhance the accuracy of prediction. A comparison has been made between grid search method, bilinear search method and bilinear grid search method when selecting optimal parameters for RBF kernel. Experiment results prove that the proposed algorithm retains the advantages of both bilinear search method and grid search method. Acknowledgement This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No.61273225, No.61100055 and No.31201121, the Natural Science Foundation of Hubei Province under Grant No. 2011CDB233. Reference [1] Vladimir N. Vapnik (1998). Statistical learning theory. J. Wiley & Sons, New York. [2] C. Cortes, Vladimir N. Vapnik (1995). Support vector networks. Machine Learning, Vol.20, No.3, pp.273-297. [3] Vladimir N. Vapnik (2000). The Nature of Statistical Learning Theory (Second Edition). Springer Press. [4] B Schölkopf, AJ Smola (2002). Learning with kernels. MIT Press. [5] Kai Zhang, Tsang, I.W. , Kwok, J.T.(2009). Maximum margin clustering made practical. IEEE Transactions on Neural Networks, Vol.20 , No.4, pp. 583 - 596. [6] E Blanzieri, F Melgani (2008). Nearest neighbor classification of remote sensing images with the maximal margin principle. IEEE transaction on Geoscience and Remote Sensing, Vol.46, No.6, pp.1804-1811. [7] GB Huang, QY Zhu, CK Siew (2006). Extreme learning machine: theory and applications. Neurocomputing. [8] S. Fine (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, Vol.2, pp.243-264. [9] K. M. Lin and C. J. Lin(2003), A Study on reduced support machines. IEEE Trans. on Neural Computation, Vol.14, No.6, pp. 1449-1559. [10] Li. Lin and Zhang Xiaolong (2005). Optimization of SVM with RBF Kernel. Computer Engineering and Applications(in Chinese), Vol.29, No. 10, pp.190-193. [11] S. S. Keerthi, C. J. Lin(2003). Asymptotic behaviours of support vector machines with Gaussian kernel. Neural Computation, No.5:1667-1689. [12] O. Chapelle, V. Vapnik et al(2002). Choosing multiple parameters for support vector machines. Machine Learning, Vol.46, pp.131-159. [13] P. Wang, X. Zhu(2003). Model Selection of SVM with RBF Kernel and its Application. Computer Engineering and Applications(in Chinese), Vol.24, pp.72-73. [14] H. T. Lin, C. J. Lin(2003), A Study on Sigmoid kernels for SVM and the training of Non-PSD kernels by SMO-type methods. Technical Report, National Taiwan University. [15] Blake C., Merz C.(2013), UCI Repository of Machine Learning Databases. http://www.ics.uci.edu /mlearn/MLRepository.html, Dept. of Information and Computer Science, University of California. [16] C. C. Chang, C. J. Lin (2013). LIBSVM: A library for support vector machines. Software Available on-line at: http:// www.csie.ntu.edu.tw/~cjlin /libsvm /index.html . [17] B. Rost and C. Sander(1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, Vol.23, No.2, pp.584-599. [18] Jung-Ying Wang(2002). Application of Support Vector Machines in Bioinformatics. Taipei: Department of Computer Science and Information Engineering, National Taiwan University. [19] http://www.cmbi.kun.nl/gv/hssp. [20] W. Kabsch and C. Sander(1983). Dictionary of protein secondary structure: Pat-tern recognition of hydrogen-bonded and geometrical features. Biopolymers, Vol.22, No.12, pp.2577-2637. [21] J. A. Cuff and G. J. Barton(1999). Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Struct. Funct. Genet., Vol.34, pp.508-519. An Imperative of a Poorly Recognized Existential Risk: Early Socialization of Smart Young Generation in Information Society Vladimir A. Fomichov Faculty of Business Informatics, National Research University Higher School of Economics Kirpichnaya str. 33, 105187 Moscow, Russia; E-mail: vfomichov@hse.ru; Web: http://www.hse.ru/en/org/persons/67739 Olga S. Fomichova State Educational Centre "Dialogue of Sciences", Universitetsky prospect 5, 119296 Moscow, Russia E-mail: vfomichov@gmail.com Keywords: misuse of information and communication technologies, early socialization of children, positive psychology movement, informational-aesthetic conception of developing cognitive-emotional sphere, theory of dynamic conceptual mappings, methods of emotional-imaginative teaching, levels of consciousness expanded model, system of values, cognitive engagement Received: January 28, 2014 The birth of this paper was motivated by the foundation in July 2013 of the Centre for the Study of Existential Risk (CSER) at the University of Cambridg;e (UK). The first aim of the paper is to attract the attention of the researchers and educators throughout the world to a poorly recognized kind of existential risk: it is the dangerous consequences of misusing the information and communication technologies by adolescents in the content of permanently increasing intelligent capabilities of computers and an easy access to Internet. The second, principal aim of this paper is to propose a constructive way out: it is earlier than usually socialization of children in order to inscribe a deep awareness of social responsibility into the conceptual picture of the child and adolescent and to enable him/her to analyse the consequences of the fulfilled actions. The proposed way out is elaborated under the framework of a new scientific discipline called cognitonics. This way is provided by the System of the Met^hods of Emotional-Imaginative Teaching (the EIT-system). The constructive core of this paper consists of two partes. The first part considerably expands the Level of Consciousness model proposed by P.D. Zelazo in 2004. It considers fours levels of the development of conscious control of thought, emotion, and action and covers the child's age from one to four years. Our model is based on the EIT-system and introduces three additional levels, where the seventh level is called the level of enhanced awareness of social agreements and social responsibility. Our model coders the ages from five - six to 13 - 14 years. The second part of this paper's constructive core presents a new look at the process of education when the values of the student act like a lighthouse for the teacher at the moment of presenting material and arranging the process of education, the process of acquiring knowledge. Povzetek: Kako socializirati mlado generacijo v informacijski družbi? 1 Introduction The technical characteristics of computers have been the humans. The starting point for V. Vinge was the improving since their birth and their intelligent ideas initially formulated by John von Neumann and I.J. capabilities since the end of the 1960s, when the Good [18]. Later the term "the singularity" was scientific-technical field "artificial intelligence" was popularized by R. Kurzweil [22], an inventor and born. Now computer systems are able to understand a futurist. broad range of texts in natural language (English, The achievements obtained during last decades on Russian, Chinese, Japanese, etc.), to recognize visual the way of designing intelligent computer systems and in images, and to do a lot of other things. several other fields, in particular, in genetics, The considerations of the kind became the reason for biotechnology, nanotechnology, became the reasons for introducing in 1993 the notion of singularity by V. Vinge introducing the notion of existential risk and for founding [30], a scientific fiction writer. It is the moment when in July 2013 the Centre for the Study of Existential Risk intelligent capabilities of computer systems will exceed (CSER) [3] as a research centre at the University of the capabilities of human beings, and the further Cambridge (UK). Now CSER is hosted within the development of computer systems will be determined by Cambridge's Centre for Research in the Arts, Social the needs of the global family of computers but not by Sciences, and Humanities. The goal of CSER is to study possible catastrophic threats caused by the existing or future technology. The co-founders of CSER are Dr. Huw Price (Bertrand Russell Professor of Philosophy, Cambridge), Dr. Marteen Rees (an Emeritus Professor of Cosmology and Astrophysics, Cambridge and former President of the Royal Society), and Jaan Tallinn (a computer programmer and a co-founder of Skype) [3, 20, 23]. Professor Price expressed the opinion that "sometime in this or the next century intelligence will escape from the constraints of biology". In this case "we're no longer the smartest things around, and may be at the mercy of "the machines that are not malicious, but machines whose interests don't include us" [20]. According to Price, many people consider his concerns as far-fetched. However, since the risks are very serious and we don't know the time parameters, it is necessary to put the problem into the focus of attention of international scientific community [20]. We completely agree with this opinion. However, we believe that there exists at least one additional kind of existential risk, and it is poorly recognized by international scientific community. This practically unnoticed global problem is the expanding negative consequences of misusing information and communication technologies (ICT) by a part of smart young generation. During last decade, one has been able to observe numerous cases when the hackers-teenagers managed to cause a very considerable damage to significant social objects and even military objects. Let's consider a risk that can emerge many years before the singularity. We will take into account the well known fact: the technical characteristics of computers double every two years. Imagine that one or several decades later a person with high computer skills (may be, an adolescent or a group of adolescents) will pose a socially dangerous task to a multi-agent system consisting of the future computers with very high intelligent capabilities. Then the damage from achieving the formulated goal may be very high, even immensely high. In other words, it is easy to assume that a person (regardless the age, spiritual maturity which includes the developed feeling of responsibility, intelligence maturity, which suggests the improved cognitive mechanisms of information processing) will be able to take power and influence the life of community, society, people all over the world due to his/her well-improved skills of using various ICT. It may happen even with school children, because every new generation born in the information society (IS) is much more skilful than the previous one, and they have much more time to improve their skills, because since the early childhood it is as usual as walk and talk for all the children. On the other hand, the curiosity and strong aspiration to discover the digital world are underpinned by the common (for their age) desire to emulate grown-ups and become as smart and powerful as grown-ups are, or even much smarter and much more powerful in comparison with the people belonging to previous generations. Even nowadays the teachers in various countries complain that school children are smarter and more skilful as they are. It discourages the teachers and makes the relationships with school children of the kind much more complicated. In various countries throughout the world, there is an age requirement for allowing an adolescent to drive a car. Obviously, the reason is dramatic consequences both for other people and for the young person in case a socially immature or a technically insufficiently qualified person will drive a car. In the modern IS, a considerable part of teenagers possesses very high computer skills, and they have access to Internet and its immense technical possibilities. But very often these teenagers are socially rather immature. For instance, UK is the country where the term "screenager" (instead of "teenager") was born [25]. It means that very many teenagers in UK spend much more time for the communication with computers than with people. This fact allows us to conjecture that very many teenagers in UK possess high computer skills. However, the psychologists discovered in 2013 that many boys and girls at the age from 18 until approximately 25 years are rather socially immature (see Figure 1) and should be treated by the psychologists as teenagers. A part of university students living in one home with their parents considerably increased. The parents were recommended to increase the socialization of their children - students by means of asking them to wash their dresses, to pay various receipts, etc. [31]. Taking this situation into account, we may imagine the cases of misusing ICT not due to any bad intention but because the consequences have not been thought over in detail. The problem looks like an iceberg, and the humans in general way may become the passengers of "Titannik", because they don't expert an iceberg on the way. This paper continues the line of the articles [9, 10, 15]. Metaphorically speaking, the aim of this series of publications and of this paper is to propose the kind and the parameters of a manoeuvre preventing the collision of our information society with the iceberg of described sort. This manoeuvre is much earlier socialization of children than it is done now throughout the world; that is, it is a way of early and deep inscription of the notion "responsibility" into the child's conceptual picture. S O C I E T Y Computer and its abilities Computer as a tool teenager Computer as a replacement of the reality Figure 1: Screenagers and society Our key idea is to inscribe into the world's conceptual picture of the child a deep awareness of social agreements and the feeling of social responsibility before the transition age 11 - 12 years (before the age of conflicts and evoking sexuality). The proposed way of early children's socialization has been elaborated under the framework of Cognitonics [8-12]. It is a new scientific discipline, its first aim is explicating the distortions in the development of the personality and national cultures caused by the peculiarities of information society and globalization. The second (principal) aim of Cognitonics is coping with these distortions in different fields by means of elaborating systemic solutions for compensating the negative implications for the personality and society of the stormy development of ICT and globalization processes, in particular, for creating cognitive-cultural preconditions of the harmonic development of the personality in the information society and for ensuring the successive development of national cultures and national languages. The goals and ideas of Cognitonics have evoked a vivid interest of the scholars from over 20 countries located in Africa, Asia, Europe, North America and South America. Due to this interest, three successful international scientific conferences on Cognitonics were organized in the years 2009, 2011, 2013 under the framework of the international scientific multiconference "Information Society" (Slovenia, Ljubljana, Jozef Stefan Institute) [1, 2, 16]. The constructive core of this paper consists of two parts. The first part (Sections 2-6) considerably expands the Level of Consciousness model proposed by P. D. Zelazo in 2004 [32]. It considers fours levels of the development of conscious control of thought, emotion, and action and covers the child's age from one to four years. Our model introduces three additional levels, when the seventh level is called the level of enhanced awareness of social agreements and social responsibility. Our model covers the ages from five - six to 13 - 14 years. The second part (Sections 7 - 11) of this paper's constructive core presents a new look at the process of education when the values of the student act like a lighthouse for the teacher at the moment of presenting material and arranging the process of education, the process of acquiring knowledge. Four discoveries underpinning the proposed complex method of early socialization of children in modern IS are described. The first discovery is the fundamental conclusion that young children and adolescents can be attributed to one of two groups (children with preponderance of material values and children with preponderance of sublime values), and different methods of teaching should be developed and used for achieving educational success for each of these two groups. The second discovery is an original method of splitting young children in two groups of mentioned kinds. The third discovery is two developed different practical approaches to teaching allowing to achieve educational success for each of two groups. The fourth discovery is the proposed notion of cognitive engagement and original methods enabling a teacher to successfully reach the goals of teaching in each of two groups by means of realizing cognitive engagement of students at lessons. As a result, a new psychological and educational paradigm is presented. 2 Central ideas of positive psychology movement Let's consider a broad scientific context being most appropriate for stating and assessing our original, many-staged method of early children's socialization. During the 1990s, it was possible to observe the steady growth of the number of children at school age in the developed countries encountering various social, emotional, and behavioral problems. Numerous observations provide the possibility to conjecture that, to a large extent, it was a consequence of more intensive interaction with computers at lessons and at home and of stormy Internet's expansion. Besides, the criminal films and horror films continued to negatively influence the mental state of very many children and adolescents, in particular, causing anxiety and aggression. These negative shifts became sufficiently noticeable by the beginning of the 2000s. According to [28], approximately one fifth of children and adolescents experienced problems showing their need for mental health services. One of the consequences of this conclusion was the increased attention of the scholars to clarifying the extent of exposure to and use of media and electronic technology by very young children. A large-scale study described in [29] showed, in particular, the following alarming facts: (a) 27% of 5-6-year-old children used a computer during 50 minutes on average on a typical day; (b) more than one third of 3- to 6-year olds also have a television in their bedroom; 54% adults said that it frees up other TV in the house, that is why other family members can watch their own shows, 38% of adults indicated that it keeps the child occupied, so the parents can do things around the house. As a principal way out in the current situation with mental health of the young generation, many psychologists indicated the importance of promoting children's social and emotional experience in schools. As a consequence, a new paradigmatic shift was observed in psychology: a shift from the accent on repairing weakness to the enhancement of positive qualities and preventing the problems before the moment when these problems arise [27]. As a result, the positive psychology movement was born, the principal objective of this movement is studying the positive features of humans development, in particular, investigating such significant traits of the person as "subjective well-being, optimism, happiness, and self-determination" [27, p. 9]. As a logical consequence, the task of promoting positive emotions in children and adolescents was posed [19]. The evidence obtained in the 2000s shows that a critical role in the success of children in school and in their social and emotional competence is played by self-regulation, in particular, by controlling attention and inhibiting aggressive reactions. The publications on positive psychology allow to distinguish a factor being beneficial to well-being, this factor is called mindfulness [21]. According to the definition given in [24], it is a way of directing attention. Generalizing a number of available definitions of this concept, mindfulness can be characterized as the ability to maximally proceed from the context while taking decisions in any situations. It is the ability of paying attention to many details while elaborating a decision but not only "mechanically" following a number of prescribed rules, etc. 3 The key role of broad beauty appreciation The analysis of scientific literature provides weighty grounds for concluding that the first educational system satisfying the criteria of a mindfulness-based program was born and well tested several years before the emergence of the term "mindfulness-based educational program". Such criteria are satisfied by the system of the methods of emotional-imaginative teaching (the EIT-system). The core of the EIT-system was elaborated by O.S. Fomichova in the first half of the 1990s and has been expanded in the second half of the 1990s and in the 2000s. This system is underpinned by our Theory of Dynamic Conceptual Mappings (the DCM-theory). This theory is stated in numerous publications both in English and Russian, in particular, in [5 - 15]. Both the DCM-theory and the EIT-methods form a principal part of the cognitonics constructive core. The main component of the DCM-theory is an original informational-aesthetic conception of developing the cognitive-emotional sphere of the learners: young children, adolescents, and university students [6, 7, 9, 10]. On the one hand, this conception says that it is important to actively develop a broad spectrum of the learners' information processing skills. On the other hand, our conception has a number of original features. First of all, it is the idea of the necessity of inscribing, in a systemic way, the feeling of beauty into the world's conceptual picture of the child. Proceeding from our experience accumulated during 23 years, we consider the following educational processes as the principal instruments of achieving this goal: (a) early support and development of figurative (or metaphoric) reasoning; (b) teaching young children (at the age of 5 - 6) very beautiful language constructions for expressing the impressions from the nature; (c) a unified symbolic approach to teaching natural language (mother tongue and a foreign language), the language of painting, and the language of dance [6 - 15]. The next central idea is the conclusion about the necessity of passing ahead the development of soul in comparison with the development of reasoning skills. A well-developed feeling of beauty plays an especially significant role in the realization of this idea. Besides, it is very important to be aware of the fact that children should have enough time for the development of soul: the time for contemplation, for imbibing the beauty of the nature, etc., i.e. children should have time for self-paced activity [7, 14]. Much more information about our informational-aesthetic conception of developing the cognitive-emotional sphere of the learners can be found in [6, 7, 9, 10, 13, 14]. For the realization of these ideas, an interdisciplinary educational program has been developed by O.S. Fomichova. The elaborated program is intended for teaching children during twelve years, where the starting age is five to six years. The program has been personally tested in Moscow with permanent success by O.S. Fomichova over a period of 24 years. The total number of successfully taught students (young children and adolescents) exceeds nine hundred. The program is implemented at extra-scholastic lessons of a foreign language (English), literature and poetry in mother tongue and second language, symbolic language of painting, communication culture, and classical dance. All these lessons are the links of one twelve-year-long educational chain. More details about the composition of the program can be found in [9 - 15]. A considerable role in the success of the educational program play regular (every semester) performances - a form of demonstrating knowledge and skills acquired during the current semester. The scripts for the performances are original and take into account both the learned materials and the individuality of each student. All personages in the performances are positive, no one negative. During each semester, all young students and teenagers master new elements of classical dance and train the known elements. Each performance includes singing world known songs in English, e.g., "White Christmas" and "Let It Snow" in case of winter performances. The highest form for demonstrating the acquired communication culture and culture of classical dance is the Christmas and Easter Big Balls, including the most part of students at the age from seven to twenty. 4 An environment of conceptual learning One of the distinguishing features of our approach to this problem is that it is realized at lessons of a foreign language (FL) - English, where the mother tongue of children is Russian. The use of original analogies (being the parts of fairy-tales and thrilling stories) for teaching the English alphabet, the rules of reading, and the basic rules of English grammar contributes to developing associative abilities of children at the age of 5 - 6. The EIT-system provides an environment of conceptual learning instead of a memorization-based one. In particular, it is the principal distinguished feature of the developed original approach to teaching FL as an instrument of thinking. The interesting stories about the life of verbs and other words (see, in particular, [5, 7, 8]) establish in the consciousness of the young child a mapping from the objects and situations of the real life or fairy-tale life to the domain of language entities (letters, sounds, verbs, nouns, pronouns, etc.). That is why the consciousness of the young child receives a considerable impulse to developing the ability to establish diverse analogies. The other reason for using the lessons of FL is that (as a 24-year-long experience has shown) young children easier learn beautiful language constructions for describing the impressions from the nature than the equivalent constructions in mother tongue (see [6, 7]). The explanation of this phenomenon is that in the first case children don't feel any contradiction with the everyday use of language. Example. Let's consider a fragment from the home composition "The Winter Day", it was written in English by an eight-year-old Russian speaking student Polina of the third year of studies in experimental groups: THE KINGDOM OF I^HE WINTER One winter day I was sitting near the window looking at the street covered with fresh clean snow. At first time, there was nothing so remarkable in that. Nor did I think it so very much out of the way to see that falling snowflakes, snow storm, the grey cloudy sky and the noisy crows. But when afterwards in the evening going to sleep I thought it over, it occurred to me that I ought to have wondered at this. I thought that the snow storm might be a wicked magician Winter, the grey sky with running clouds - his kingdom. Every beautiful princess that refused to be his wife because he was very angry and cruel was turned by him into a crow. And then their tears he turned into the falling snowHakes. And only the coming of the kind Fairy Spring can destroy this magic. The realization of the teaching objectives mentioned above in this section is an important part of the first stage of supporting and developing the reasoning skills and creativity of the child. A map of cognitive transformations realized at this stage and the maps reflecting the next cognitive transformations can be found in [9, 10]. 5 A known four-level model of consciousness development It seems that the model proposed by Zelazo [32] can be considered as a good working instrument for studying the development of conscious control during the first -fourth years of childhood. This model, called the Levels of consciousness (LOC) model, emerged as a result of reflecting the experimentally discovered regularities of the development of conscious control of thought, action, and emotion. The model describes four transitions from one LOC to another, higher LOC, these transitions depend on age. Let us say about the zero LOC in case of newborn babies and very young children at the age less 11 - 12 months. Zelazo [32] characterizes the consciousness of this period as minimal consciousness; it is responsible for approach and avoidance behaviour based on pleasure and pain and is present-oriented, unreflective and doesn't operate with the Self-concept. The principal distinguished feature of LOC1 is the emergence of concepts and of the connections between the perceived objects and concepts (playing the role of labels of experienced objects). lOC1 is called by Zelazo [32] as the level of recursive consciousness. LOC2 emerges at the end of the second year, the essence of the transition from LOC1 to LOC2 consists in the emergence of symbolic thinking, in children's awareness of Self. The signs of LOC2 are the first use of personal pronouns by children, their self-recognition in mirrors. Besides, children feel first self-conscious emotions, first of all, shame. LOC3 is called by Zelazo as reflective consciousness 1, usually this level characterizes the consciousness of three-year-olds. The manifestation of this level is the ability of children to systematically use a pair of arbitrary rules (for instance, the object of big size and of small size) for sorting the pictures representing these objects. However, the executive function of three-year-olds is still limited, it was shown by the experiments with Dimensional Change Card Sort. For being successful in this game, children must integrate two incomparable pairs of rules into a single structure. This ability characterizes the LOC4, called by Zelazo [32] as reflective consciousness 2. Usually, LOC4 emerges by the end of the forth year, this level is also characterized by a spectrum of meta-cognitive skills. 6 Expansion of the levels of consciousness basic model in cognitonics It seems that the broadly felt necessity of promoting children's emotional and social competence in schools and the lack in the scientific literature of rather simple solutions to this problem are the grounds for putting forward the following conjecture: the levels of consciousness model proposed by Zelazo [32] indicates only some basic stages of consciousness development. The goal of creating appropriate theoretical foundations of promoting children's emotional and social competence will lead to discovering additional, higher stages of the child's consciousness development corresponding to mature emotional and social competence of the child. Realizing this idea, let's give a new interpretation of the methods of developing conscious control of thought, action, and emotion described in [9, 10, 14] and belonging to the System of the Methods of EmotionalImaginative Teaching. We'll suppose that these methods underpin the transition from the level of consciousness 4 (LOC 4) to LOC 5, from LOC 5 to LOC 6, and from LOC 6 to LOC 7. The new levels LOC 5, LOC 6, and LOC 7 will be respectively called the level of broad beauty appreciation, the level of appreciating the value of thought, and the level of enhanced awareness of social a^greements and social responsibility [11, 12]. A very short, preliminary description of these levels is as follows. Reaching LOC 5 by the person means that this person possesses a well-developed feeling of beauty in various manifestations: the beauty of a thing, of an idea, of an expression, of a picture or sculpture, of the interpersonal relationships, etc. [10, 14]. The successful transition from LOC 5 to LOC 6 means that (a) a child is aware of the fact that his/her ideas may be socially significant, i.e. the child may be appraised by the friends or adults for the originality and beauty of his/her idea; (b) a child appreciates the value of the thoughts of other persons [8, 13]. Reaching LOC 7 by a person means that this person is sufficiently mature in the social sense, i.e. possesses an enhanced awareness of social agreements and social responsibility [9, 10]. It should be underlined that modern preschool and school educational systems in various countries encourage only a rather small proportion of children to reach the 5th - 7th levels of conscious control. But to considerably increase this proportion is vitally important for successful socialization of children in information society. Happily, at least one broadly applicable way of solving this problem has been available since the 1990s, it is given by the EIT-system. 7 Two kinds of values imply different methods of teaching 7.1 Two kinds of values The human being is brought up in the own culture and imbibes the spirit of the culture he/she is brought up. On the level of the every-day communication and acting, the culture is revealed in the answers to the following questions: what you value, what you believe, and how you act. It is well known: "For where your treasure is there will your heart be also". It means that main values influence greatly the way a person perceives and processes the information, acquires knowledge, because the values emotionally colour every cognitive process. Figure 2: The updated configuration of the levels of consciousness model. The levels 1 - 4: P.D. Zelazo, 2004; The levels 5 - 7: V. Fomichov and O. Fomichova, 2013. A cognitive process includes analysis, estimation, forecast, decision making, and it is underpinned by a system of values. An educational process under the frame of Cognitonics takes into account the values of students in order to create an inspiring and creative atmosphere at the lessons. If the students share lofty ideas and sublime values, have aspiration to think and act in terms of public good and benefit to the society, then it is advisable to show, for example, the beauty of mathematical solutions and equations, the beauty and value of a thought, a metaphor, to show how one and the same idea is expressed by the language of painting ("Twilight. Moon" by I. Levitan) and natural language (the moment when Alice is dozing off in the book by Lewis Carroll "Alice in Wonderland"). If the students seek for pleasure and share the commercialized values, then their motivation is different: they take a decision here and right now without awareness of their responsibility for next generations and without gratitude to previous generations. It means that they don't consider themselves as a link between generations. In this case it is advisable to be logical, give clear solutions to the equations, do not give the so called "additional information", do not quote poetry. E.g., while explaining mathematics, try to avoid establishing the links between various languages and natural language. The atmosphere of a lesson and the way of presenting information will meet the expectations of the audience, and the process of information processing will be successful and arise curiosity [15]. 7.2 The main parameters of the values assessment The process of assessment is very delicate and can't be called a precise one. The main question the students have to answer to let teachers guess the direction of their way of thinking is as follows: whether it is my cup of tea. If Yes then whether it is good for me; if Yes then it evokes emotions and becomes thought and interest provoking. In case with the young, 6-8-year old children it is helpful to listen to their answers and considerations, paying special attention to the way they put the ideas, answering the following questions: (a) where did you spend your summer holidays; (b) what is your favourite dish cooked by your Mam or Great Mam for you; (c) what do you do when it is raining outside; (d) do you remember the gift Santa Claus presented you with last Christmas? (e) Do you have free time; (f) what is your favourite book: (g) can you give an example of your brightest impression; (h) what is beauty for you? (i) when do you feel yourself happy; (j) what you like to draw? The given answers, the way they are considering, the language they use reveal the atmosphere in which they brought up, the way they view the world around, the point of their interests, the things they are impressed by (remember the song "My favourite things" from the film "Sounds of Music"). While analysing the answers to questions, it is important to pay attention to the following: (a) whether they like dishes cooked by the mother or take away dishes? (b) if they spend summer in one and the same place, whether they are impressed by something? (c) whether children notice the change in the weather, whether they see only dirt (for example, in early spring) or notice dripping roofs, soaked roads, bluish-grey snow, and lots of "mirroirs" scattered everywhere by the spring to make the trees prepare for the spring blooming? (d) what kind of life situations do they appreciate, what makes them think, laugh, cry, feel compassion; (e) what impressed them and what makes them excited and expired; (f) what makes them happy? [15]. 8 How to split children into groups and let them shift from one group to another Let us start with an example. We have received two descriptions of the late autumn. The first one: "It is the time when the weather is getting colder, the day -shorter, the night - darker and longer, but there is no snow". The second one: "It is the time when the water is getting tired, and it means that the snow is near. "What is up?" - "The snow is up or perhaps down". The first child enumerates the signs which help him to understand that the winter will come soon. He acts as a observer, as a researcher, discovering the changes and establishing the links between a cause and a consequence. The second child reveals a poetic way of observing the nature, he uses the metaphor "tired water" in case he knows nothing about metaphors. It is just his way of viewing the world and establishing another kind of links, endowing everything with feelings. The way children perceive the world influences the type of material presentation: so called poetical or scientific. In both cases the curiosity is aroused, information processing ability and sound creativity are improved. Both cases aim at paying a special attention to improving the language skills. It is possible for children to shift from one group to another if the changes in the world perception are revealed. GROUP 1 i 1 ▼ The students with sublime values The students with not strictly defined values and aspiration to sublime values 1 GROUP 2 The students with material values The collection 1 of teaching methods The collection 2 of teaching methods Figure 3: The impact of value system on teaching methods and on a shift (concerning a part of students) from one group to another group. 9 Methods of assessing educational progress The swiftness of establishing the conceptual links between different thematic domains reflects the maturity of a cognitive mechanism. The process of studying and socialization aims, in particular, at constructing a great number of thematic subspaces in the world's conceptual picture of the child [15]. If the conceptual links are not activated while discussing various books, stories, while analysing information, taking a decision, then the child can't use his/her background knowledge. As a result, the processes of information processing, of taking a decision, of socialization become more complicated and very often mislead the child. The examples of constructing conceptual links between different thematic domains at the lesson during a 20-minutes active creative work are given on Figure 4 and Figure 5. V = 8/20 min = 0.40 Figure 4: The speed of forming conceptual links during one lesson (the first year of studies, the age of children is 6 years). After the fifth year of studies in experimental groups, the speed of forming conceptual links during one lesson exceeds the value 1. It means that during 20 minutes a group consisting of 21 - 23 students generates over 20 links between different thematic domains. In order to better understand the difference between computer-dependent and computer-independent thinking, we'll consider the essence of creative thinking with the help of a scheme of constructing creative cognitive pinnacles. Creative thinking suggests the ability of the student to create a new reality or transfigure the existing one. Computer dependent thinking means following the logic of the computer. In case of establishing the conceptual links between various application domains, the qualitative characteristic is defined by the quantity of the application domains linked together, on the one hand, and the remoteness of these application domains from one another (that is, the lack of the evident ties between the domains), on the other hand. A study of metaphoric thinking was carried out according to the logic described The beginning of autumn: a transfiguration of nature No music is heard except lullabies The 13th fairy is asleep The water is getting drowsy: no ripples, splashes, sounds The mice are snuffling in the holes Sleeping Beauty Nothing is astir; the movement has gone, the silence is left The Queen has fallen asleep The King has fallen asleep on the horse back The dogs are snoring in the kennels The cooks and cats are asleep in the kitchen The doves have fallen asleep on the window-sill Figure 5: The speed of forming conceptual links during one lesson of the second year of studies; the age of children is 7 years; V = 12/20 minutes = 0.60. Secondary Creative Pinnacles Waves of the universe left on the shore the big yellow amber to warm the travelers in the darkness and the small pieces of amber to make one of them fall down to the Earth in order to make some one's wish come true. A red downy cat is sleeping, hid her nose in her tail, curled near the milk spilt by her kittens. They called the spilt milk Milky Way. Figure 6: The examples of secondary creative pinnacles below. Step 1. Taking into account the initial metaphor and the number of the metaphors created in accordance with the initial model, the students reach the first creative pinnacle. It corresponds to a new metaphor being very different from the initial one. It is a result of the unexpected coincidence of the phenomena from two application domains. Example. Suppose that the initial metaphor is "The moon is a piece of cheese for the mice". Following this model, the young students generate a lot of metaphors, for instance, "The moon is a big round ice cream", "The moon is a pancake with a sour cream", "The moon is a piece of melon". Then one student reaches the following first creative pinnacle: "The moon is the silvery ball under the circus cupola. In the circus everyone is awaiting for his/her turn to appear on arena lit up with the millions of the sparkling starts scattered from that silvery ball. In the morning the moon will disappear, the stars will fade, and everyone will go for a work. The miracle happens only night". Step 2. The secondary creative pinnacles designate the appearance of a principally new metaphor based on the independent creative pinnacles (see Figure 6). The initial metaphor usually is a response to the request of a teacher. Then the process of creating metaphors goes on until a principally new conceptual metaphor is created (a creative pinnacle). For the researchers, the creation of secondary creative pinnacles is much more interesting. The existence of the tendency of the emergence of the secondary creative pinnacles and the development of the process of the creation reveal the speed and the quality of the development of the cognitive mechanisms. The maturity of the cognitive mechanisms is revealed in the ability of using metaknowledge. Unfortunately, computer dependent thinking reveals only the initial metaphor suggested by the computer and the process of creating metaphors according to a model. But it doesn't reveal creative pinnacles of any other levels, because of the lack of a vivid, lively, inspired atmosphere of discussion without computer support. Computer dependency blocks the ability of creating a new reality as a result of considering this activity as an excessive activity. The digital reality makes the computer and ICT overwhelming in numerous spheres of human's activity. It creates the illusion of a new step on the way of civilization. But the development of the civilization without spiritual development is the greatest distortion that diminishes the creative ability of the mind or transfers it into another form, a form of adjusting but not a kind of breakthrough. There should be two clear, well-balanced main subjects of the educational process of any level: (a) computer literacy, because ICT can directly contribute to human capabilities; computers and the Internet have a crucial influence on individual economic achievements and carrier development in the information society; (b) the development of the cognitive mechanisms of information processing and the improvement of the ability of metaphoric thinking, it leads to improving the serendipity. 10 How to achieve cognitive engagement of the students Cognit^i^e engagement can be defined as the process of highly motivated intellectual activity when the interest towards the subject under discussion is so strong that the students loose the track of time and, as a result, they are not tired. The students' interest determines the level of involvement. The emotional response is very close to inspiration, because they are making their own discoveries, and their mental efforts are appreciated. It helps to provide a conceptual learning environment instead of a memorization based one and enhances the motivation [15]. Cognitive engagement is characterized by the following things: - f^ocused attention; it means that within the first five minutes of a lesson the students have come to the conclusion: it is my cup of tea; - posit^i^e effect (how do you feel about it); it means that the second conclusion is as follows: "it is good for me"; - aesthetics; it means that the way the material is presented meets the expectations of the students, it can be compared with various communicative styles: while communicating, it is better to stick to one style; in this case, it won't disappoint the partner of communication and make the conversation an easy and pleasant business; if the values of the students are clear and they are split into the groups according to their values, then it is easier to arrange the presentation either in a more pragmatic or a more poetical way (metaphorical way); - endurability; it means that a student remembers a good experience and wants to repeat it; - novelty; it is present at every lesson and provides intellectual and spiritual nourishment; - reputation, trust, and expectation; the reputation of a teacher (his/her personal reputation and the professional one) suggests the situation when the students trust the teacher, appreciate his/her time and knowledge and act as the colleagues in the process of co-creation, still being aware of the distance between the teacher and the students, they respect this distance due to reputation of the teacher; in this case, the actions of both sides of the educational process meet the expectations of each other; - motivation; the motivation of the students is closely connected with their values; the human being can be called a biological anticipatory system; everyone answers the questions: "What is good for me and how to achieve the state of complete happiness?"; but everyone defines happiness in his/her own way according to his/her understanding of values; some students are happy if they receive excellent marks; others need not only excellent marks but the awareness of intellectual and spiritual maturity, broad outlook (unconsciously, they are searching for their calling); and only in this case their level of happiness is changed [15]. To achieve cognitive engagement is very important. On the one hand, it is a marvel, because the teacher and the students become colleagues in the process of co-creation and making decision and keep the distance between the students and the teacher which is underpinned by trust, respect, and appreciation. On the other hand, it is a well managed process of knowledge acquisition. This process is underpinned by the described above mechanism of starting up the creative process in the heads of the students and creating at a lesson a special, thought-provoking atmosphere providing an opportunity for the most effective knowledge acquisition and information processing. We have discovered the conditions under which this mechanism works well. The main condition is splitting students into different groups according to their values. The values are taken into account for creating an inspiring atmosphere, it is the most comfortable for knowledge acquisition. The students step by step receive serendipitous information: it is not expected but desirable and conduces to making their own discoveries. 11 Related approaches One of the central ideas of our approach to early socialization of children by means of introducing them, in an original way, to the humanities is to teach young children and adolescents to appreciate beauty in all its manifestations, in particular, the beauty of nature, painting, poetry, music, classical dance. The first additional level of consciousness (LOC) development introduced by us in comparison with the LOC model by P.D. Zelazo [32] is called the level of broad beauty appreciation (see Figure 2 in Section 6). This idea and very positive educational results obtained during 24-years-long study excellently correlate with the conclusions of a three-year study carried out by cognitive neuroscientists from seven leading universities in USA [17]. The latter study was led by Dr. M.S. Gazzaniga from the University of California at Santa Barbara. This study included, in particular, the following conclusions [4, 17]: An interest in a performing art contributes to the development of the sustained attention, and it is necessary both for improving performance and for the training of attention as a precondition of the improvement in other conceptual domains. There are special links (extending far beyond the music) between high levels of music training and the ability of processing information in both working and long-term memory. In particular, as concerns children, the success at the lessons of music develops the skills of solving geometrical problems. There are positive correlations between the regular lessons of music and both reading acquisition and learning of sequences. The training of complex actions improves the memory due to the learning of general skills for manipulating semantic information. The process of learning to dance by means of attentive and effective observation is close to the process of learning with the help of physical practice. The effective observational learning may contribute to the development of other cognitive skills. The educational program realizing the ideas of our informational-aesthetic conception of developing cognitive-emotional sphere of the learners includes, in particular, every-semester performance in each group. The significant components of each performance are classical dances, musical pieces, and songs. The observations accumulated during 24 years confirm the listed conclusions of the three-year long study described in [17]. However, some our theoretical ideas and obtained results considerably expand the conclusions formulated in [17]. First of all, it applies to the following discovery in cognitive biology, cognitive psychology, and cognitive linguistics done in the end of the 1990s: the consciousness of normal, average child at the age of five - six physiologically needs a rich language (much richer than it is broadly accepted to believe) for expressing the impressions from the beauty of nature (see [6, 7] and Section 4 of this paper). The next significant result is an original method of supporting and developing metaphoric thinking of the child (at lessons devoted to foreign language and to studying the symbolic languages of poetry and painting) as a basic tool for supporting and developing creativity of the child, for the realization of the child's Thought-Producing Self [8, 13, 14]. The method of reaching LOC7 (the level of enhanced awareness of social agreements and social responsibility) proceeds from the central idea of J.R. Searle [26] about natural language as the primary means of constructing social reality and considerably expands and works out in detail this idea, inscribes it into educational practice. 12 Conclusions This paper grounds the necessity of much earlier socialization of children in modern information society than it is usually done throughout the world. The paper sets forth the deep connections of cognitonics with the positive psychology movement. It is shown that cognitonics suggests a system of original, mindfulness-based educational methods supporting well-balanced cognitive-emotional development of the personality in modern information society, it is called the system of the methods of emotional-imaginative teaching (the EIT-system). The analysis of central ideas of the EIT-system provided the possibility to enrich developmental psychology: the basic model proposed by P.D. Zelazo (2004) considers 4 levels of consciousness development (corresponding to the age from one to four years), and this paper introduces three new levels (they cover the ages from five - six to 13 - 14 years). The paper presents a new look at the process of education when the values of the student act like a lighthouse for the teacher at the moment of presenting material and arranging the process of education, the process of acquiring knowledge. Four discoveries underpinning the proposed way of solving this problem are shortly described. This way is provided by the EIT-system belonging to the constructive core of Cognitonics. The described methods have been successful tested in the course of a longitude study covering 24 years of introducing young children and adolescents to the humanities. The EIT-system has been mainly realized at lessons of English as a foreign language for Russian-speaking children and at the lessons of poetry and literature in English, at lessons devoted to explaining the symbolic language of painting, the culture of communication, and the symbolic language of classical dance. These kinds of lessons are considered in numerous countries as highly appropriate for young children and teenagers. The carefully selected collection of texts used at lessons is provided by a number of classical, world-known fairytales and novels, in particular, "Snow White", "Cinderella", "Sleeping Beauty", "Pinocchio", "Pollyanna", "The Life and Adventures of Santa Claus" by L. Frank Baum, "Alice in Wonder Land" by Lewis Carroll, "The Wind in the Willows" by Kenneth Grahame, "The Hundred and One Dalmatians" by Dodie Smith, etc. That is why the EIT-system may be used (after a certain adaptation requiring a small time) in English-speaking countries and in numerous countries where the English language is learned as a second language. Acknowledgement We are grateful to our friend, Senior Software Researcher Alexander Artyomov for his generous help concerning the preparation of the electronic versions of the figures in this paper. References [1] Bohanec, M., Gams, M., Rajkovič, V. et al., Eds. (2009). Proceedings of the 12th International Multiconference Information Society - IS 2009, Vol. A, Slovenia, Ljubljana, 12 - 16 October 2009. The Conference Kognitonika/Cognitonics. Jozef Stefan Institute; http://is.ijs.si/is/is2009/zborniki.asp?lang=eng; pp. 427-470; retrieved 15.12.2013 [2] Bohanec, M., Gams, M., Mladenič, D. et al, Eds. (2011). Proceedings of the 14th International Multiconference Information Society - IS 2011, Vol. A, Slovenia, Ljubljana, 10 - 14 October 2011. The Conference Kognitonika/Cognitonics. Jozef Stefan Institute; , http://is.ijs.si/is/is2011/zborniki.asp?lang=eng; pp. 347-430; retrieved 15.12.2013 [3] Centre for the Study of Existential Risk (2013); http://cser.org; retrieved 24.09.2013. [4] Christofidou, A. (2013). Remembrance of things past, a research hypothesis. In Proceedings of the 16th International Multiconference Information Society - IS 2013, Slovenia, Ljubljana, 7 - 11 October 2013. The Conference Kognitonika/Cognitonics. Jozef Stefan Institute; pp. 409-412; http://is.ijs.si/is/is2013/zborniki.asp?lang=eng; retrieved 15.12.2013 [5] Fomichov, V.A., Fomichova, O.S. (1994). The Theory of Dynamic Conceptual Mappings and its Significance for Education, Cognitive Science, and Artificial Intelligence. Informatica. An International Journal of Computing and Informatics (Slovenia), 1994, Vol. 8, No. 2, pp. 31-148. [6] Fomichov, V.A., Fomichova, O.S. (1997). An Informational Conception of Developing the Consciousness of the Child. Informatica. An International Journal of Computing and Informatics (Slovenia), Vol. 21, pp. 371-390. [7] Fomichov, V.A., Fomichova, O.S. (2001). A Many-Staged, Humanities-Based Method of Realizing the Thought-Producing Self of the Child. Consciousness, Literature and the Arts, 2001, Vol. 2, No. 1; http://blackboard.lincoln.ac.uk/bbcswebdav/users/d meyerdinkgrafe/archive/fomichov.htm; retrieved 09.12.2013 [8] Fomichov, V.A., Fomichova, O.S. (2006). Cognitonics as a New Science and Its Significance for Informatics and Information Society; Informatica. An International Journal of Computing and Informatics (Slovenia), Vol. 30, pp. 387-398. [9] Fomichov, V.A., Fomichova, O.S. (2011). A Map of Cognitive Transformations Realized for Early Socialization of Children in the Internet Age, in M. Bohanec et al (eds.). Proceedings of the 14th International Multiconference Information Society - IS 2011, Ljubljana, pp. 353-357; http://is.ijs.si/is/is2011/zborniki.asp?lang=eng; retrieved 14.12.2013 [10] Fomichov, V.A., Fomichova, O.S. (2012). A Contribution of Cognitonics to Secure Living in Information Society; Informatica. An International Journal of Computing and Informatics (Slovenia), Vol. 36, pp. 121-130; www.informatica.si/vol36.htm#No2; retrieved 17.11.2013 [11] Fomichov, V.A., Fomichova, O.S. (2013a). The Peculiarities of the Mindfulness-Based Development of the Personality under the Frame of Cognitonics. In George E. Lasker and Kensei Hiwaki (Eds.) Personal and Spiritual Development in the World of Cultural Diversity, Vol. X. The International Institute for Advanced Studies in Systems Research and Cybernetics (IIAS), Tecumseh, Ontario, Canada, 2013, pp. 37-41. [12] Fomichov, V.A., Fomichova, O.S. (2013b). The Significance of Mindfullnes-Based Educational Methods Provided by Cognitonics for Positive Psychology Movement. M. Gams, R. Piltaver, D. Mladenič et al (Eds.), Proceedings of the 16th International Multiconference Information Society - IS 2013, Slovenia, Ljubljana, 7 - 11 October 2013. Vol. A. The Conference Kognitonika/Cognitonics. Jozef Stefan Institute, pp. 425-429; http://is.ij s. si/is/is2013/zborniki.asp?lang=eng; retrieved 14.12.2013 [13] Fomichova, O.S., Fomichov, V.A. (2000). Computers and the Thought-Producing Self of the Young Child; The British Journal of Educational Technology, Vol. 31, pp. 213-220. [14] Fomichova, O.S., Fomichov, V.A. (2009). Cognitonics as an Answer to the Challenge of Time; Proceedings of the 12th International Multiconference Information Society - IS 2009, Slovenia, Ljubljana, 12 - 16 October 2009. The Conference Kognitonika/Cognitonics. Jozef Stefan Institute, 2009, pp. 431-434; available online at http://is.ijs.si/is/is2009/zborniki.asp?lang=eng; retrieved 10.12.2013 [15] Fomichova, O.S., Fomichov, V.A. The Risk of Postponing Early Socialization of Smart Young Generation in Modern Information Society. M. Gams, R. Piltaver, D. Mladenič et al (Eds.), Proceedings of the 16th International Multiconference Information Society - IS 2013, Slovenia, Ljubljana, 7 - 11 October 2013. Vol. A. The Conference Kognitonika/Cognitonics. Jozef Stefan Institute. 2013, pp. 430-434; http://is.ijs.si/is/is2013/zborniki.asp?lang=eng; retrieved 11.12.2013 [16] Gams, M., Piltaver, R., Mladenič, D. et al., Eds. (2013). Proceedings of the 16th International Multiconference Information Society - IS 2013, Slovenia, Ljubljana, 7 - 11 October 2013. The Conference Kognitonika/Cognitonics. Jozef Stefan Institute; pp. 403-482; http://is.ijs.si/is/is2013/zborniki.asp?lang=eng; retrieved 15.12.2013 [17] Gazzaniga, M.S. (2008). Learning, Arts, and the Brain: The Dana Consortium Report on Arts and Cognition. NY, Washington, DC., Dana Press; http://www. wjh.harvard.edu/~lds/pdfs/DanaSpelke. pdf; retrieved 10.12.2013 [18] Good, I.J. (1965). Speculations concerning the first ultraintelligent machine. In F.L. Alt and M. Rubinoff (Eds.), Advances in Computers, Vol. 6, Academic Press, pp. 31-88. [19] Huebner, E.S., Gilman, R. (2003). Toward a Focus on Positive Psychology in School Psychology; School Psychology Quarterly; V. 18, pp. 99-102. [20] Hui, S. (2012). Cambridge to study technology's risk to humans. Associated Press, November 25, 2012; http://bigstory.ap.org/article/cambridge-study-technologys-risk-humans; retrieved 15.07.2013. [21] Kabat-Sinn, J. (2003). Mindfulness-based Interventions in the Context: Past, Present, and Future; Clinical Psychology: Science and Practice, V. 10, pp. 144-156. [22] Kurzweil, R. (2005). The Singularity Is Near: When Humans Transcend Biology, 2005. [23] Naughton, J. (2012). Could robots soon add to mankind's existential threats?". The Observer. 02 December 2012. Retrieved 12 March 2013; http://www.theguardian.com/technology/2012/dec/0 2/ai-robots-google-car-humans-john-naughton [24] Schonert-Reichl, K.A., Lawlor, M.S. (2010). The Effects of a Mindfulness-based Education Program on Pre- and Early Adolescents' Well-being and Social and Emotional Competence; Mindfulness, Vol. 1, pp. 137-151. [25] Screenager (2013). In Oxford Dictionaries. Language matters. English; http://www.oxforddictionaries.com/definition/ameri can_english/screenager; retrieved 05.12.2013 [26] Searle, J.R. (1995). The Construction of Social Reality. The Penguine Press. [27] Seligman, M.E.P., Csikszentimihalyi, M. (2000). Positive Psychology: an Introduction; American Psychologist, Vol. 55, pp. 5-14. [28] US Public Health Service. Report on the Surgeon's General's Conference on Children's Mental Health: a national action agenda (2000). Washington, DC., Department of Health and Human Services. [29] Vandewater, E.A., Rideout, V.J., Wartella, E.A., Huang, X., Lee, J.H., M.-S. Shim, M.-S. (2007). Digital Childhood: Electronic Media and Technology Use Among Infants, Toddlers, and Preschoolers. Pediatrics. Official Journal of the American Academy of Pediatrics, Vol. 119, No. 5, pp. 1006-1015. [30] Vinge, V. (1993). The Coming Technological Singularity: How to Survive in the Post-Human Era; http://www-rohan.sdsu.edu/faculty/vinge/misc/singularity.html; retrieved 24.11.2013 [31] Wallis, L. (2013). Is 25 the new cut-off point for adulthood? BBC News, 25 Sept. 2013; http: //www. bbc.co. uk/news/magazine-24173194; retrieved 27.09.2013. [32] Zelazo, P.D. (2004). The Development of Conscious Control in Childhood. Trends in Cognitive Sciences, Vol. 8, pp. 12-17. Semantic Searching of Biological Documents Using Gene Ontology Marwa Mostafa1, Enas M.F. El Houby2 and Akram Salah1 1Faculty of computers and information, Cairo University, computer science department 5 Dr. Ahmed Zewail, Orman, Giza, Egypt E-mail: m.mostafa_fci@yahoo.com, akram.salah@fci-cu.edu.eg 2Systems & Information Department, Engineering Division, National Research Centre, Dokki, Giza, Egypt E-mail: em.fahmy@nrc.sci.eg Keywords: information retrieval, biological ontology, ranking methodology, precision, recall Received: September 21, 2013 Semantic information retrieval of biological documents is an information retri^^al approach that utilizes semantics to improve the search recall and precision. This research presents a framework for a semantic biological retrieval system that effectively searches and retrieves meaningful results using Gene Ontology. The system takes two related biological terms as an input and retrieves relevant documents which contain these inputs. Since the user searches for the documents that contain two related biological terms, the system helps the user to know the hierarchical relationship between these two terms using Gene Ontology. The system utilizes the Gene Ontology to infer semantically related terms to the inputs. The inferred words may include synonyms, parents and grandparents of the input terms entered in the search query. The system uses these related inferred terms in expanding the user query to produce meaningful results since it retrieves the documents that contain the input terms and these inferred related terms. The system uses a ranking methodology to help in ordering the retrieved documents based on the rank values. The proposed technique improves the precision of the retrieved documents as well as the recall which saves researcher time and focus. Povzetek: Razvita je metoda iskanja bioloških dokumentov z uporabo genskih ontologij. 1 Introduction The biological repositories contain hundreds of thousands of electronic collections that often contain high quality information [1]. During the past years, the increase in scientific knowledge and the massive data production have caused an exponential growth in the number and size of biological databases and repositories. However, data size, which can reach hundreds of gigabytes, involves serious problems of data access through data storage in local disks. Other challenging issues associated to biological data are that much relevant information is spread out in different databases or repositories [2]. So the biological data is still locked in a large number of resources; remaining not computer-readable. In the current search engines when the user enters two terms it returns a lot of documents including unhelpful ones. Keyword-based search is currently the most commonly employed search strategy in biomedical digital libraries. When users search by a few keywords, a large number of matched results could be returned. Users spend a significant amount of time to browse these results to find out those documents they are truly interested in because the publications returned may not be organized based on the user needs, forcing users to browse thousands of publications. In most cases, it is impossible for users to manually read every returned entry thus leads to loss of many truly relevant publications [3]. The goal of an information retrieval (IR) system is to rank documents optimally given a query to rate the relevance of documents. In order to achieve this goal, the system must be able to score documents so that the relevant document would ideally have a higher score than the irrelevant one [4]. Most of the current forms of web content are designed to be presented to humans; they are not understandable by computers. The semantic web aims at enhancing existing web content with semantic structure in order to make it meaningful to computers as well as to humans. Ontology plays a key role in the semantic web [5], [6] which offers an advanced approach for managing, retrieving information and processing it. Ontology is a formal conceptualization of a particular domain into a human understandable, machine-readable format [7].One of the most important bio-ontology is Gene Ontology [8]. It organizes terms in a parent-child hierarchy. Our first publication about this framework was "Ontology based Biological Information Retrieval System" (OBIRS) [9] which shows how we improved the efficiency of the method used in the system algorithm. The proposed system presented in this paper uses Gene Ontology to infer semantically related terms to the input terms. The inferred terms may include synonyms which are useful in retrieving documents by authors who use different wording in reference to the same concept. The system also infers related terms through parent-child relationship up to 2 levels (parents and grandparents) for each term of the input terms to expand the search query. The proposed system helps the researchers to get more relevant and accurate retrieval of the documents. It allows the researchers to enter two related terms to get the documents that contain both of them. Also the system semantically retrieves the documents if they contain synonyms of the input terms inferred from the Gene Ontology even if these documents do not contain the exact phrase of the input terms. Also the system retrieves the documents that contain the input terms and/or synonyms with any combination of the other inferred terms (parents and grandparents). The system searches for two related terms because its main idea is to retrieve documents that contain relation among related biological terms and we found that the least number of possible terms to find a relation between is two. The system uses a ranking methodology that helps in ranking the retrieved documents to achieve the researchers satisfaction and save time and effort consumed by the researchers to rate the relevance of documents manually. The system groups the retrieved documents into five classes to save the time of the researchers. The system also extracts the relation between the input terms from the gene ontology to give the researchers the hierarchical relation between them. The remainder of the paper is organized as follows: in section 2, an overview for the previous work related to our subject is presented. In section 3, the architecture of the proposed system is described. In section 4, ranking issues are explained. In section 5, an example is introduced to illustrate the proposed system. In section 6 relationship extraction is explained. In section 7, testing the system and the results are produced, before drawing conclusions and future work in section 8. 2 Related work A lot of previous work was studied for the subject of semantic web and biological information retrieval. Sumithiradevi et al.[10]proposed one such tool called BIOMINING that is designed to eliminate anomalous and redundancy in biological web content. The authors use indexing and mining technology on biological databases to summarize the information of biological data in the document. Zhou et al. [11] designed a biological information retrieval and analysis system (BIRAS) based on the Internet. The system could send and receive information from the Entrez search and retrieval system maintained by the National Centre for Biotechnology Information (NCBI) in USA. Marta Bleda et al.[2]proposed the "CellBase" that provides a solution to the growing necessity of integration by easing the access to biological data. CellBase implemented a set of RESTful web services that query a centralized database containing the most relevant biological data sources. Minlie Huang et al.[12]proposed Ontology-based biological relation extraction system to automatically extract biological relations from a huge number of online MEDLINE abstracts. Authors then made Ontology-based semantic annotation of online biological documents. Analia Lourengo et al. [13] present BioDR which is a novel approach that allows the semantic indexing of the results of a query by identifying relevant terms in the documents. This system makes it possible to navigate semantically between documents and relevant terms, taking advantage of the rich contents of full-text. Many other researchers [1], [14], [15], [16], [17], [18], [19] used ontologies, inverted list (different tech.) and query expansion to assist biological information retrieval search. After reviewing several researches that support the retrieval of biomedical information it is our conclusion that the most similar to our system is[12]. However the previous reviewed researches aim to study the design of a biological information retrieval and analysis systems using the Internet. These systems are designed to eliminate anomalies and redundancy in biological web content, integrate biological database, retrieve biological information and extract relations. Our proposed system is a biological semantic retrieval system that tries to improve the recall and precision of the retrieved documents and helps the researchers to get the relevant documents that contain information and relationships between two related biological terms. The system retrieves documents that contain the terms as well as other semantically related terms inferred from the Gene Ontology. The system also ranks the retrieved documents based on their relevance to the input terms. The system retrieves the content of the document, not its address, unlike other retrieval systems. It is our assumption that retrieving relevant documents that contain information about two related biological terms entered from the researchers and ranking them should save the researchers time and effort. 3 Proposed system Testing our previous system presented in [9] shows that there are many documents that satisfy the researcher's needs and have not been retrieved. Many biological terms have synonyms and it is possible to have a document that contains synonyms of the two terms entered in the search query or that contains one term and the synonym of the other term during the searching process. These documents have not been retrieved to the researchers in spite of its relevancy to the query. Also the retrieved documents have not been ranked appropriately. That was a motivation for improving the effectiveness of the previous system since there are many documents that semantically may satisfy the researchers needs and have not been retrieved. The system (EOBIRS) presented in this paper is the Enhanced Biological Information Retrieval System which is the updated version of the previous system that highlights the importance of the synonyms of the searched terms and retrieves documents from corpus even if they have the synonyms only and doesn't contain the same wording of the terms entered in the search query because semantically they are reference to the Pail ofrelatedbiological temis/'Tl.T2'l _Searching process_ EaiM EOBER-S . Prt-Proctssiiig Data Ba»« o «It Ontology Corpu«. of retrieved «iocxinwnl» D Grouping «1— same concept. The system presented in this paper retrieves documents that contain the two entered terms, their synonyms and their parents up to two levels (parents and grandparents). The system presented in this paper has the same system design like the previous system. It also has the same pre-processing instructions like [9] and differs in the searching process instructions, ranking criteria, grouping criteria and reordering the classes based on the balance value that adds another facility to minimize the time and effort of the researchers. The system architecture is shown in Figure 1. Figure 1: System architecture. the system begins to check if the "DBGenes" contains these terms or not. The search process starts if the two terms exist in the "DBGenes". The system gets all the synonymous for both terms from the normalized database "DBGenes". The system gets all the parents up to two levels (parents and grandparents) for both terms from the normalized database "DBGenes". The system expands the query "term1 AND term2" using synonymous provided from the Ontology as well as parents and grandparents using "is_a"relation that describes the parent-child relationship. The query will expanded as follow: b. c. d. 3.1 Pre-processing i. The system normalizes the Gene Ontology to a database named "DBGenes". The database "DBGenes" contains all genes that exist in the Gene Ontology with their attributes such as name, id, definition, synonymous, is_a and part_of. ii. The system builds a dictionary file that contains all the biological terms exist in the normalized database. iii. The system builds inverted list based on the biological terms only that exist in the corpus's documents. The system compares all terms exist in the corpus's documents with the biological terms exist in dictionary file, so a term added to the inverted list if it was found in the dictionary file. The terms were added to the inverted list with a list of the documents that contains these terms and the positions of the terms and the frequencies of their appearance in each document. 3.2 Searching process a. The researcher enters the two related biological terms that he/she wants to search for. Where the system searches for unique identifiers for biological terms, If we assume that the two related biological terms entered to the retrieval system are G1 and G2. The set of synonyms are later called a synset. If the two synsets are GS1={gs11, gs12,^,gs1m} andGS2= {gs21,gs22,^,gs2n}, and if the gene parents are GP1= {gpn, gp12,^,gp1i} and GP2 ={gp21,gp22,^,gp2j}, and if the gene grandparents are GGP1= {ggp11, ggp12,^,ggp1k} and GGP2={ggp21,ggp22,^,ggp2l}, the query will expanded into these queries: Qi retrieves all the documents that contain the two related biological terms and/or the synonyms and their parents and their grandparents. Qi=[((G1)OR(gs11ORgs12OR^gs1m))AND(gp11OR gp12OR_,gp1,) AND(ggp11OR ggp12,OR^,ggp1k)] AND [((G2) OR (gs21ORgs22OR^gs2n))AND(gp21OR gp22OR_,gp2j)AND (ggp21ORggp22OR_,ggp2l)] Q2 retrieves all the documents that contain the two biological terms and/or the synonyms with their parents or grandparents. Q2=[((G1)OR(gs11ORgs12OR^gs1m))AND((gp11O Rgp12OR_,gp1,) OR (ggp11 ORggp12 OR^,ggp1k))] AND [((G2) OR (gs21ORgs22OR^gs2n))AND((gp21OR gp22OR_gp2j) OR (ggp21 OR ggp22 OR ^,ggp2l))] Q3 retrieves all the documents that contain the two terms or their synonyms or one term and the synonym of other term. Q3= [(G1) OR (gs11ORgs12OR^gs1m)] and [(G2) OR (gs21ORgs22OR^gs2n)] The expanded query will be: Q = Q1 OR Q2 OR Q3 e. The system uses the inverted list to get the list of the documents that satisfy the query Q. This list will contain the document's names that contain the two terms (G1 and G2) and/or any combination of the related terms inferred from the Gene Ontology. f. The system calculates the rank value of each document which used to order the retrieved documents. The system ranks the documents under a certain criteria: • The initial value of ranking of the document is the count of occurrence of the two terms multiplied by weight W1. • Finding synonyms of any of the two terms increases the value of ranking by adding W2 of the number of their occurrence. • Finding a parent or grandparent of any of the two terms increases the value of ranking by adding W3 of the number of their occurrence. The rank value will be calculated as follow: Rank value = [(F(T1)+F(T2))*W1]+[(F(ST1)+F(ST2))*W2]+ [(F(PT1)+F(PT2)+ F(GPT1)+F(GPT2)) * W3] (1) Where T1 and T2 are the input terms ST1, ST2 are the synonyms of the input terms, PT1, PT2 are the parents of the input terms and GPT1 and GPT2 are the grandparents of the input terms. F is to count the number of occurrence of the terms. Supposed that: W1> W2> W3 We supposed that W1 to be greater than W2 because we assume that the weight of input terms must be greater than that of synonyms this is due to that we should give a strong concern to the input terms entered by the researcher. We think that the researcher is more concern about the retrieved documents that contain the exact wording of the input terms than the documents that contain the synonyms of the input terms. We choose W2 to be greater than W3 because the existence of the synonyms means the existence of the input terms so we assume that the weight of finding a parent or grandparent must be less than the weight of finding a synonym. The existence of a parent or grandparent adds another prove that the retrieved document talks about the input terms but in the same time it still doesn't represent the same meaning of the input terms so we cannot give it a weight equal to the synonyms. g. The system retrieves from corpus the documents resulted from the query Q ranked by the system ranking values. h. The system calculates the value of the precision and recall of the retrieved documents. i. The system extracts the relation between the two related terms from the Gene Ontology and presents it to the user as additional information about hierarchical relation of the two terms in addition to that mentioned in the documents. j. The user can open any of the retrieved documents and notice the two terms that he/she searches for are highlighted. 4 Ranking Issues During testing the system issue has been released "what about if the user wants to get specific documents as the first outcome in the list of the retrieved documents for example the documents that contain terms and their parents only". Because of this issue we have added a grouping option that the system provides to the user in addition to a list of the whole documents. The list is grouped into the following classes: Class one: Provides all the documents that each one contains the two related biological terms and/or their synonyms and their parents and grandparents. Class two: Provides all the documents that each one contains the two related biological terms and/or their synonyms and their parents. Class three: Provides all the documents that each one contains the two related biological terms and/or their synonyms and their grandparents. Class four: Provides all the documents that each one contains the two related biological terms only. Class five: Provides all the documents that each one contains the synonyms of the two related biological terms only or one term and the synonym of other term. Each class can be ordered based on the frequencies of the two related biological terms under search with concern of the balancing between the frequency of term 1 and the frequency of term 2 to ensure that the documents contain material that tackle the relation between the two terms. The system calculates the absolute value of the difference between the frequency of term1 and the frequency of the term2 in the document. It orders the documents based on the balance value, the document is ordered first if it has low balance value. If there are two or more documents having the same balance value then they will be ranked based on the summation of the "term frequency" value of term 1 and the "term frequency" Figure 2: Ranking workflow. value of term2. The ranking workflow is shown in Figure 2. For illustration: If we have a list of relevant retrieved documents from class four that contains Term1 and Term 2 as shown in Table 1. Table 1: "term frequency" values of the two terms and the calculation of the balance value for each document. Document number Term 1 Term 2 Balance value (absolute value of the difference) D1 2 2 0 D2 4 4 0 D3 2 1 1 D4 6 5 1 D5 4 1 o 3 D6 5 19 14 The system will calculate the balance for this class as shown in Table1. The list of the documents will be ordered based on balance value as shown in Figure 3. Figure 3: List of documents ordered based on balance values. Then the system reorders the documents that have the same balance values based on the summation of the "term frequency" values of both term1 and term2. So for D1 and D2: Documents D1 D2 Balance value 0 0 Summation value 4 8 Figure 4: The comparison between term frequency values of D1 and D2. Based on the calculation in Figure 4 the system will rank D2 higher than D1 because the summation of the "term frequency" values of term 1 and term2 in D2 is greater than their summation in D1. For D3 and D4: Documents D3 D4 Balance value Summation value 3 11 Figure 5: Comparing term frequency values of D3 and D4. Based on the calculation in Figure5 the system will rank D4 higher than D3 because the summation of the "term frequency" values of term1 and term2 in D4 is greater than their summation in D3. The system will present the documents for the user as shown in Figure 6. Figure 6: List of documents ordered by term's frequency values. 5 An example To show how the system searches and orders the documents we present the following example, supposing that W1=1, W2= 0.8, W3=0.25. If the researcher searches for two terms, Term1: "regulation of DNA recombination"and Term 2: "mitochondrion inheritance "and the corpus contains a list of the following documents: D1: mitochondrion inheritance and regulation of DNA recombination are biological_process in the Gene Ontology. D2: mitochondrion inheritance has synonyms and regulation of DNA recombination doesn't have synonyms. Mitochondrial inheritance is a synonym of mitochondrion inheritance and organelle inheritance is a parent for it. D3:mitochondrion inheritance and regulation of DNA recombination have parents. Regulation of DNA metabolic process is a parent of regulation of DNA recombination. Mitochondrial inheritance is a synonym of mitochondrion inheritance and organelle organization is a grandparent for it. D4: regulation of DNA recombination is a biological process. Mitochondrion inheritance and regulation of DNA recombination have parents. Organelle organization is a grandparent of mitochondrion inheritance. D5: Gene Ontology contains regulation of DNA recombination and mitochondrion inheritance. Regulation of DNA recombination is a biological_process. Regulation of DNA recombination is any process that modulates the frequency, rate or extent of DNA recombination. Regulation of DNA recombination is a subset of gosubset_prok. Regulation of DNA recombination has only one parent. Recombination regulates has a relationship with regulation of DNA recombination. Regulation of DNA recombination has "intersection_of^' relation with biological regulation and DNA recombination regulates. We can find regulation of DNA recombination in Gene Ontology version1.2.The id of regulation of DNA recombination is GO:0000018 in the Gene Ontology. Mitochondrion inheritance is a biological_process. Mitochondrion inheritance is the distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton. D6: this document talks about regulation of DNA recombination and mitochondrion inheritance. Regulation of DNA recombination is any process that modulates the frequency, rate or extent of DNA recombination. D7: Gene Ontology contains genes. D8: mitochondrial inheritance is a biological_process in the Gene Ontology. D9: Organelle inheritance is a biological_process in the Gene Ontology. Table2: The calculations of documents rank values for the presented example. ee me eb um cu on Q Term frequency e (D al o ) > (U